The newly launched International Cancer Genome Consortium, which plans to resequence around 25,000 cancer genome samples over the next 10 years, stands to generate an unparalleled amount of data — a fact that is leading its informatics team to employ a different kind of data-management solution than those used by earlier large-scale genome projects.
Like earlier international initiatives such as the Encyclopedia of DNA Elements, the SNP Consortium, and the International HapMap Project, the ICGC will have a dedicated Data Coordination Center. Unlike those projects, however, the ICGC’s DCC will not serve as a central repository for all the data from the project; instead, it will oversee a network of federated repositories that each participating center will maintain on its own.
Leading the DCC for the project will be Lincoln Stein, director of informatics and biocomputing at the Ontario Institute for Cancer Research in Toronto, which serves as the ICGC’s headquarters.
Stein has experience with ambitious genome initiatives. In his previous position as a researcher at Cold Spring Harbor Laboratory, he led the data-management components of both the SNP Consortium and the HapMap project.
“However, the ICGC project is at least an order of magnitude more complex — and a couple of orders of magnitude larger in sheer size — than either HapMap or the SNP Consortium or, in fact, any of the projects that I’ve worked on,” Stein told BioInform this week.
“In talking this over with the various working groups, it was decided that a centralized data coordinating center really wasn’t going to be an effective or practical design decision,” he said. “So instead, what we are proposing is a distributed system in which each of the data-generation centers is really responsible for its own laboratory information-management system and its own database of results, but [will] put the information into a form where it can be combined with other projects, other data generators from the ICGC collaboration.”
Under this model, each of the participating centers will host a local “franchise” database that will share a common data model and structure. The DCC will provide the schema and software for these franchise databases, but each data producer will manage its own local database.
According to the ICGC, this model gives participants the flexibility to develop their own project-specific data models and workflows, but ensures that any data that is shared across the project will be consistent.
“At regular intervals, a subset of the information contained in the project-specific databases will be exported into a local ICGC franchise database, which will implement a uniform simplified data model that captures the essential data elements that are needed to implement ICGC-wide policies on data release, quality control, and milestones,” the consortium explains in a white paper outlining its goals and policies.
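As a rough illustration of that export step, the following Python sketch uses SQLite; the table names, columns, and selection rule are assumptions for illustration, not the actual ICGC data model:

```python
import sqlite3

# Hypothetical illustration: schema and field names are assumptions,
# not the actual ICGC data model.
src = sqlite3.connect(":memory:")  # stands in for a center's project-specific LIMS database
dst = sqlite3.connect(":memory:")  # stands in for the local ICGC "franchise" database

src.executescript("""
    CREATE TABLE lims_specimens (
        specimen_id TEXT, tumor_type TEXT, qc_passed INTEGER,
        internal_batch TEXT, freezer_slot TEXT  -- center-specific fields
    );
    INSERT INTO lims_specimens VALUES
        ('SP-001', 'pancreatic', 1, 'B7', 'F3-12'),
        ('SP-002', 'pancreatic', 0, 'B7', 'F3-13');
""")

# The franchise schema keeps only the shared, essential data elements.
dst.execute("CREATE TABLE franchise_specimens (specimen_id TEXT, tumor_type TEXT)")

# Periodic export: copy only the subset that meets project-wide criteria
# (here, specimens that passed local QC).
rows = src.execute(
    "SELECT specimen_id, tumor_type FROM lims_specimens WHERE qc_passed = 1"
).fetchall()
dst.executemany("INSERT INTO franchise_specimens VALUES (?, ?)", rows)

print(dst.execute("SELECT * FROM franchise_specimens").fetchall())
```

The center-specific columns never leave the local database; only the uniform subset crosses into the franchise schema.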
“What this lets us do is to have a front end, or in fact several front ends, for the community to use, which makes it appear as if all of the data is sitting in one database,” Stein said. “They can do distributed queries based on a set of genes, or a set of variations, or a type of a tumor, or a combination of histopathology and patient clinical information, and get the data in the same format, no matter which center generated it.”
This federated approach eliminates the need to aggregate all the data at the OICR, “which would be unfeasible, just due to the sheer size of it as well as the fundamental bandwidth limitations of the Internet,” Stein said.
Tim Hubbard, head of informatics at the Wellcome Trust Sanger Institute and a member of the ICGC data-management working group, told BioInform that the federated model “is kind of an indicator of the way bioinformatics is going and has to go, simply because of the data scale.”
Hubbard compared the scope of the initiative to another recently launched large-scale genomics effort, the 1000 Genomes Project, which is expected to generate around 60 times more data than has been placed in public repositories over the last 25 years and which has already surpassed the amount of sequence data in GenBank.
For projects of this scale, he said, “you could make a completely new database, but you’d be replicating chunks of infrastructure that already exist.” In addition, Hubbard noted, “If you build another database to put your data in when it’s identical to somebody else’s, they’re going to have to write two interfaces to interoperate with that. And when the data is at this sort of scale, it may be that they just can’t combine those datasets, simply because they can’t move them around.”
This amount of data raises another issue that Hubbard said the bioinformatics community will soon need to consider for projects like the ICGC and 1000 Genomes: “We’re probably going to have to move to systems where people can do remote compute near to the data, rather than bring the data to them, at least for some aspects of slicing the data.”
A ‘Roadmap’ for Harmonization
The ICGC, which officially launched two weeks ago, serves as an umbrella organization for existing and future cancer genome projects worldwide. Its goal is to harmonize these projects to avoid duplication and ensure that researchers around the world can make the most of the data.
Some projects, like the Wellcome Trust Sanger Institute’s Cancer Genome Project or the National Institutes of Health’s Cancer Genome Atlas, are already underway. These initiatives will continue their work but will contribute their data to the ICGC and follow its guidelines.
Other participating members include the OICR, the Chinese Cancer Genome Consortium, France’s Institut National du Cancer, India’s Department of Biotechnology, Ministry of Science & Technology, Japan’s National Cancer Center, and the Genome Institute of Singapore.
Each consortium member is responsible for conducting a “comprehensive, high-resolution analysis of the full range of genomic changes in at least one specific type or subtype of cancer,” the ICGC said in a statement. Each of these projects is expected to require at least 500 patient samples and cost around $20 million. The ICGC is compiling a list of around 50 cancer types and subtypes that participants should focus on; at 500 samples per project, those 50 projects account for the roughly 25,000 genome samples the consortium plans to resequence.
The consortium plans to catalog a range of genomic mutations in these cancer types, including SNPs, insertions, deletions, copy number changes, translocations, and other chromosomal rearrangements. Centers will also generate gene-expression and DNA-methylation data, and will have the option to perform other types of analyses, including proteomic, metabolomic, and immunohistochemical.
Participants are expected to perform whole-genome sequencing on these samples when sequencing technologies are shown to be “robust and affordable” — a milestone that the ICGC expects to see in two to five years. In the meantime, centers are asked to sequence coding exons and other genomic regions of “particular interest for point mutations,” and to use high-density genotyping arrays to determine copy number, loss-of-heterozygosity, and breakpoint information.
OICR’s Stein stressed that the ICGC isn’t a monolithic cancer genome project, but rather is developing “a roadmap for a whole series of cancer resequencing projects that will come out in a staggered way.” He said that the consortium, “at its core, is an agreement reached among working groups … from every participating center on uniform standards for the data collection and processing.”
As a result, he said, “there is a minimum set of criteria that a data set has to meet in order to become part of the ICGC.”
For example, he said, participants will need to provide histopathological confirmation that a specimen contains a tumor, as well as tumor and matched normal tissue from each patient. In addition, he said, “both the normal and tumor specimens have to be characterized by microarray-expression analysis, SNP genotyping for copy number variation analysis, and gene-based resequencing.”
Some of the existing cancer genome data sets, such as the TCGA data, “meet some but not all of those requirements,” Stein said. “So in order to retrofit those into ICGC, those projects will need to go back and do the initial characterizations that are needed.”
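In skeletal Python, checking a legacy dataset against that minimum criteria set could look like the following sketch; the criterion names and record fields are illustrative assumptions, not the ICGC's actual checklist format:

```python
# Hypothetical minimum-criteria check; names are assumptions for illustration.
REQUIRED = {
    "histopathological_confirmation",
    "tumor_specimen", "matched_normal_specimen",
    "microarray_expression", "snp_genotyping", "gene_resequencing",
}

def missing_criteria(submission):
    """Return the criteria a dataset still needs before it can join the ICGC."""
    return REQUIRED - submission["completed"]

# A dataset that meets some but not all requirements, as Stein describes,
# would need to go back and fill in the gaps.
legacy_dataset = {"completed": {"tumor_specimen", "matched_normal_specimen",
                                "gene_resequencing"}}
print(sorted(missing_criteria(legacy_dataset)))
```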
Officials from TCGA could not be reached for comment before press time.
Building on Existing Tools
In order to give users the impression that they are viewing all the distributed ICGC data via a single portal, the DCC will develop a “backend” database that will use the same data model as the individual franchise databases.
“This effect can be achieved either via a physical mirroring process in which the coordination backend pulls in copies of each of the franchise databases at regular intervals, or via a pass-through system in which queries directed at the coordination backend are multiplexed among the individual franchise databases,” the ICGC white paper states.
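In rough Python terms, the two strategies the white paper describes can be contrasted as follows; the FranchiseDB class, its records, and the center names are assumptions for illustration only:

```python
# Hypothetical sketch of the two coordination-backend strategies.
class FranchiseDB:
    """Stands in for one center's franchise database (illustrative)."""
    def __init__(self, center, records):
        self.center = center
        self.records = records  # list of dicts sharing a common schema

    def query(self, **criteria):
        # An empty criteria set matches every record.
        return [r for r in self.records
                if all(r.get(k) == v for k, v in criteria.items())]

franchises = [
    FranchiseDB("CenterA", [{"gene": "KRAS", "tumor": "pancreatic"}]),
    FranchiseDB("CenterB", [{"gene": "TP53", "tumor": "breast"},
                            {"gene": "KRAS", "tumor": "breast"}]),
]

# Strategy 1: physical mirroring -- periodically pull full copies of every
# franchise database into the coordination backend.
def mirror(franchises):
    return [r for f in franchises for r in f.query()]

# Strategy 2: pass-through -- multiplex each incoming query across the
# franchises and merge the results on the fly.
def passthrough_query(franchises, **criteria):
    return [r for f in franchises for r in f.query(**criteria)]

print(passthrough_query(franchises, gene="KRAS"))
```

Either way, the user sees one uniform result set, which is what lets the portal make the distributed data appear to sit in a single database.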
Stein said that his team is still “prototyping various architectures” for this system to make sure that it has the performance characteristics that the project will need. He said a first version of the portal should be available for ICGC researchers in about a year and a half.
The backend database will likely be built on top of the BioMart data-integration system, which was developed by Arek Kasprzyk at the European Bioinformatics Institute. Kasprzyk has recently joined the OICR as director of bioinformatics operations and will be handling most of the “day-to-day work” on the ICGC DCC, Stein said.
Sanger’s Hubbard cited BioMart, as well as the Distributed Annotation System originally developed by Stein, as successful examples of the federated model.
“Federation is hard to do, but there are existing federation systems out there that we know work, so we know it’s possible,” he said. “It might be slightly harder work initially than just gathering the data in one place, but it’s a lot more scalable.”
The ICGC project will also try, whenever possible, to use “off-the-shelf visualization tools including GBrowse and Ensembl,” Stein said, but he noted that the project will also require some custom development.
“Among other things, we’ll need a way of querying the patient database … so that we can pull out molecular characterization of specimens that differ in terms of clinical presentation — [for example,] aggressive tumors versus nonaggressive tumors, chemotherapy-responsive tumors versus non-responsive tumors,” Stein said.
The DCC will also need to ensure that the ICGC portal effectively manages authorization. To reduce the risk of patient identification, datasets will be organized into two categories: open and controlled-access. Open-access datasets contain only data “that cannot be aggregated to generate a dataset unique to an individual,” and will be publicly accessible, the ICGC said in its white paper.
Controlled-access datasets contain genomic and clinical data associated with a unique individual, and will be available only to researchers who agree not to try to identify or contact donor subjects and not to redistribute the controlled data.
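A minimal Python sketch of that two-tier policy follows; the dataset labels and the approval list are invented for illustration and are not the ICGC's actual access system:

```python
# Hypothetical two-tier access check; all names are illustrative assumptions.
OPEN, CONTROLLED = "open", "controlled"

datasets = {
    "summary_mutation_frequencies": OPEN,        # cannot identify an individual
    "per_patient_genotypes":        CONTROLLED,  # unique to a person
}

# Researchers approved by any participating country's review process
# gain access to the entire controlled set.
approved_researchers = {"dr_chen"}

def can_access(researcher, dataset):
    tier = datasets[dataset]
    return tier == OPEN or researcher in approved_researchers

print(can_access("anyone", "summary_mutation_frequencies"))  # open tier
print(can_access("anyone", "per_patient_genotypes"))         # not approved
print(can_access("dr_chen", "per_patient_genotypes"))        # approved
```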
“Each country has its own standards for reviewing researchers’ bona fides and having them sign that contract, because obviously the standards are going to differ from country to country,” Stein said, “but once the researcher is approved by any of the governments then they have access to the whole set.”
The DCC will also coordinate quality-control checks of ICGC data. At its heart, Stein said, the ICGC initiative is an effort to ensure that cancer genome data in the public domain has been vetted in some way.
“The principle is that there is already a huge amount of variation between tumors, even in the same major tumor type. There’s great heterogeneity even within a single tumor itself. And we don’t want to confound that by contributing variation that’s due to which center analyzed it,” he said.
The details of the QC process are still under discussion, but Stein said there will be built-in quality-control tests at each center, such as ways to confirm that a specimen marked as “male” has Y-specific markers in the genome data.
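Such a built-in check might look like the following Python sketch; the marker names and record fields are assumptions for illustration, not the ICGC's actual QC tests:

```python
# Hypothetical sex-marker sanity check of the kind Stein describes.
Y_MARKERS = {"SRY", "USP9Y"}  # example Y-chromosome-specific markers

def sex_marker_check(record):
    """Flag specimens whose reported sex contradicts the Y-specific
    markers observed in the genome data."""
    has_y = bool(Y_MARKERS & set(record["observed_markers"]))
    if record["reported_sex"] == "male":
        return has_y
    return not has_y

consistent   = {"reported_sex": "male", "observed_markers": ["SRY", "BRCA1"]}
inconsistent = {"reported_sex": "male", "observed_markers": ["BRCA1"]}

print(sex_marker_check(consistent), sex_marker_check(inconsistent))
```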
In addition, he said, “there will be exercises in which the same specimen is analyzed by two different centers and the results are compared to see what the inter-center variation is, and there may well be a center that does nothing but QC and re-analyzes a certain percentage of the specimens that are done by other centers as a quality check.”
Stein stressed that the project “is very much in its early stages” and is likely to face a few speed bumps over the next 10 years. “I’m sure that we’ll look back at this five years from now, and a lot of our assumptions will have turned out to be wrong. So you can expect there to be a lot of mid-course corrections, both on the informatics side and on the wet lab side,” he said. “I’m expecting the unexpected.”