NEW YORK (GenomeWeb) – In his presentation at the Bio-IT World conference last week, Lincoln Stein, director of informatics and bio-computing at the Ontario Institute for Cancer Research and a professor of molecular genetics at the University of Toronto, discussed efforts to build the Cancer Genome Collaboratory, a cloud-based infrastructure that will host genomic data collected by the International Cancer Genome Consortium (ICGC) and offer bioinformatics tools for analysis.
Last week, the Canadian government said that it will provide C$7.3 million ($6.7 million) through the Natural Sciences and Engineering Research Council of Canada and its partners — Genome Canada, the Canada Foundation for Innovation, and the Canadian Institutes of Health Research — to support the Collaboratory's development. In addition, the University of Chicago is providing $500,000 as well as computing resources for the project.
This funding is expected to last for five years and will support the establishment of two data centers, one at the University of Toronto and one at the University of Chicago, Stein told BioInform this week. The ICGC plans to begin testing the infrastructure next year and to make it more broadly available to the research community in 2016.
Participants in the project will use these funds to build centralized infrastructure that will host the genetic data coming from the ICGC, along with computing resources that the research community can use to mine and analyze data from 25,000 patients from around the world. The consortium's aim is to characterize tumor genetic data from 500 patients for each of the major cancer types over 10 years. As of February this year, its portal contained data from more than 10,000 donors who have contributed to 42 projects at 18 sites, including 4.12 million somatic mutations, 49,000 copy number variants (CNVs), and 6,341 methylation profiles. Raw sequencing reads from the project are available from the European Genome-Phenome Archive, which is maintained and run by the European Bioinformatics Institute. Interested users can apply for access through the ICGC's Data Access Compliance Office (DACO).
When the ICGC project wraps up in 2018, it expects to have collected an estimated 10 to 15 petabytes of information from more than 50,000 genomes, a figure that includes data from both tumor and normal samples. All of this will be hosted by the Cancer Genome Collaboratory, beginning with genomic information from about 2,000 pairs of whole-genome tumor/normal samples that are being studied as part of one ICGC study dubbed the Pan-Cancer Whole Genome Analysis (PAWG) project. In that project, researchers are studying activity in the non-protein-coding portions of the cancer genome, which make up about 95 percent of the tumor genome. There are 130 research projects and 16 working groups involved in the PAWG in areas such as novel mutation calling, structural variations, clinical translation, and evolution and heterogeneity. So far, the researchers have collected data from about 1,000 PAWG patients.
The PAWG project is expected to generate about 500 terabytes of read data and about 100 gigabytes of variant call data, Stein said in his talk, and these data are initially being housed at six cloud compute centers maintained and run by teams at the University of Chicago; the German Cancer Research Center; the European Bioinformatics Institute; the Barcelona Supercomputing Center; Japan's Institute of Medical Sciences and RIKEN; and South Korea's Electronics and Telecommunications Research Institute.
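To put these storage figures in perspective, the short Python sketch below works out the average data volume per genome implied by the numbers quoted above. It is a rough back-of-envelope calculation, not an ICGC estimate: it assumes decimal units (1 petabyte = 1,000 terabytes), treats "more than 50,000 genomes" and "about 2,000 pairs" as exact counts, and ignores the fact that the totals mix several data types. The function and variable names are illustrative only.

```python
# Back-of-envelope storage arithmetic using only figures quoted in the article.
# Assumption: decimal units (1 PB = 1,000 TB = 1,000,000 GB).

GB_PER_TB = 1_000
GB_PER_PB = 1_000_000
MB_PER_GB = 1_000


def average_size(total: float, items: int) -> float:
    """Average amount of data per item, in the same unit as `total`."""
    return total / items


# Full ICGC collection: 10 to 15 PB from more than 50,000 genomes.
# Because the genome count is a lower bound, these are upper-bound averages.
icgc_low_gb = average_size(10 * GB_PER_PB, 50_000)
icgc_high_gb = average_size(15 * GB_PER_PB, 50_000)

# PAWG subset: about 500 TB of read data from about 2,000 tumor/normal pairs.
pawg_gb_per_pair = average_size(500 * GB_PER_TB, 2_000)
pawg_gb_per_genome = pawg_gb_per_pair / 2  # each pair holds two genomes

# PAWG variant calls: about 100 GB across the same 2,000 pairs.
variant_mb_per_pair = average_size(100 * MB_PER_GB, 2_000)

print(f"ICGC overall: {icgc_low_gb:.0f} to {icgc_high_gb:.0f} GB per genome")
print(f"PAWG reads:   {pawg_gb_per_pair:.0f} GB per pair, "
      f"{pawg_gb_per_genome:.0f} GB per genome")
print(f"PAWG variant calls: {variant_mb_per_pair:.0f} MB per pair")
```

Under these assumptions, the overall collection averages roughly 200 to 300 gigabytes per genome, and the PAWG read data about 250 gigabytes per tumor/normal pair. The variant calls are smaller by several orders of magnitude, at about 50 megabytes per pair.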
Besides aligned raw reads from whole genomes and exomes, these centers will also host information from RNA and bisulfite sequencing as well as inferred data from these experiments such as normalized expression levels and splicing patterns, Stein said. Also included will be clinical data from the donors, as well as the specimens collected for analysis and samples derived from these specimens, for example cell lines from primary tumors, he said. The centers will also provide access to mutation information including germline variant calls where available and possibly array data.
These data will eventually be moved into the Collaboratory and will be available when the infrastructure goes into testing roughly this time next year. When the portal is made more broadly available, researchers will be able to access the information through the existing channels for the ICGC data. Currently, data are stored in two tiers. The first is a public tier that hosts somatic mutations, clinical data, sample data, and pathology data. The second is a controlled-access tier that contains information that could be used to identify project donors and is accessible only to users authorized by the ICGC's Data Access Compliance Office.
Bioinformatics tools and applications developed for use in the PAWG will also be available at launch. These will be offered as virtual machines and will cover alignment, variant calling, pathway analysis, and more, Stein said. There will also be tools for exploring more specific research questions, such as examining tumor heterogeneity and predicting the functional consequences of variations. Furthermore, the Collaboratory developers have set up a team to work on security measures. They expect to develop, among other things, techniques to make genetic profiles anonymous without losing relevant details, as well as methods of structuring research queries to ensure that they can be processed through secure storage sites. The goal here, Stein said, is to ensure that the infrastructure obtains a Federal Information Security Management Act moderate rating and complies with HIPAA in areas such as user authentication.
The ICGC plans to have the Collaboratory tested next year by researchers who regularly access and use large quantities of genomic data, which should give the ICGC a realistic idea of the capacity needed to support heavy users. The reasoning is that if the system can meet the needs of these power users, it should be able to support anyone in the larger community when it opens up in two years. The list of testers includes researchers at the University of California, Santa Cruz; the Broad Institute; the Sanger Institute; the University of British Columbia; and McGill University.
The beta test is also an opportunity to determine how much it costs to operate the Collaboratory and to establish a pricing scheme that will support the project for the long haul, Stein said. By the time the funding provided for the first five years runs out, the ICGC hopes that the Collaboratory will have adopted a self-sustaining business model for providing storage and services to the community. Under that model, researchers who want to use the data could download what they need for their projects at no cost, while researchers who take advantage of the cloud resources would be charged for the hardware they use.
In preparation for the ICGC data, researchers at the University of Toronto are modifying their existing infrastructure, which currently uses OpenStack and tools from Sage Bionetworks, to include a metadata database that was developed in house for tracking files and donor sample information. They are also exploring multiple options for file transfer protocols, Stein said.
The second center at the University of Chicago will use the existing Bionimbus cloud infrastructure to host its share of the data. It is also through this connection with the University of Chicago that the ICGC intends to make information from The Cancer Genome Atlas (TCGA) project available as part of the Collaboratory, with the appropriate authorization from the TCGA DACC, Stein told BioInform.
The University of Chicago is an authorized redistributor of TCGA data, which for regulatory reasons can only be redistributed from US-based centers, and researchers there are hoping to be one of three sites selected to host a Cancer Genomics Cloud pilot. That initiative, funded by the National Cancer Institute, aims to build sustainable computing infrastructure to access and analyze genomic and related data from large-scale cancer research projects like TCGA. The agency began accepting proposals in January and expects to issue the awards by September this year.
The ICGC has no plans at present to build additional data centers.