CHICAGO (GenomeWeb) – As the International Cancer Genome Consortium moves toward its goal of categorizing 25,000 tumor genomes by 2018, a cloud-based informatics tool to support computational research on this data is quickly maturing.
The Cancer Genome Collaboratory, a cloud hosted by the Ontario Institute for Cancer Research in Toronto, underpins a public infrastructure for analysis of cancer genome data sets. Funded mostly by the Canadian government, it went into general production in June following three years of development and testing.
This infrastructure now boasts 2,600 computing cores and 8 petabytes of storage, according to Lincoln Stein, head of adaptive oncology at OICR, a professor of molecular genetics at the University of Toronto, and lead principal investigator for the Cancer Genome Collaboratory. Plans call for expanding it to 4,600 computing cores and nearly 15 PB of storage.
The data set now holds whole genomes from about 2,400 donors and another 2,000 exomes, totaling 650 terabytes, Stein added.
As of September, the collaboratory was supporting 31 distinct projects and 73 registered users from nine countries. "These are exclusive of our internal users from the beta test period," Stein explained. During the test period, five PIs used the infrastructure for work on tumor evolution and heterogeneity, variant-calling methods, drug targeting, indexing, and compression, Stein noted.
The Cancer Genome Collaboratory infrastructure follows the Global Alliance for Genomics and Health's application programming interface to move genomic data around from multiple sources, based on the needs of collaboratory participants.
"We have developed a simple-to-use, but fast and secure, data-transfer tool that imports genomic data from cloud object storage into the user's compute instances," OICR bioinformatician Junjun Zhang wrote in a poster presented at the Cold Spring Harbor Laboratory genome informatics conference early last month on New York's Long Island.
The poster said that the collaboratory had "successfully demonstrated interoperability" with an instance of The Cancer Genome Atlas data set at the University of Chicago's Bionimbus Protected Data Cloud and with several ICGC data sets hosted on Amazon Web Services.
The poster noted that the Pan-Cancer Analysis of Whole Genomes alone produced more than 800 terabytes of sequence alignments, variants, and interpretation data on upwards of 2,800 patients.
"A data set of this size requires months to download and significant resources to store and process," the poster said. "By making the ICGC data available in cloud compute form in the collaboratory, researchers can bring their analysis methods to the cloud, yielding benefits from high availability, scalability, and economy offered by cloud services, avoiding large investment in compute resources and eliminating time for download."
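The poster's "months to download" figure can be checked with a rough calculation. The sketch below estimates transfer time for the 800-terabyte Pan-Cancer data set. Only the data set size comes from the poster; the link speeds and the 70 percent sustained-throughput factor are illustrative assumptions, not figures from OICR.

```python
# Back-of-the-envelope transfer time for an 800 TB data set.
# Only the 800 TB size comes from the article; the link speeds and the
# efficiency factor below are illustrative assumptions.

DATASET_TB = 800
BITS_PER_TB = 8 * 10**12  # decimal terabytes, 8 bits per byte
SECONDS_PER_DAY = 86_400


def download_days(size_tb, link_gbps, efficiency=0.7):
    """Return the days needed to move size_tb terabytes over a link of
    link_gbps gigabits per second, sustained at the given fraction of
    the nominal line rate."""
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    effective_bits_per_second = link_gbps * 10**9 * efficiency
    seconds = size_tb * BITS_PER_TB / effective_bits_per_second
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical link speeds, chosen only to span a plausible range.
    for gbps in (0.1, 1, 10):
        days = download_days(DATASET_TB, gbps)
        print(f"{gbps:>4} Gbit/s: {days:8.1f} days")
```

Under these assumptions, a 1 Gbit/s connection would need roughly 106 days, which is in line with the poster's estimate of months; a slower link would take years.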
Prospective users must apply for Cancer Genome Collaboratory access on the project's website by describing their proposed research. Researchers also need ICGC permission to access the controlled tier of that organization's data, which generally takes a week or two, according to Stein.
"Once a researcher is approved to use it, they use it on a cost-recovery basis," Stein said. The collaboratory charges about one-third of what AWS does for cloud access, though the program is still adjusting its pricing. "We're a not-for-profit, and that is basically supporting the cost of our hardware maintenance," he noted.
"The type of analyses that the data set has been used for is, for example, a very detailed survey of the amount of heterogeneity in tumor genomes," Stein said. The group working on this has found that more than 97 percent of cancer genomes have multiple subclones, he said.
Using clocklike signatures of mutational processes, the same group has been able to do "genomic archeology and figure out when the earliest mutations occurred" in a tumor, Stein continued. "In most patients, the changes are a decade prior to diagnosis. In some patients, you can go back 50 years and find the earliest mutations in the tumor," he said.
"I don't want the collaboratory to take credit for this, but the resource has been used to make these discoveries."
The collaboratory infrastructure has also shown promise for benchmarking mutation-calling methods and supporting development of new methods to predict the functional impact of genetic mutations.
Stein said that internal collaboratory users have published 33 papers in the last year. "I'm expecting to see lots of publications arising over the next year," he said.
The system has the funding to operate for another year and a half, though OICR is looking for other public funding sources.
"At the end of our initial grant period, we're not going to be self-sustaining," Stein said. "We'll probably be sustaining about a quarter of our costs, but we have accumulated cost savings over the course of the project." He said they are working with Compute Canada, a nationwide network of high-performance computing centers, to ensure that OICR can maintain the Cancer Genome Collaboratory over a longer time period.
"The project has been on time and under budget," Stein said. He also pointed to a "healthy inflow of users" but said that the Cancer Genome Collaboratory could support more because it is running at only about 30 percent of capacity and plans to increase its computing power.
"For users who need more compute resources than we have, we have made sure that the resource is compatible with the Amazon API so that a researcher who writes his code inside the collaboratory can move it to Amazon with little disruption, and we have created a series of software tools that allows the data sets to be moved back and forth among multiple clouds in a very transparent fashion," Stein said.