This is the first of three stories looking at the Cancer Genomics Cloud pilot proposals selected by the NCI.
NEW YORK (GenomeWeb) – Researchers at the Institute for Systems Biology and their partners from Google and SRA International have received a roughly $6.5 million contract from the National Cancer Institute to develop one of three sets of infrastructure for the Cancer Genomics Cloud pilots, an NCI initiative to build sustainable computing infrastructure for accessing and analyzing genomic and related data from its funded research projects.
The ISB-led team, which is proposing a system based on Google's cloud, is one of three groups selected to receive cost-reimbursement contracts to develop cost-effective, sustainable cloud-based compute and storage systems that address the limitations of the current infrastructure used to manage and analyze data from large-scale NCI-funded projects.
In addition to the contract awarded to the ISB-led team, the NCI awarded more than $7 million to the Broad Institute, which is partnering with the University of California, Santa Cruz, to develop its planned system. The third contract went to Seven Bridges Genomics, whose funding is just shy of $5.9 million. These amounts include base costs and all options, according to the grant announcement. The final allocations were based on cost estimates that the applicants provided in their proposals.
For its part, ISB is proposing a system built on the Google cloud infrastructure that will offer both programmatic and web-based access to the data, Ilya Shmulevich, an ISB professor and principal investigator on the NCI contract, told BioInform. This project extends an existing relationship between the ISB and Google that dates back to 2012, when Google first rolled out its Google Compute Engine. Google tapped the Shmulevich group to evaluate the infrastructure's ability to handle life science computing requirements and adapted software his lab had developed to analyze TCGA data to run on the newly minted system.
For this new endeavor, the partners plan a platform based on Google's cloud that leverages Google Genomics' application programming interface (which also includes an implementation of the Genomics API developed by the Global Alliance for Genomics and Health's data working group) for storing, processing, querying, exploring, and sharing data. It will also have a tractable web interface through which less informatics-savvy researchers can interact with and explore the data, Shmulevich said. Researchers will also use scalable, reliable virtual machines and storage infrastructure provided by Google, as well as its familiar collaboration resources and services, including Google Docs and Google Hangouts.
The third partner in this triad, SRA International, will contribute security, testing, and documentation to the pilot, Shmulevich said. SRA's expertise in these areas comes at least in part from its years of working on multiple federally funded projects, including The Cancer Genome Atlas. In an email to BioInform, SRA's Senior Director of Bioinformatics, John Greene, said that his firm "looks forward to using our deep knowledge of the TCGA data acquired over the last seven years of running the Data Coordinating Center for that project to help Ilya's strong team … demonstrate the increasing value of doing large-scale biological data computations on a public cloud platform." David Pot, SRA's director of bioinformatics, has already begun "assembling our part of the team to deal with security and testing," he added.
Researchers will also be able to upload their own private datasets and explore them in the context of the larger public information that will be available in the cloud. ISB and its collaborators intend to include not just the TCGA core data in their infrastructure but also all the orthogonal data types, including gene expression and clinical data, as well as data from the 1000 Genomes project and the GlaxoSmithKline cancer-cell-line data set, Shmulevich said.
For the purpose of the pilots, participants are only required to show that their systems can handle the TCGA's 2.5 petabytes of data plus one orthogonal data type, but ISB and its partners intend to provide a much richer and more comprehensive resource for the community to try out. Google's cloud is certainly capable of handling that much data and more, and ISB routinely processes large quantities of data of different kinds in its capacity as one of the TCGA's Genome Data Analysis Centers, Shmulevich noted, so the partners are well equipped to design a system that meets these requirements and scales as needed.
The NCI's Board of Scientific Advisors and the National Cancer Advisory Board first approved the Cancer Genomics Cloud pilots in June 2013, following a detailed presentation explaining the concept. The presentation was delivered by George Komatsoulis, who at the time was NCI's chief information officer and interim director of its Center for Biomedical Informatics and Information Technology. Anticipating petabytes of data from the TCGA and similar projects, and responding to data access and use barriers such as limited local compute and protracted download times, the agency set out to create a communal resource that addresses these issues by providing co-located computational capacity and storage as well as APIs that connect software, data, and compute resources.
In January this year, the NCI issued a broad agency announcement (BAA) that laid out in more detail the pilots' research and technical objectives, architecture and eligibility requirements, and proposal expectations, including budgetary requirements. The institute also hosted a conference call and webcast that allowed members of the academic and commercial communities to give feedback on the document and ask questions. The release of the BAA officially launched the six-week proposal-collection phase for the pilots, and at the time, the NCI estimated that it would spend approximately $20 million on the three or more contracts.
In response to the BAA, the NCI received "many strong proposals," Anthony Kerlavage, branch chief of informatics programs at the NCI's Center for Biomedical Informatics and Information Technology, told BioInform last week, but the three that it ultimately selected were "superior to the others in our technical evaluation and in our opinion of the ones that would bring the best value to the NCI."
The authors of the winning proposals now have six months to complete their initial designs and begin developing their platforms. After that, they'll move into the first of two nine-month option periods, during which they have to complete and implement their systems. By the time the second option period gets underway, the systems should be fully operational and ready to be evaluated by the NCI and the broader cancer research community. In total, development, testing, and evaluation of the clouds should take about 24 months.
Because each proposal adopts a unique infrastructure development approach and leverages different solutions, these pilots provide a golden opportunity to assess multiple alternatives in tandem and to see which single system or which combination of systems best democratizes access to NCI's datasets and is sustainable in the long run, Kerlavage said. In addition, there's an opportunity to start exploring mechanisms for making related NCI informatics initiatives interoperable, for example, linking the Cancer Genomics Cloud pilots to the Genomic Data Commons (GDC), he noted. The grant for developing the GDC was awarded to the University of Chicago in May this year. Interoperable infrastructure is also a major part of the NCI's informatics strategy, so as part of this process, all four teams may contribute to efforts to define common application programming interfaces that link data and infrastructure across sites, he said.
Shmulevich expressed similar sentiments about working with the other awardees on the interoperability issue and added that his team also intends to engage the broader cancer research community in its efforts. To encourage participation during the evaluation phase of the pilots, the ISB team will use a quota system to give community members free cloud credits for compute and storage, which they can use to put the system through its paces. "That's the real test [of whether] this is going to be useful infrastructure," he said.