This is the third of three stories looking at the Cancer Genomics Cloud pilot proposals selected by the National Cancer Institute.
NEW YORK (GenomeWeb) – The system proposed for the National Cancer Institute's Cancer Genomics Cloud pilot initiative by a Broad Institute-led research team will make use of analysis infrastructure developed at the Broad, incorporate software for analyzing genomic data in parallel, and leverage an application programming interface that will connect software, data, and compute power.
Specifically, Gad Getz, director of the Broad's Cancer Genome Computational Analysis group and a co-principal investigator of the $7 million NCI contract, told BioInform that the system will provide a cloud-enabled version of the Broad's Firehose analysis infrastructure, and will make use of the API being developed and used by the Global Alliance for Genomics and Health's data working group. Getz is leading the Broad team together with Matthew Trunnell, chief information officer at the Broad, and Anthony Philippakis, who is a researcher at the Broad and a cardiologist at Brigham and Women's Hospital.
The system will also use an open source resource called ADAM that was developed at the University of California, Berkeley; efforts there will be led by David Patterson, a professor of computer science. ADAM provides a data format, APIs, and command line tools for analyzing genomic data in distributed environments. Meanwhile, investigators at the University of California, Santa Cruz will work on adapting the GA4GH API for the planned cloud system and implementing it there.
Like the other systems being developed for the NCI pilots, this system will have a protocol for users to upload their datasets and share them as well as to plug in and run their own internally developed algorithms and applications on the cloud. The partners are finalizing their decision on which specific cloud provider to use, Getz said. He said that they plan to use a commercial cloud but declined to state which specific vendor the partners are considering. The two other pilot contract winners, the Institute for Systems Biology and Seven Bridges Genomics, are using the Google and Amazon clouds, respectively.
According to a brief description provided on the developers' website, Firehose coordinates the flow of terabyte-scale datasets through various algorithms in a manner that supports research reproducibility, automation, and high throughput. It does so by version-tracking samples and algorithms and storing this and related information in a relational database management system.
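As a rough illustration of what version-tracking samples and algorithms in a relational database can look like, here is a minimal sketch in Python using the standard library's sqlite3 module. It is not Firehose's actual schema or code: the table names, column names, and helper functions are all hypothetical, chosen only to show how each run could be tied back to the exact algorithm and sample versions that produced it.

```python
import sqlite3

# Hypothetical schema: these tables are illustrative, not Firehose's own.
SCHEMA = """
CREATE TABLE algorithm_version (
    name TEXT NOT NULL, version INTEGER NOT NULL, parameters TEXT NOT NULL,
    PRIMARY KEY (name, version));
CREATE TABLE sample_version (
    sample_id TEXT NOT NULL, version INTEGER NOT NULL, location TEXT NOT NULL,
    PRIMARY KEY (sample_id, version));
CREATE TABLE run (
    run_id INTEGER PRIMARY KEY AUTOINCREMENT,
    algorithm_name TEXT NOT NULL, algorithm_version INTEGER NOT NULL,
    sample_id TEXT NOT NULL, sample_version INTEGER NOT NULL);
"""

def open_tracking_db(path=":memory:"):
    """Create a fresh tracking database (in memory by default)."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def _next_version(conn, table, key_column, key):
    # Table and column names come from this module only, never from user input.
    query = f"SELECT COALESCE(MAX(version), 0) + 1 FROM {table} WHERE {key_column} = ?"
    return conn.execute(query, (key,)).fetchone()[0]

def register_algorithm(conn, name, parameters):
    """Store a new version of an algorithm and return its version number."""
    version = _next_version(conn, "algorithm_version", "name", name)
    conn.execute("INSERT INTO algorithm_version VALUES (?, ?, ?)",
                 (name, version, parameters))
    return version

def register_sample(conn, sample_id, location):
    """Store a new version of a sample and return its version number."""
    version = _next_version(conn, "sample_version", "sample_id", sample_id)
    conn.execute("INSERT INTO sample_version VALUES (?, ?, ?)",
                 (sample_id, version, location))
    return version

def record_run(conn, algorithm_name, algorithm_version, sample_id, sample_version):
    """Record which algorithm version was applied to which sample version."""
    cursor = conn.execute(
        "INSERT INTO run (algorithm_name, algorithm_version, sample_id, sample_version) "
        "VALUES (?, ?, ?, ?)",
        (algorithm_name, algorithm_version, sample_id, sample_version))
    return cursor.lastrowid

def provenance(conn, run_id):
    """Return the exact algorithm and sample versions behind a run."""
    return conn.execute(
        "SELECT a.name, a.version, a.parameters, s.sample_id, s.version, s.location "
        "FROM run r "
        "JOIN algorithm_version a ON a.name = r.algorithm_name "
        "AND a.version = r.algorithm_version "
        "JOIN sample_version s ON s.sample_id = r.sample_id "
        "AND s.version = r.sample_version "
        "WHERE r.run_id = ?", (run_id,)).fetchone()
```

In this sketch, re-registering an algorithm under the same name creates a new version rather than overwriting the old one, so an earlier result can still be traced to the parameters that generated it, which is the property that reproducibility depends on.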
Firehose also includes an API that gives users programmatic control of routine tasks; encapsulates data and algorithm parameters using abstract annotations, which enables it to run analyses in a "data-blind manner"; and encapsulates jobs within an "abstract execution engine … [which] enables them to be transparently dispatched to a single machine or across many compute nodes." It returns analysis results in biologist-friendly reports that are organized like research papers, complete with an overview and details of results, methods, and references; these reports can be cited in the public literature with digital object identifiers.
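The idea of an abstract execution engine can be sketched in a few lines of Python. This is an illustration of the general pattern only, not Firehose's implementation: the class names, the Job structure, and the count_variants task are hypothetical, and a local thread pool stands in for a cluster of compute nodes.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Job:
    """A unit of work: a task plus the named annotations it consumes."""
    task: Callable[..., Any]
    annotations: Dict[str, Any]

class ExecutionEngine(ABC):
    """Workflows call run() and never learn where the jobs executed."""
    @abstractmethod
    def run(self, jobs: List[Job]) -> List[Any]:
        ...

class SingleMachineEngine(ExecutionEngine):
    """Runs every job one after another in the current process."""
    def run(self, jobs: List[Job]) -> List[Any]:
        return [job.task(**job.annotations) for job in jobs]

class WorkerPoolEngine(ExecutionEngine):
    """Runs jobs concurrently; threads stand in for compute nodes here."""
    def __init__(self, workers: int = 4) -> None:
        self.workers = workers

    def run(self, jobs: List[Job]) -> List[Any]:
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            futures = [pool.submit(job.task, **job.annotations) for job in jobs]
            # Collect results in submission order so both engines agree.
            return [future.result() for future in futures]

def count_variants(sample_id: str, variants: List[str]) -> tuple:
    """Hypothetical analysis step: count the variants reported for a sample."""
    return (sample_id, len(variants))

def run_workflow(engine: ExecutionEngine, jobs: List[Job]) -> List[Any]:
    """The workflow code is identical whichever engine is supplied."""
    return engine.run(jobs)
```

Because run_workflow depends only on the ExecutionEngine interface, swapping SingleMachineEngine for WorkerPoolEngine changes where the work happens without touching the workflow itself, which is the sense in which dispatch is "transparent."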
Firehose was designed specifically to meet a need at the Broad for a system that could handle the large genomics datasets generated at the institute and would be accessible to the various stakeholders there — biologists, computational biologists, mathematicians, and so on — allowing them to analyze and use the data in their projects, according to Getz. It's been used to explore information from the Cancer Genome Atlas, among other datasets — the Broad is one of the TCGA Genome Data Analysis Centers (GDAC) and it was tasked with developing tools to help researchers process and integrate data and analyses.
On one hand, Firehose is workflow management and execution software that, with a few clicks, sets up and runs complex workflows involving multiple operations and algorithms and delegates tasks to the different compute nodes in a cluster, Michael Noble, the assistant director for data science for the Cancer Genome Analysis group at the Broad, explained to BioInform. A second component of the system is the actual data and the algorithms and tools that are used in the workflows and pipelines, and these can be swapped out to create versions of Firehose that support activities in different domains, he said.
Noble manages the version of Firehose that's used by the TCGA's GDAC at the Broad, but there are at least three other instantiations of Firehose that are run on projects at the Broad, he said. One other version supports TCGA sequencing activities, another supports clinical research sequencing efforts, and a third is used to analyze data from the Genotype-Tissue Expression Project. For the NCI pilot, the researchers will be adapting the version of Firehose that's used within the TCGA GDAC, Noble said.
The goal now under the NCI pilot is to make those internally available capabilities accessible to the broader cancer research community, Getz said. Adapting Firehose to run on the cloud, he said, will involve some recoding and revamping of the software to enable it to run in its new environment and to scale up as needed to support more users and different use cases.
Part of the adaptation process will involve getting Firehose to run in a more abstract fashion that takes advantage of the much larger set of resources available in cloud infrastructure, Noble said. The partners will also work to free the system from any dependencies on local file systems and directories that may have crept in over time as it has been used at the Broad, so that it can run independently of the environment in which it resides.
Researchers at UC Santa Cruz, meanwhile, will work primarily to adapt the GA4GH API for the planned cloud system and implement it there, as well as to standardize the ways in which genomes and complex variation, such as structural variants, are represented, David Haussler, a professor of biomolecular engineering at UC Santa Cruz and the co-chair and co-founder of the GA4GH's data working group, told BioInform. Haussler also led a team that worked on a separate but related NCI-funded initiative called the Cancer Genomics Hub, a data repository at UC Santa Cruz that was created to house data from TCGA, the Therapeutically Applicable Research to Generate Effective Treatments program, and other projects.
Additionally, as part of the NCI pilot, Haussler and colleagues involved with the GA4GH's data working group and its associated task teams will work with all three cloud pilot developers to ensure that each group's system is able to share data and results and to communicate with the others and with the Genomic Data Commons, a separate but related NCI-funded initiative operated by the University of Chicago. Google already has implemented the GA4GH API and other providers plan to implement it soon, according to a presentation by the data working group at the GA4GH plenary meeting on Oct. 18.
The Broad-led team will work with the 2.5 petabytes of TCGA data required by the NCI, as well as with the microRNA and methylation sequencing datasets and clinical data collected as part of the projects, Getz said.