BOSTON (GenomeWeb) – The Broad Institute of MIT and Harvard today announced partnerships with Amazon Web Services, IBM, and Microsoft to implement the current version of the Genome Analysis Toolkit (GATK 3.5) software package on their respective cloud platforms later this year.
Last summer, Broad partnered with Google to offer access to the GATK on the company's cloud and released an alpha version of the system to the community for testing at that time. The institute has now signed new agreements with the other cloud providers, but will also continue to make the software available for download for customers who prefer to run it in-house. Broad is also partnering with Illumina to offer GATK on the BaseSpace cloud starting in late 2016.
Having GATK run on the different clouds offers users an alternative to installing the software locally and gives them a choice of computing platforms on which to run their projects.
"There are currently more than 31,000 registered users of the Broad Institute's GATK. The vast majority set up an extensive local compute and storage infrastructure to process the huge amount of information required to conduct genomic analyses," Eric Banks, Broad's senior director of data sciences and data engineering, said in a statement. "These collaborations will provide new options that can remove traditional barriers of scale while offering the same high level of data quality."
Users should be able to start accessing all of the cloud options later this year, the Broad said, though the exact release dates are yet to be determined. The first releases will include existing tools for analyzing germline whole genomes, Banks told GenomeWeb. The partners will eventually add more pipelines from the current iteration of the software to the cloud, including ones for processing arrays, exomes, and RNA sequences, and for somatic variant calling, he said.
So far, the partners have completed the GATK implementation on the Google cloud, and an alpha version has been available to whitelisted users since last year, though it is not clear when the software will be made broadly available. The Broad has also begun running jobs on the Google-based GATK and plans to move all of its analysis projects to the cloud this month. The Google implementation has received a "tremendous amount of interest," David Glazer, Google Genomics director, said in a statement, adding that it has been used by researchers at the Broad and elsewhere. "We have run many thousands of samples through this pipeline for a variety of users. We've also optimized the pipeline to make it remarkably cost effective."
The institute is also working with the other cloud partners to get the software implemented on their platforms. "What we are announcing now is what we expect will be the beginning of something that will grow over time," Lee McGuire, Broad's chief communications officer, told GenomeWeb. "We are creating a platform that can be adopted across different types of clouds. We imagine there will be different offerings down the road. This is just what we are working on right now."
Pricing details are still being discussed and the exact costs will vary depending on the cloud provider. The Broad will also continue to offer GATK as an on-premises solution to existing and new users who prefer to download and deploy the system on their local infrastructure. The on-premises tool is free for academic users, while commercial clients have to pay for licenses. It is not clear whether cloud customers will need licenses for the software. McGuire told GenomeWeb that Broad does not expect to require a license, since users would not actually be downloading or installing GATK locally, but all those details are still being determined.
Working on version 4
Meanwhile, the Broad is also working with Cloudera, Intel, and Google to build the next iteration of the GATK. Version 4 will also be made available on the cloud when it is completed. Specifically, the partners are developing two versions of the tool, one of which will be based on the Apache Spark distributed computing framework. The Spark-based GATK, which Cloudera is developing, will make it easier for users to parallelize genomic analysis tasks. "We investigated a lot of other possibilities for the underlying infrastructure and Spark was really just a winner," the Broad's Banks told GenomeWeb. "It is the most scalable and parallelizable."
The partnership with Cloudera is an extension of an existing relationship between the company and the Broad, which is a customer of Cloudera's Enterprise software platform.
Also planned for GATK4 is new functionality for identifying structural and copy number variation in cancer. This will include capabilities for calling somatic single nucleotide polymorphisms, insertions, and deletions, Banks said. The partners have also developed a tool for calling copy number variants in somatic exome sequences, and are working on one for calling copy number variants in somatic whole genomes. In addition, they are developing a tool for structural variation copy number calling and one for identifying inversions, he said.
Besides contributing to efforts to develop a Spark-based implementation of GATK, Intel is working with the Broad to optimize the performance of GATK4, Ketan Paranjape, Intel's general manager of life sciences, told GenomeWeb. At the Bio-IT World Conference being held in Boston this week, he unveiled a series of tools that the company has developed with the Broad.
Intel worked with the institute last year to launch an optimized version of the current generation of the software and was able to make it run 40 to 50 times faster on Intel machines than was previously possible. The company is also helping Broad implement the current version of the GATK software on the different cloud platforms, he noted.
To help simplify the task of executing the GATK on different clouds, the Broad and Intel worked together to extend the capabilities of the Broad's workflow execution engine, called Cromwell, which is designed to help researchers launch genomic pipelines on private or public clouds in a portable and reproducible manner. New features include support for multiple workflow languages as well as the ability to execute different analysis jobs on multiple platforms simultaneously. The engine is also able to select the optimal route for executing a given analysis task and the most appropriate hardware resources for running it, while avoiding redundant steps.
"We had to understand the different workflow languages and tweak them to create that framework where [jobs] could run seamlessly on these different clouds," Paranjape said, adding that the partners have now developed a standardized application programming interface that is able to communicate with the different clouds. "From a user perspective, [the] command line looks identical, and you can start [the process] off in any of the clouds," he said.
The Broad and Intel also worked on an improved method for storing and processing variant data called GenomicsDB. It is built on TileDB, an array database system co-developed by MIT and Intel that is designed to hold sparse datasets in a form that makes them easier to analyze. TileDB was initially developed for use by the artificial intelligence community, but the developers saw a potential application in the genomics space, according to Paranjape.
"We took TileDB to Broad and started playing with some of their variant-calling pipelines," he said. In one instance, researchers were trying to perform variant calling on 8,000 samples at a time. While that task would previously have taken days to complete, with GenomicsDB they were able to complete it in a much shorter time frame.
According to Banks, the Intel-optimized tools have helped researchers at the institute better organize and process germline variant data. "Before GenomicsDB, we had to move files and data across networks to bring them all in-memory," he said. "It was a slow and difficult process. Intel designed this database that allows one to organize data better and smarter in-memory and more efficiently."
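To illustrate why a sparse layout suits variant data, here is a minimal Python sketch. It is not the TileDB or GenomicsDB API: the class name `SparseVariantStore`, its methods, and the sample data are all hypothetical, and real array databases rely on tiled on-disk formats, not an in-memory dictionary. The only idea shown is that cells matching the reference genome are never stored, so storage grows with the number of variant calls, not with samples multiplied by positions.

```python
from collections import defaultdict


class SparseVariantStore:
    """Toy sparse 2-D array: rows are samples, columns are genome positions.

    Hypothetical illustration only; not the TileDB or GenomicsDB API.
    """

    def __init__(self):
        # position -> {sample: genotype}; only non-reference cells are kept.
        self._by_position = defaultdict(dict)

    def add_call(self, sample, position, genotype):
        """Record one non-reference genotype call."""
        self._by_position[position][sample] = genotype

    def calls_at(self, position):
        """Return {sample: genotype} for samples with a variant at position."""
        return dict(self._by_position.get(position, {}))

    def calls_in_range(self, start, end):
        """Return sorted (position, sample, genotype) tuples, start <= position < end."""
        return [
            (pos, sample, gt)
            for pos in sorted(p for p in self._by_position if start <= p < end)
            for sample, gt in sorted(self._by_position[pos].items())
        ]

    def stored_cells(self):
        """Number of cells actually stored."""
        return sum(len(samples) for samples in self._by_position.values())


# Made-up example data: three variant calls across two samples.
store = SparseVariantStore()
store.add_call("sample_A", 1_000_100, "0/1")
store.add_call("sample_B", 1_000_100, "1/1")
store.add_call("sample_A", 1_000_250, "0/1")
```

In this toy example, a dense table covering the same two samples across a 1,000-position window would hold 2,000 cells, while the sparse store holds three.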
The partners plan to release TileDB, GenomicsDB, and Cromwell to the public as open-source tools in the near future, Paranjape told GenomeWeb. TileDB and Cromwell are already being used by researchers involved in the Collaborative Cancer Cloud, a system developed by Intel and Oregon Health & Science University that is designed to help hospitals and clinical centers securely share their oncology datasets. Dana-Farber Cancer Institute and the Ontario Institute for Cancer Research have signed on to participate in pilot projects aimed at testing the efficacy of that system.