NEW YORK (GenomeWeb) – The Broad Institute and Google Genomics are partnering to provide cloud-based compute, tools, and infrastructure that they believe will help researchers in the genomics community better store, process, and analyze the large quantities of data their projects generate.
One of the first steps of the partnership is to make the Broad's Genome Analysis Toolkit software available as a managed service on the Google Cloud. The partners are releasing an alpha version of the service this week for testing by a limited set of users. This first iteration will offer access to the core pieces of the GATK infrastructure — best practice pipelines and tools, along with full documentation, for processing human DNA sequence from FASTQ or BAM files all the way through to VCF files — with additional pieces of the system possibly available at a later date.
This arrangement is intended to make the well-known software available for data analysis while eliminating the accompanying hassles of downloading and maintaining the infrastructure internally. Although the Broad has a team and mechanisms in place to address questions and meet needs associated with using the GATK, researchers still need significant internal expertise in order to actually run the tools, Eric Banks, director of data science and data engineering at the Broad, told GenomeWeb. The cloud also makes the tool more accessible, he added. The significant hardware requirements for running the GATK at scale may make the toolkit difficult to use for smaller research laboratories that lack dedicated compute infrastructure. But with GATK in the cloud, virtually anyone with an account can access it, he said.
There are no specific requirements for participating in the GATK alpha. Interested researchers can simply sign up to try out the system on the Google Genomics site. They'll be put on a waitlist, and Google will work its way down that list, gradually granting access to the platform, gathering user feedback, and using those ideas and suggestions to improve the service. A beta round of testing will follow, and then a full launch at a yet-to-be-determined date, David Glazer, director of Google Genomics and Cloud Platform, told GenomeWeb.
Alpha testers will have to pay Google's standard prices for cloud compute and storage capacity, but access to the GATK service itself will be free for the duration of the testing period, Glazer said. There will, however, be a cost associated with using the service when it becomes commercially available, he said. The partners are currently mulling exactly what that pricing structure will be. There will also be no differences between the cloud-based and local versions of the GATK. Banks told GenomeWeb that the partners are currently discussing mechanisms to ensure that updates made to the local version of the tool are also included in the cloud service.
Meanwhile, this arrangement will not affect existing access mechanisms to the GATK, the Broad said. The tools and source code will continue to be available for download at no cost to academic and non-profit users, and for-profit businesses will continue to be able to purchase licenses to the GATK directly from the Broad. The institute began handling its own commercial licensing and support for GATK after wrapping up a pre-existing arrangement with cloud infrastructure vendor Appistry for that purpose last April. At present, it determines pricing for licenses on an individual basis but plans to provide more concrete details for users in the months ahead.
Part of the impetus for this partnership is simple proximity — Google has an office located across the street from the Broad — but more importantly, both partners see this as an opportunity to merge strengths in computing and infrastructure with years of research experience and a solid understanding of the genomics arena. Increasingly, "the worlds of data science and life science are coming together and there is a growing opportunity and need [to marry] expertise in working with large amounts of information [and] … working with the actual biology and health," Google's Glazer said. "Putting those together is what's going to lead the next round of breakthroughs and productivity across the whole field."
With an eye towards participating in that union, Google launched the Google Genomics platform in 2014, seeking to use its skills and capabilities to provide effective systems for managing, organizing, and computing on large, complex genomic datasets.
So far, the company has made significant inroads into the community, bagging a number of high-profile projects within the first year of its existence. In addition to the current partnership with the Broad, Google is also working with the institute's collaborators at the University of California campuses in Berkeley and Santa Cruz to develop FireCloud, one of three systems selected for the National Cancer Institute's Cancer Genomics Cloud pilots, an NCI-funded initiative to build sustainable computing infrastructure for accessing and analyzing genomic and related data from its funded research projects. Google is also the platform of choice for the system being developed by researchers at the Institute for Systems Biology, also for the NCI initiative. The only other NCI pilot system, being developed by Seven Bridges, uses Amazon infrastructure. Google Genomics is also involved in the Global Alliance for Genomics and Health (GA4GH); in fact, its application programming interface is an implementation of the Genomics API being developed by the GA4GH's Working Group.
Google also partnered with Autism Speaks to create an open database of genomic information on autism spectrum disorder that will be available via the Google Cloud platform. The data comes from the organization's 10,000 Genomes Project, an effort to sequence the whole genomes of 10,000 individuals in families affected by autism. The database, which is now live in the Google Cloud, currently contains about 850 genomes, Glazer said. The current goal is to have all 10,000 genomes uploaded to the database by early next year. Bioinformatics consultancy BioTeam has developed a portal to the database that offers easy access to the autism data for biologists with no bioinformatics experience but also includes standard bioinformatics tools that more experienced researchers can use to explore the data, such as tools from the R statistical software package, Glazer noted. Those users can also use Google APIs to write and connect their own analysis tools to the data.