NEW YORK (GenomeWeb) – With its launch of the Google Genomics Platform earlier this year, Google is seeking to bring its considerable infrastructure, data management, and storage capabilities to bear on a perceived need in the biomedical research community for more effective systems for managing, organizing, and computing on the large, complex datasets that characterize the domain.
Google has dipped its toe into the genomics arena at least once in the past. In 2012, the company partnered with bioinformatics researchers at the Institute for Systems Biology to evaluate whether the Google Compute Engine, its cloud computing infrastructure, could handle life science computing requirements. Then, observing the rapid growth in sequencing and sensing an opportunity to extend its expertise to a new business area, Google set up a dedicated genomics unit a little over a year ago, Jonathan Bingham, product manager, Google Genomics, told GenomeWeb. The team was tasked with finding the best ways to apply tools that Google had developed and currently uses in other divisions to analyzing, querying, and managing genomic data in a manner optimized for the company's cloud infrastructure.
Bingham's group is responsible for developing the Google Genomics platform, a web-based application programming interface (API) that is built on Google Compute Engine and other infrastructure owned by the internet services provider. The Google API is an implementation, optimized to run on the Google cloud, of the Genomics API being developed by the data working group of the Global Alliance for Genomics and Health (GA4GH). Google joined the alliance in March this year.
Bingham's team launched an alpha version of Google Genomics in February this year and now offers a beta version of the system to customers. Essentially, the platform enables users to store, process, explore, and share reads, alignments, and variant calls using Google's general purpose cloud infrastructure and other Google-built technologies. These include BigTable, which is used in applications such as Google Search, Google Maps, and Gmail, and Dremel, a Google-developed distributed system for querying very large datasets. "What we are offering is a platform for genomics which is higher level than just infrastructure-as-a-service," Bingham explained. "The goal is to provide a higher level interface" that takes the "general purpose capabilities that exist in [the] Google cloud platform as well as one that exists within Google in support of our other services … and bring those to bear on genomics."
Google Genomics is set up to handle raw and aligned sequence reads as well as variant calls, and the team is working on adding phenotypic information to the mix, Bingham told GenomeWeb. In terms of tools, the system offers easy access to the best practices pipeline of the Broad Institute's Genome Analysis Toolkit (GATK), which uses the Burrows-Wheeler Aligner for sequence alignment and HaplotypeCaller for variant calling. It also includes access to GA Browse, a visualization application that displays publicly available data from sources such as the National Center for Biotechnology Information and the European Bioinformatics Institute.
Google also hosts and offers multiple access options to data from the 1000 Genomes Project, the Illumina Platinum Genomes, and reference datasets. Also included in the system are tools for running data searches, batch processing, and more. As has always been the case, users have the option to create and run their own bespoke pipelines as virtual machines directly on the Google Cloud.
Bingham also said that his team intends to work with the GA4GH to determine which kinds of datasets to support in its infrastructure moving forward. However, if customers have specific kinds of data that they're interested in analyzing using the infrastructure, Google's team is willing to work with them to explore those datasets as well, he added.
Besides alignment and variant calling, Google Genomics lets users compare genetic cohorts and compute statistics such as transition/transversion ratios and allele frequencies. The system also supports researchers who want to use external statistical programs such as R in their analyses, and lets researchers write their own code to run operations such as principal component analysis and tests of Hardy-Weinberg equilibrium, according to Bingham. Researchers can share their data publicly, restrict access to close collaborators, or keep it to themselves; the default access setting for all datasets is private. The company charges for storage and for queries and lists these prices on its site.
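For readers unfamiliar with the statistics named above, the following is a minimal, self-contained Python sketch of how a transition/transversion ratio, an allele frequency, and a Hardy-Weinberg chi-square statistic are typically computed. It is illustrative only and does not use or represent the Google Genomics API; the function names and the toy inputs are hypothetical, and the input is assumed to be biallelic single-nucleotide variants with diploid genotype counts.

```python
# Illustrative sketch only: these helpers are hypothetical and are not part
# of the Google Genomics API. Inputs are assumed to be biallelic SNVs.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Return True for purine<->purine or pyrimidine<->pyrimidine changes."""
    bases = {ref, alt}
    return bases <= PURINES or bases <= PYRIMIDINES


def ti_tv_ratio(snvs):
    """Ratio of transitions to transversions over (ref, alt) base pairs."""
    ti = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    tv = len(snvs) - ti
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv


def allele_frequency(n_ref_hom, n_het, n_alt_hom):
    """Alternate-allele frequency from diploid genotype counts."""
    total_alleles = 2 * (n_ref_hom + n_het + n_alt_hom)
    return (n_het + 2 * n_alt_hom) / total_alleles


def hwe_chi_square(n_ref_hom, n_het, n_alt_hom):
    """Chi-square statistic comparing observed genotype counts with the
    counts expected under Hardy-Weinberg equilibrium (p^2, 2pq, q^2)."""
    n = n_ref_hom + n_het + n_alt_hom
    q = allele_frequency(n_ref_hom, n_het, n_alt_hom)
    p = 1 - q
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_ref_hom, n_het, n_alt_hom)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Toy data: two transitions (A>G, C>T) and two transversions (A>C, G>T).
toy_snvs = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(toy_snvs))
print(allele_frequency(25, 50, 25))
print(hwe_chi_square(25, 50, 25))
```

In practice such statistics would be computed over millions of variant records, which is the scale at which a distributed query system becomes useful; the sketch shows only the arithmetic.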
The Google Genomics platform is being used in at least one of three systems being developed under the auspices of the National Cancer Institute's Cancer Genomics Cloud pilots initiative. That platform is being developed by researchers from the Institute for Systems Biology in collaboration with Google and SRA International.
Also, in June this year, Autism Speaks announced that it will collaborate with Google to create a database of genomic information on autism spectrum disorder that will be an open resource for autism researchers and will be available via the Google Cloud Platform. The data come from the organization's 10,000 Genomes Program, an effort to sequence the whole genomes of 10,000 individuals in families affected by autism.
Bingham declined to disclose details about future plans that Google might have for the genomics space but indicated that the company believes it has much to offer to the biomedical domain as it tries to grapple with exponential data growth. "Having been through this exponential curve multiple times in other product areas, I think we are in a great position to be able to help the community," he said.
And, so far, the community has been "receptive" to Google's involvement in genomics and particularly its participation in the GA4GH, according to Bingham. "I think there is a lot of excitement that we are … making some of Google's infrastructure available to the genomics community," he said. Moving forward, "we are looking to work with experts in the field to figure out where those opportunities are so that we can leverage all that Google has done in the past and make that available to researchers to use in the space of genomics and bioinformatics."
He also sees companies that offer similar cloud-based infrastructure and analysis products, such as DNAnexus and Seven Bridges, as potential partners for Google Genomics rather than competitors for customers.