SALT LAKE CITY (GenomeWeb) – Google is using its collaboration with Autism Speaks and autism researchers as a test case to demonstrate how its resources can make large human genomic datasets widely available for analysis, and plans to support similar projects in the future.
In a session at the American College of Medical Genetics and Genomics annual clinical genetics meeting here last week, David Glazer, a director at Google Genomics, provided an update on the MSSNG project — a collaboration between Autism Speaks, The Hospital for Sick Children in Toronto, and Google — as well as a glimpse into Google's future genomics plans.
The goal of the MSSNG project, previously known as The Autism Speaks Ten Thousand Genomes Program (AUT10K), is to make whole-genome sequencing data from autism families and rich phenotype data from autism patients available in a single database to qualified researchers, using the Google Cloud platform and Google Genomics.
Autism Speaks originally announced the partnership with Google last summer and launched the MSSNG project in December. In conjunction with the publication of a study in Nature Medicine by Stephen Scherer and colleagues from The Hospital for Sick Children in January, the collaborators uploaded genomic data from the first 1,000 or so samples to the MSSNG database. The goal is to expand the resource to include data from more than 10,000 individuals from families affected by autism.
Researchers wanting to access the genome sequence data need to apply and sign an agreement, and the first set of investigators was approved last week, Glazer said.
Autism Speaks shoulders the cost of storing and hosting the genomic data on Google Cloud, he said, which comes to $25 per year for a typical human genome, or about 100 gigabytes of data.
Researchers are billed for running queries on the data, but not for loading and exporting datasets. According to the Google Genomics website, calling variants in a set of 500 million reads, for example, requires one million application programming interface (API) operations, at a cost of $1.
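To put those figures in perspective, the arithmetic can be sketched in a few lines of Python. This is a rough illustration only: the prices and sizes are the ones quoted above, while the function names and the assumption that costs scale linearly with the number of genomes or reads are ours, not Google's published billing model.

```python
# Figures quoted in the article. Linear scaling is an assumption made
# for illustration; it is not a statement of Google's actual billing rules.
STORAGE_USD_PER_GENOME_YEAR = 25.0    # typical human genome, per year
GB_PER_GENOME = 100                   # approximate size of one genome
USD_PER_MILLION_API_OPS = 1.0         # price of one million API operations
READS_PER_MILLION_OPS = 500_000_000   # reads covered by one million operations


def storage_cost_per_year(genomes):
    """Estimated yearly storage bill in USD for a number of genomes."""
    return genomes * STORAGE_USD_PER_GENOME_YEAR


def storage_size_gb(genomes):
    """Estimated total data volume in gigabytes."""
    return genomes * GB_PER_GENOME


def variant_calling_cost(reads):
    """Estimated API cost in USD for calling variants on a set of reads."""
    million_ops = reads / READS_PER_MILLION_OPS
    return million_ops * USD_PER_MILLION_API_OPS


if __name__ == "__main__":
    target = 10_000  # the project's stated goal
    print("Storage per year (USD):", storage_cost_per_year(target))
    print("Data volume (GB):", storage_size_gb(target))
    print("Variant calling, 500M reads (USD):", variant_calling_cost(500_000_000))
```

Under these assumptions, the 10,000 genomes the project is aiming for would occupy roughly one million gigabytes and cost about $250,000 per year to store.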
Glazer explained that the Google Genomics platform caters to three categories of researchers, all with different interests in the data.
For biologists, who might be interested in questions such as which subjects carry a certain variant, and what their phenotypes are, Google will provide a searchable web interface that is "not data heavy" but does allow researchers to go back to the raw reads. The interface is currently being built by the BioTeam, a bioinformatics consultancy, Glazer said.
For bioinformatics researchers, who "love data and statistics," Google offers interactive analytics using R, a popular programming language for statistical computing. Users can run queries and generate data visualizations in R, and run "very quick analyses" that only take a few seconds, he said.
Finally, bioinformatics programmers who want to program their own analyses or write their own software can do so. Google offers open source tools and also allows for custom projects, for example principal coordinates analyses, which are very compute-intensive. Google has used data from the 1000 Genomes Project as a test set to validate a number of analyses, he said.
According to Glazer, Google regards the MSSNG project as the first branch of a "tree of life" where each branch comprises a large dataset of human genomes from a specific cohort.
Once the "mechanical parts" of making large datasets available for analysis are worked out with the MSSNG database, Google is interested in launching additional projects for other conditions with similar types of datasets, he said.