NEW YORK (GenomeWeb) – Autism Speaks has unveiled the first iteration of a web portal developed in conjunction with BioTeam to provide easy access to data collected under the auspices of the MSSNG project, an ongoing effort between Autism Speaks, The Hospital for Sick Children in Toronto, and Google.
Autism Speaks announced the MSSNG project, formerly named the Autism Speaks Ten Thousand Genomes Program, in June 2014, seeking to create an open repository of whole genome sequence, phenotype, and clinical information from 10,000 individuals and families with autism and make it available not just to researchers in academia and industry but also to a crop of other potential users including clinicians, genetic counselors, and families.
The organization tapped Google's genomics arm to provide a platform for storing and querying the data that would be collected as part of the project. It then tapped BioTeam to design a user-friendly interface that would give more casual users the ability to query and explore the MSSNG data.
The portal, which goes live this week, is "a pretty critical step in reaching our goal of diversifying the type of individual that can access this data," Matthew Pletcher, vice president of genomics for Autism Speaks, told GenomeWeb, offering a simpler data access point to the more specialized infrastructure that Google is providing for the platform. Through the portal, genetic counselors, for example, can run simple queries to find existing information about variants found in their patients including details about functional outcomes, associated clinical phenotypes, and more, he said.
Reaching a broader audience is of particular import to the autism advocacy organization, which hopes MSSNG will help change the ways that ASDs are currently discussed and diagnosed. "What we are really after here is to begin to reframe this umbrella diagnosis of autism [and] to use this data to begin to define genetically-defined subtypes of autism," Pletcher said. As an example, he highlighted the CHD8 gene; researchers at the University of Washington have identified a variation in the gene that is associated with some fraction of autism cases. Because of UW's research, "there is now an understanding of the specific health risks that are associated with that particular genetic diagnosis that [the general ASD designation] did not provide," he said. "On top of that, especially from the perspective of an advocacy group like Autism Speaks, it has provided the opportunity for the families who have the same genetic mutation and therefore have a lot of the same life experiences, to connect with each other and share experiences, knowledge, and information... in a way that you couldn't do," he added.
To date, the MSSNG project has sequenced 3,540 genomes and made 1,715 of these available through the repository, with the others being prepped for release once they have been consented, Pletcher told GenomeWeb. Macrogen handles the sequencing component for the MSSNG project and also does a first round of variant calling on the raw sequence. However, once they receive the data, the MSSNG collaborators re-call all the variants in the data using the Google Genomics API, he said. Currently, "we are on track to have our goal of 10,000 genomes in the database by Q1 of 2016."
Users have access to over 17 billion variants, which have been annotated using the ANNOVAR pipeline, William Van Etten, a senior scientific consultant with BioTeam, told GenomeWeb. Also available is phenotype information collected from participants and data pulled in from public resources such as the Online Mendelian Inheritance in Man database, the Human Phenotype Ontology, RefSeq, and more, Van Etten said.
The MSSNG portal runs on compute engine servers that are run by Google. The data is accessed either through Google's BigQuery, which runs SQL-like queries against the data in the repository, or through Google's implementation of the Global Alliance for Genomics and Health's application programming interface, which uses a RESTful API to run research queries on the data, Van Etten said. Which of these two resources the portal uses depends on the type of query being run. BigQuery is more suited for searching and filtering large datasets, and is used to run large queries that call for traversing all 17 billion variants contained in the resource, he explained. The GA4GH API, on the other hand, works better for more focused queries, such as a search for variants in a specific genomic location. For portal users, the search infrastructure selection is made automatically when queries are entered so they don't need to make any special selections, he said. More skilled users who are accessing MSSNG via Google's frontend, on the other hand, would need to make the selection for themselves.
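As a rough illustration of the automatic selection Van Etten described, the sketch below routes a query to one of the two back ends. It is not the portal's code: the `VariantQuery` class, its fields, and the `choose_backend` function are hypothetical names, and the rule itself (region-bounded queries go to the GA4GH API, everything else to BigQuery) is an assumption drawn only from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantQuery:
    # Hypothetical query description; these fields are illustrative
    # and are not taken from the MSSNG portal.
    chromosome: Optional[str] = None
    start: Optional[int] = None
    end: Optional[int] = None
    effect: Optional[str] = None  # e.g. "frameshift"


def choose_backend(query: VariantQuery) -> str:
    """Pick a back end for a query.

    Assumed rule: a query bounded to one genomic region is "focused"
    and goes to the GA4GH API; anything else may need to scan the
    whole variant set and goes to BigQuery.
    """
    has_region = (
        query.chromosome is not None
        and query.start is not None
        and query.end is not None
    )
    return "ga4gh" if has_region else "bigquery"
```

Under this assumed rule, a search limited to a stretch of one chromosome would go to the GA4GH API, while a repository-wide filter on variant effect would go to BigQuery.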
Users can run searches involving lists of genes based on gene symbol, gene or phenotype ontology attributes, or OMIM attributes. There are also options to search for variants based on genomic location, format, and gene, and to filter variants by sample, subject, and significance. Researchers can also filter variants based on genomic effects such as frameshifts, splice sites, loss of function, and missense variations, he said. Since the repository contains information from the families of individuals, researchers can also conduct trio-based studies and explore de novo, autosomal, X-linked, and compound heterozygous variants that appear in samples. There's also an option to download the results of MSSNG queries for further analysis and exploration, Van Etten said. Users also have access to an implementation of the Broad Institute's Integrative Genomics Viewer that lets them visualize the results of searches run through the GA4GH API.
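To make the trio-based idea concrete, here is a minimal sketch of how a de novo candidate can be flagged at a single site. It is a generic illustration, not MSSNG's method: the function name and the genotype encoding (a tuple of allele indices, with 0 for the reference allele) are assumptions, and it presumes all three family members have a called genotype at the site.

```python
from typing import Tuple

Genotype = Tuple[int, int]  # allele indices at one site; 0 = reference


def is_de_novo_candidate(
    child: Genotype, mother: Genotype, father: Genotype
) -> bool:
    """Return True if the child carries a non-reference allele
    that neither parent carries at this site."""
    parental_alleles = set(mother) | set(father)
    return any(
        allele != 0 and allele not in parental_alleles for allele in child
    )
```

A real analysis would also weigh read depth and genotype quality before treating such a site as a true de novo event; the check above covers only the inheritance logic.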
Researchers interested in using MSSNG information apply for access by filling out two forms. The first of these asks users for information about themselves and their affiliations, and calls for a description of planned projects and uses of the autism data. The second form is a contract that dictates the terms that researchers must agree to abide by should their access requests be granted. Applications are reviewed by researchers at Autism Speaks and The Hospital for Sick Children, Pletcher said. If the request is approved, the applicant is added to a Google group and given access to the data. So far, more than 60 researchers at 26 institutions have been approved to access the data.
Moving forward, Autism Speaks plans to actively promote use by researchers outside the autism space, Pletcher said, noting that "roughly two-thirds of the genomes in the database are from healthy individuals and they provide a reference database for anyone looking at any disease area." In fact, two separate research groups studying pediatric cancer and cystic fibrosis have expressed interest in accessing the data. As a result, the MSSNG project collaborators are currently working on re-consenting individuals who submitted samples for the database so that their data can be used for these and other research purposes.
"We believe that the more scientists that work with our data, the better it will be for our community, even if the question they are asking isn't directly related to autism," he told GenomeWeb. "You never know where the important insight is going to come from and we may find that something we weren't even thinking about could be the key."
Moreover, even if these insights don't connect back to autism "the discoveries are still being made off the back of our families' genetic data so those discoveries will still have direct relevance to them and be of value," Pletcher said. So far the groups that have requested access to the data have not been given permission since Autism Speaks is still re-consenting participants. "We hope to have the re-consenting done by the end of the year so we can get these groups into the database," he said.
The partners plan to expand the portal to include a separate interface designed specifically to help families and individuals explore the results of their own genetic tests and connect with others who have similar gene variants and/or issues of interest. They also intend to add other new features and tools to the portal based on feedback from users.