NEW YORK (GenomeWeb) – Researchers at Stanford University and their collaborators at other institutions are developing an open repository of data from individuals living with autism spectrum disorders and their families that they hope will help push ASD research forward.
The so-called Hartwell Autism Research and Technology Initiative (iHART) is funded by a $9 million grant from the Hartwell Foundation. The goal of the project — which is also supported by the Simons Foundation — is to provide easy access to phenotypic, proteomic, metabolomic, and genomic datasets along with brain activity measurements and imaging, gut microbiome data, blood-based biomarkers, physicians' notes, diagnostic tests, and treatment protocols from potentially 10,000 individuals with autism and their families. They'll also develop a portal to the data that will incorporate tools for crafting and running research queries that draw on the various kinds of information contained in the platform.
Dennis Wall, an associate professor of systems medicine in Stanford's pediatrics department and director of the iHART platform, told GenomeWeb that so far the researchers have collected more than 2,000 genome sequences as well as clinical and other information, and hope to have a total of 5,000 genomes by early next year. The data will be accessible to researchers as well as individuals outside the research community such as families participating in the study.
Currently, the plan is to open up the portal for use by spring next year but possibly sooner, Wall said. For now, the Stanford team is focused on getting its first batches of data together, exploring the infrastructure requirements on the cloud, and building out the most appropriate system, he told GenomeWeb.
The researchers are also crafting guidelines, patterned after those used by the National Institute of Mental Health for data access, which will govern the use of the iHART data. Once potential users agree to the guidelines, they'll be able to either download datasets and analyze them locally or run their analyses in the cloud. Datasets would not be restricted to just ASD research use but could also be used to study genetic disorders.
Datasets collected as part of the endeavor will be stored in a relational database-like infrastructure hosted and maintained on the cloud. Currently, the developers are exploring Google-developed computing and storage infrastructure and tools such as BigQuery and Google Compute Engine, but they are also considering other options such as Amazon Web Services and Illumina's BaseSpace. "Our goal is to find the best solution for ... the computationally initiated and uninitiated alike," Wall said.
The planned systems will also include tools and algorithms developed by the Stanford researchers, Wall said. For example, his team has developed a pipeline that processes whole-genome sequence data, from BAMs to annotated VCFs, faster on the cloud by running tasks in parallel across multiple compute nodes; the pipeline will be included in the iHART infrastructure. It is described in a paper that will be published next month in BMC Medical Genomics.
In addition to existing samples already available at Stanford, the researchers are working in collaboration with the laboratory of Daniel Geschwind at the University of California, Los Angeles and with researchers at the New York Genome Center to sequence and gather additional datasets, he said. Some data is coming from families that have multiple individuals with autism, and other datasets are being collected from families with monozygotic twins and female patients as well, he said. Researchers also are collecting sequence and associated data from parents and unaffected siblings.
As part of their efforts to populate the database, the researchers are using applications on iOS and Android devices to recruit study participants, specifically targeting families with at least two children around six years of age, one with autism and one without, Wall said. They are collecting spit and fecal samples from these families to obtain microbiome and SNP information, as well as video observations of affected children, he said. The data will help researchers investigate interactions between genes and environments in the context of ASDs. They also hope to use apps to recruit unaffected patients who show behavioral symptoms also exhibited by individuals with ASDs, such as learning or speech delays, Wall said, and to use their information to test computational models designed to help physicians better identify autism cases.
A separate but very similar project to iHART is being run by Autism Speaks. The MSSNG project, formerly called the Autism Speaks Ten Thousand Genomes Program, was launched last year and also seeks to collate and provide access to genomic, phenotypic, and clinical data from 10,000 autism patients and their family members. It is a collaboration between Autism Speaks and the Hospital for Sick Children's Centre for Applied Genomics.
Like the Stanford team, Autism Speaks is working with Google's genomics arm to create a platform for storing and querying the collected data. The autism advocacy group also worked with bioinformatics consultancy BioTeam to design an interface that would give more casual users, such as genetic counselors, easy access to the data. As of early September, the project claims to have sequenced over 3,500 genomes and made just over 1,700 of these available on Google's cloud platform.
It's not entirely clear what differences, if any, exist between the two ASD efforts. "As I understand it, MSSNG is focused on a collection of samples to sequence and is not looking for the community to share or combine data at the moment," Wall told GenomeWeb.
Matthew Pletcher, vice president and head of genomic discovery for Autism Speaks, acknowledged the similarities between the two projects, including their shared goals, but also noted that access to more data is a good thing for the ASD research community. "iHART is quite complementary to what we are doing with MSSNG," he said. "If we can find a way to bring that all together in one place, I think in the end the community benefits from that."
Wall also believes that both projects can contribute meaningful resources to ASD research and said that he is open to collaborations with Autism Speaks and any other groups currently working in the space. "My hope is that these two efforts will join forces and become a unified front that really generates the data size that the field needs to make clinically impactful gains in this area," he told GenomeWeb.
In response to a question about whether the projects might have patients in common, Wall and Pletcher each told GenomeWeb that the two groups are taking steps to avoid duplicating datasets in their respective resources as much as possible. As part of those efforts, Wall and his collaborators have publicly shared information about the samples they have collected, he said.
Wall also told GenomeWeb that his group plans to look into ways of linking their datasets to those managed by the National Institutes of Health's National Database for Autism Research (NDAR). Options include depositing some or all of their data in the NDAR repository or simply connecting the two databases and enabling iHART users to pull information from NDAR as needed.