NEW YORK (GenomeWeb) – Having renewed its role as host of the National Institute on Aging Genetics of Alzheimer's Disease Data Storage Site (NIAGADS), the University of Pennsylvania's Neurodegeneration Genomics Center (PNGC) aims to work with early users and investigators to provide additional patient data for eventual clinical use.
Li-San Wang, an associate professor at UPenn and PNGC's codirector, explained that his team began hosting NIAGADS in 2012, with the idea to build a one-stop access point dedicated to Alzheimer's disease genetics. At the PNGC, the researchers use NIAGADS to store genotyping and sequencing data from the genomes of Alzheimer's disease patients.
Funded by the National Institute on Aging (NIA), NIAGADS has acted as a research resource — including genotyping, phenotyping, and raw sequencing data — for the NIA's Alzheimer's Disease Genetics Initiative and other outside research groups since 2006.
"We follow up with participating cohorts, track phenotypes and sequencing data production, and share [the results] with the [research] community," Wang said. "We also host genome-wide association studies datasets from other groups [and] full genetic database tools, such as workflows used to harmonize the sequence data."
Wang explained that the National Institutes of Health selected his group in 2012 as the site for NIAGADS because of his team's experience in Alzheimer's disease genetics. He pointed out that his group had previously contributed samples to different genetic studies that focused on Alzheimer's disease, and had worked on human subject research, protection, and regulation policies.
"At about the same time, we started to use Amazon Cloud, which gave us experience and skills to perform large-scale production effectively," Wang noted. He believes that there is not "a single educational institution that will have the cluster that is big enough to process tens of thousands of genomes."
While UPenn has acted as the site for NIAGADS since 2012, NIA debated moving the project's location in 2015 as it neared the end of its partnership with UPenn. After applying for the project's renewal in 2016, UPenn received approval from NIA to continue hosting NIAGADS until 2022.
In January 2017, the database of Genotypes and Phenotypes (dbGaP) initiated a policy change to stop receiving large sequencing files — raw sequencing reads — from projects like UPenn's Alzheimer's Disease Sequencing Project (ADSP), which aims to identify both genomic variants that contribute to increased risk of developing Alzheimer's disease and those that protect against the disease.
"We were scrambling to find a solution in order to implement the same process [that maintained] human subject protection," Wang explained.
Exploring another way to share the files, NIA decided that NIAGADS would develop a data sharing method that would also be compliant with NIH policy on informed consent and human subjects' genetic data protection. Instead of continuing to partner with dbGaP, NIAGADS reimplemented its system to both protect patient information and directly offer data to end users through its database.
Since developing the newly protected data sharing system, Wang's team has sequenced the genomes of more than 5,000 samples from Alzheimer's disease patients and can analyze about 1,000 genome sequences per week.
Wang's group at the Genome Center for Alzheimer's Disease (GCAD), which is part of PNGC, is partnering with providers of existing research cohorts that have DNA and phenotype information available to process the data. After Wang and his researchers review whether DNA is available for the samples, the samples are sequenced at collaborating sequencing centers.
In addition, Wang's team is working with other groups that generate DNA sequencing data, such as the Alzheimer's Disease Neuroimaging Initiative, a global collaboration with the largest studies using brain MRI, to process the data for downstream research use.
"With Alzheimer's, there are multiple reasons why its [genetics] is complex, especially due to its heterogeneity, and there are lots of limitations on what you can study," Wang noted. "For example, you can't perform biopsies to figure out what's wrong, and building a cohort can be challenging since the disease occurs in very old patients."
According to Wang, NIAGADS currently has two major components: a genomic database and a data sharing platform. Researchers can search for specific genes on the genomic browser, in addition to examining genome ontology locations. Wang noted his team will upload findings from GWAS and from the Alzheimer's Disease Sequencing Project to the database.
In addition, NIAGADS' data sharing platform will allow researchers to access "individual level data," including phenotypes, genotypes, and raw data such as sequencing reads and genome-level variants for downstream analysis. Wang noted that the database uses a data portal that researchers can access to examine raw data files. The portal will also display each file's quality metrics and allow researchers to view the file's size, as well as which genome center generated the data, before selecting which file to download.
Early user Badri Vardarajan, an assistant professor of bioinformatics at Columbia University, is actively using the NIAGADS database for his own research on Alzheimer's disease genetics. He explained that researchers can select certain epidemiological traits, such as race, ethnicity, or age. Outside entities, such as companies interested in using the data for commercial use, can select patients based on their exclusion criteria and consent levels.
Vardarajan said that once researchers initially select data, the repository creates a cart file, similar to Amazon's website, that allows them to download the information.
"On our end, we're using the NIAGADS website to download and upload our datasets for access to the public," Vardarajan explained. "In that aspect, it's worked out really well for us, since the technology has a simple interface and is very easy to use."
Since the files the user downloads can be several terabytes of raw data, NIAGADS provides Aspera's Connect Server for high-speed file exchange, Vardarajan said.
Wang explained that the researchers at PNGC will also "harmonize" the data collected from outside studies, which includes sequencing, genotype, and phenotype data. He noted that the fastest and most efficient way to assemble such a large dataset is to collaborate with providers of a variety of specialized cohorts who are already collecting data on patients with Alzheimer's.
The key challenge working with outside cohorts, however, is that they "might have existing data generated with several different sequencing platforms using different workflows," Wang noted. A cohort's phenotype "might also be classified differently, and [the teams] have different ways to encode the information."
GCAD therefore reorganizes and reintegrates outside data into a single coding scheme. The researchers streamline the sequence data in an identical fashion — via re-mapping and recoding algorithms — to minimize batch effects, which can potentially lead to false-positive results. As outside groups send in samples, Wang's team processes and analyzes them immediately, storing results per individual sample in the database.
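As a rough illustration of what recoding phenotypes into a single scheme can involve, the Python sketch below maps two cohorts' case/control encodings onto one shared vocabulary. It is not GCAD's pipeline: the cohort names, the codes, and the functions `harmonize_status` and `harmonize_records` are all hypothetical and were invented for this example.

```python
# Hypothetical per-cohort encodings of Alzheimer's disease status.
# Cohort names and codes are invented for illustration only.
COHORT_CODEBOOKS = {
    "cohort_a": {"1": "case", "0": "control", "9": "unknown"},
    "cohort_b": {"AD": "case", "CN": "control", "NA": "unknown"},
}


def harmonize_status(cohort: str, raw_value: str) -> str:
    """Map a cohort-specific status code onto the shared scheme."""
    codebook = COHORT_CODEBOOKS[cohort]
    # Unrecognized codes become "unknown" rather than being guessed.
    return codebook.get(raw_value.strip(), "unknown")


def harmonize_records(records):
    """Return new records whose 'status' field uses the shared scheme."""
    return [
        {
            "sample_id": r["sample_id"],
            "cohort": r["cohort"],
            "status": harmonize_status(r["cohort"], r["status"]),
        }
        for r in records
    ]
```

Keeping one explicit codebook per cohort means every recoding decision can be reviewed, and unexpected values are flagged as unknown instead of being silently assigned to cases or controls.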
While Wang's group currently uses the Broad Institute's Genome Analysis Toolkit (GATK), he said that GCAD will also use the xAtlas caller developed by the Human Genome Sequencing Center at Baylor College of Medicine. By using both the GATK and xAtlas, Wang's team will be able to compare the two callers and obtain more robust genotype calls, he said.
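One simple way to compare two callers is to measure how often they agree at sites both of them report. The Python sketch below does that under stated assumptions: each caller's output has already been parsed into a dictionary keyed by (chromosome, position, reference allele, alternate allele) with a genotype string such as "0/1" as the value. The function `compare_calls` is hypothetical; it does not use any GATK or xAtlas interface and is not the method Wang's team describes.

```python
def compare_calls(calls_a, calls_b):
    """Summarize agreement between two callers' genotype calls.

    Each argument is a dict mapping (chrom, pos, ref, alt) to a
    genotype string such as "0/1". This input layout is an assumption.
    """
    shared = calls_a.keys() & calls_b.keys()
    # A shared site is concordant when both callers report the same genotype.
    concordant = {site for site in shared if calls_a[site] == calls_b[site]}
    return {
        "only_a": len(calls_a.keys() - calls_b.keys()),
        "only_b": len(calls_b.keys() - calls_a.keys()),
        "shared": len(shared),
        "concordant": len(concordant),
        # Concordance is undefined when the callers share no sites.
        "concordance": len(concordant) / len(shared) if shared else None,
    }
```

Sites reported by only one caller, or with conflicting genotypes, are natural candidates for closer review before a call is treated as reliable.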
Wang envisions the database's information being used for a wide variety of downstream clinical applications. He believes that research labs that do not necessarily use genetic analysis "but are interested in finding out the biology behind the [disease's] genetic variance could benefit from the information."
"As sequencing costs go down, we have more findings, and if investigators contribute more samples, we might even [have] more results," Wang said. He added that NIAGADS' pipeline is optimized to run on the Amazon Cloud but that the source code is available for researchers who want to use other cloud-based platforms or their local environment.
Vardarajan also argued that NIAGADS is especially important in terms of validity and reproducibility of research, as it makes the research and analysis process much more streamlined and efficient for outside cohorts.
Despite Vardarajan's familiarity with the system — since his team transfers its own data — he acknowledged that external and first-time users may have trouble exploring the system because of the enormous amount of raw data in the repository.
"One of the things that can be improved [is] the search function," he said. "If they make the search tool [for samples] more palatable for the user to choose their [specific] criteria for downloading the data, that would make [the repository] even more user friendly."
NIAGADS plans to release sequencing data on an additional 20,000 whole genomes by summer 2019. Wang's team is especially curious about inherited risk factors in patients. While he acknowledged that the database may not deliver true diagnostic results, he believes that NIAGADS may provide researchers with potentially useful biomarkers for future studies.