CHICAGO – The National Human Genome Research Institute's (NHGRI) popular Genotype-Tissue Expression (GTEx) dataset is now available for free download through a cloud-based platform, potentially saving researchers as much as $14,000 in access and storage costs per download. The move, according to its backers, promises to democratize use of the largest existing compendium of human gene expression and corresponding trait loci, bringing even the smallest institutions into the fold.
Last month, NHGRI issued version 8 of GTEx, the first "free" release on the Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) platform. NHGRI established AnVIL in late 2018 to create a cloud-based environment for working with the GTEx dataset, which includes genotype data from 838 donors plus 17,382 RNA-seq samples across 54 tissue sites and two cell lines.
The National Institutes of Health Common Fund established GTEx in 2010 as a 10-year multi-institutional research effort to present a comprehensive atlas of genetic regulatory variation across cell types and tissues and an analysis of how these changes in regulation can contribute to risk for disease and the development of traits. The research concluded in September, but the dataset endures to assist outside scientists.
Michael Schatz, program director for the AnVIL platform, said that GTEx is the "most highly requested" dataset throughout the entire National Institutes of Health.
"We started to take it for granted that you can just go to the web at any time and download it, but in reality, there's a lot of infrastructure costs," said Schatz, an associate professor of computer science and biology at Johns Hopkins University. The full GTEx dataset contains about 40,000 individual files and requires about 150 terabytes of storage.
"If you want the whole collection, it's going to take realistically several days to download," Schatz said. "Researchers constantly streaming those data would consume all of [the National Center for Biotechnology Information's] bandwidth. There's just a lot of overhead with that."
The AnVIL project moved GTEx data exclusively to the cloud, taking advantage of a 2015 NIH policy change that allowed investigators to request permission to move Database of Genotypes and Phenotypes (dbGaP) genomic and associated phenotype data from agency repositories to public or private cloud systems for data storage and analysis.
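To put the "several days" download estimate in perspective, the transfer time for the full collection can be worked out from the 150-terabyte figure. The sketch below is a back-of-the-envelope calculation, not a measurement. The link speeds (1 and 10 gigabits per second), the `efficiency` parameter, and the function name `download_days` are illustrative assumptions, and it uses decimal units (1 terabyte = 10^12 bytes).

```python
SECONDS_PER_DAY = 86_400


def download_days(size_terabytes: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimate the days needed to transfer size_terabytes over a link.

    link_gbps is the nominal link speed in gigabits per second, and
    efficiency is the fraction of that speed actually sustained (an assumption).
    """
    if size_terabytes < 0:
        raise ValueError("size_terabytes must be non-negative")
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")

    total_bits = size_terabytes * 1e12 * 8            # terabytes -> bytes -> bits
    bits_per_second = link_gbps * 1e9 * efficiency    # effective throughput
    return total_bits / bits_per_second / SECONDS_PER_DAY


GTEX_SIZE_TB = 150  # full dataset size cited in the article

# Assumed link speeds, chosen only for illustration.
for gbps in (1, 10):
    days = download_days(GTEX_SIZE_TB, gbps)
    print(f"{gbps} Gbps: about {days:.1f} days")
```

Under these assumptions, an ideal sustained 10 Gbps connection would need about 1.4 days and a 1 Gbps connection about 13.9 days, before accounting for protocol overhead or bandwidth shared with other users.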
"We foresee that in the long term, more users will choose to perform the analysis of GTEx and other large datasets directly within AnVIL's cloud environment," the AnVIL team wrote on their website in November. "AnVIL offers an elastic, shared computing resource, with active threat detection and monitoring, that provides an increasingly attractive alternative to redundantly downloaded data amongst siloed compute infrastructure."
"This new capability will have far-reaching implications with substantial cost savings as well as greatly expanded democratization of these data," Schatz said.
The AnVIL platform is a federated system with multiple components that permit several types of analysis to take place in the cloud, giving researchers a single place to build cohorts from several NHGRI datasets, including those from the Centers for Common Disease Genomics, the Centers for Mendelian Genomics, and the Electronic Medical Records and Genomics (eMERGE) network.
It supports Terra — a cloud-based bioinformatics analysis platform codeveloped by the Broad Institute and Google sibling Verily Life Sciences — as well as the Dockstore workflow-sharing platform from the Cancer Genome Collaboratory.
Data in AnVIL is organized in Gen3, a data commons framework developed by University of Chicago bioinformatician Robert Grossman, and UChicago has joined Google Cloud as a host for GTEx data, according to Schatz. Gen3 also supports ingestion and querying of the information, including the metadata attached to the records. This allows users to access only the portions of the massive dataset that interest them.
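The idea of selecting files by their metadata, for example by tissue or donor sex, can be illustrated with a small sketch. This is not the Gen3 or AnVIL API. The `SampleFile` record, its field names, the file names, and the sizes are all hypothetical, and the code only shows the general pattern of filtering a file manifest before downloading anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleFile:
    """One hypothetical manifest entry: a file plus its attached metadata."""
    file_name: str
    tissue: str
    sex: str
    data_type: str
    size_gb: float


def select_files(manifest, **criteria):
    """Return the entries whose metadata match every key=value criterion."""
    return [
        record
        for record in manifest
        if all(getattr(record, key) == value for key, value in criteria.items())
    ]


def total_size_gb(records):
    """Sum the sizes of the selected files, in gigabytes."""
    return sum(record.size_gb for record in records)


# Made-up manifest for illustration only; these are not real GTEx files.
MANIFEST = [
    SampleFile("sample_a.cram", "Liver", "female", "RNA-seq", 4.0),
    SampleFile("sample_b.cram", "Liver", "male", "RNA-seq", 5.0),
    SampleFile("sample_c.cram", "Lung", "female", "RNA-seq", 3.5),
    SampleFile("sample_d.cram", "Lung", "male", "RNA-seq", 6.0),
]

liver_files = select_files(MANIFEST, tissue="Liver")
print([record.file_name for record in liver_files], total_size_gb(liver_files))
```

Filtering the manifest first means only the matching files, here two of the four, would need to be transferred, rather than the whole collection.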
"Thanks to Grossman's team, you have this amazing capability where you can pick and choose: 'I want gene expression data from a particular tissue,' or, 'I want it from people of a particular gender,'" Schatz explained. "You can slice and dice up the GTEx dataset however you might be interested and then just extract and download just those data."
The AnVIL deployment supports the open-source Bioconductor and Galaxy toolsets, which offer thousands of tools for computational and statistical modeling and data analysis. Users also can access the data through notebook environments such as Jupyter and RStudio.
"Our goal is to provide anything you might do on a local computing infrastructure, be it your laptop or your institutional data center, but do it at scale in the cloud," Schatz said. "We don't have to have data centers all over. We can have harmonized data access, harmonized data security, and harmonized capability."
The data is available within AnVIL for analysis in the cloud. "If you want to do variant calling, if you want to do association studies, polygenic risk scores, or expression analysis … anything at scale that you might want to do can be done now in the cloud in AnVIL," Schatz said.
But GTEx still can be downloaded for those who need the information on local infrastructure.
"There are use cases where it's important to download, especially if you want to integrate your own patient data with GTEx data and do a cross-analysis," Schatz said. "It may not be possible to upload patient data into the cloud, so now you can download it to your institution."
Kaur Alasoo, genetics group leader in the University of Tartu Institute of Computer Science in Estonia, has been testing the free GTEx download capability as he and former colleagues at the European Bioinformatics Institute build a database of expression quantitative trait loci (eQTL) called eQTL Catalogue.
"To ensure that the results are directly comparable between studies, we have chosen to download the raw [GTEx] data from individual studies and reprocess them with exactly the same data analysis workflows," Alasoo said by email. His team has been integrating GTEx into its database since July, but only recently has he been able to access version 8, by far the largest individual dataset in the eQTL Catalogue.
Because Alasoo only needed RNA-seq data and genotype calls from GTEx, he said that the free access via AnVIL has probably saved $8,000.
"We postponed the decision to download as long as possible while we considered other options, but all of them would have involved a significant investment of time, money, or both," Alasoo said. He was about to spend the money in September when he caught wind of the pending release of AnVIL's free option, so he decided to wait a bit.
Alasoo said that his team is reanalyzing GTEx version 8 with the same workflows they have used for about 20 other datasets. "This means that our users will be able to directly contrast the eQTL effect size from these 20 other datasets against GTEx v.8 without worrying about technical differences between datasets," he said.
He said that having all these datasets in one place will simplify future cross-analysis of GTEx information with overlapping tissues and cell types, helping to increase sample sizes.
According to Schatz, release of GTEx version 9 is "imminent," a development that will add to the collection of whole-genome sequences, and a version 10 also is in the works for 2021.
Schatz said that he is interested in adding the capability of downloading subsets of genomes for those not performing genome-wide analyses. "Right now, you'd have to download the whole CRAM file and then once downloaded, you could suss out a region of interest," he said.
He said that the AnVIL project is at a "transition point" now that it has moved into the third year of a five-year funding award from NHGRI, because so many research institutions are moving from local computing infrastructure to cloud environments to support massive NIH datasets from the likes of All of Us and Trans-Omics for Precision Medicine (TOPMed).
"It's just totally impractical to do that sort of analysis on local infrastructure, so I think cloud computing is going to become increasingly important and increasingly used, especially for these large genome analyses," Schatz said.