
NEW YORK — Following the release of nearly 100,000 whole genome sequences by the National Institutes of Health's All of Us Research Program last week, researchers can now embark on studies that make use of the new resource.
Nearly half of the data is from individuals who identify as belonging to a racial or ethnic group that is underrepresented in research. Previous large-scale genomic studies have relied heavily on populations of European ancestry. By some estimates, more than 90 percent of large genomic studies have focused on Europeans, which calls into question whether findings from those studies are generalizable to other populations. This new dataset, which includes not only sequencing but also array data and linked electronic health and survey data, aims to be more representative of the US population.
With the data release, All of Us is now part of a select group of large-scale genomic research efforts, including the UK Biobank, the Million Veteran Program, and the NIH’s Trans-Omics for Precision Medicine (TOPMed) program.
Researchers can register for access to the sequencing and other data through a cloud-based platform, called the All of Us Researcher Workbench, and there are already hundreds of active projects in the platform directory, according to Andrea Ramirez, chief data officer for the All of Us program.
"We hope many people come in and really utilize the diversity of our dataset," Ramirez said.
The program began recruiting participants in May 2018 with the goal of enrolling 1 million individuals. A consortium of partners, including Baylor College of Medicine, the Broad Institute, and the Northwest Genomics Center at the University of Washington, has aided the program by performing the genome sequencing. Currently, more than 326,000 individuals have consented to take part in the program, answered three surveys, provided physical measurements, and given at least one biological specimen.
The new data release includes whole-genome sequencing data on nearly 100,000 individuals as well as genotyping array data on 165,000 participants, about half of whom belong to a racial or ethnic group that is underrepresented in research.
The dataset additionally includes physical measurements such as height, weight, and blood pressure as well as data from surveys that asked, for instance, about participants' demographics, lifestyles, and general health. It also encompasses Fitbit data from some participants and ties into data from the American Community Survey, which is conducted by the US Census Bureau, to provide more context about participants' communities.
US-based researchers can access this data through the workbench's controlled access tier. In addition, other safeguards are in place to protect participants' privacy. For example, All of Us has stripped the data of certain participant identifiers to lower the risk that participants could be re-identified.
To register for access, researchers' universities or institutions must first have a data use registration agreement in place, and more than 300 organizations already do. According to Ramirez, if a researcher's institution already has an agreement, they can get access to the All of Us data within about two hours, following a short training session on responsible data use.
The All of Us Researcher Workbench provides cloud-based access, so researchers do not need their own computing cluster to use the data, as would be necessary for take-home datasets, Ramirez said. She added that the program uses Observational Medical Outcomes Partnership infrastructure to standardize data across the different data types.
Access to the workbench is free, but researchers may incur data storage and computation charges through the Google Cloud platform. Upon signing up, researchers receive a $300 credit.
Ramirez noted that the data release has been available for only a few days, but that there are already hundreds of registered studies. These include ones looking into genes associated with vascular disease, genetic risk of sepsis, and genetic susceptibility to COVID-19.
Prior to the release, she noted, project researchers "kicked the tires" of the dataset to test its utility and validate it. They made sure it could, for instance, replicate known findings, such as a previously identified gene signature of genome-wide lipid signals, and that it could be applied to new questions.
This also enabled them to provide other researchers using the workbench with code that can be easily adapted to their own research questions, so that they can focus on answering those questions, she said.
The program expects to have new data releases about twice a year, Ramirez said. The All of Us Research Program is also expected to be fully enrolled by the end of 2026.