CHICAGO (GenomeWeb) – Three years after the debut of the Exome Aggregation Consortium, a massive data set of exome variant calls, its successor, the Genome Aggregation Database (GnomAD), has become ubiquitous as a reference among clinicians and researchers studying genetic variation.
"Basically every clinical lab around the US now uses ExAC and GnomAD as their standard reference databases," said Daniel MacArthur, codirector of medical and population genetics at the Broad Institute, and creator of both tools. "It's now being used really heavily in research as well."
ExAC, which debuted at the 2014 American Society of Human Genetics meeting, contains analysis of variant calls across an aggregated collection of 60,706 human exomes, far larger than anything that came before. Combined, the ExAC and GnomAD sites have registered more than 11 million page views and get about 10,000 hits per day, MacArthur said.
MacArthur started his laboratory in March 2012 at the Broad and at Massachusetts General Hospital, where he is a group leader in the Analytic and Translational Genetics Unit. Around that time, the hospital was just beginning to sequence exomes of patients with rare diseases, particularly children with muscular dystrophies and myopathies.
"We knew that in most of the cases, these were diseases that would be caused by one or maybe two variants that would be found in their DNA," MacArthur said.
The way to find these variants is to sequence exomes of patients and close relatives and look for variants "that are extraordinarily rare in the general population," he noted. "In order to make sense of those variants we were finding in our patients, we needed to be able to put those into context of very, very large numbers of sequenced individuals from the general population," MacArthur said. Ideally, this would take tens of thousands of people.
Back then, there were a couple of resources to help scientists interpret such variants: the 1000 Genomes Project, with sequencing data on about 2,500 people, and the Exome Sequencing Project, which had data on another 6,500. "Those were definitely valuable," MacArthur said, but they were inadequate for understanding outliers on the variant spectrum, and were not racially and ethnically diverse enough for what Mass General needed.
Thus, ExAC was born, though it took a year and a half to develop. "We had learned how to generate variant calls at scale. We now could perform variant calling very quickly and cheaply and accurately across tens of thousands of samples," MacArthur said.
GnomAD superseded ExAC in October 2016 when MacArthur's group released an update to the core set with 126,216 exomes and 15,136 whole-genome sequences. "In GnomAD, you can look at variations, not just in the protein-coding bits of the genome, but also in the non-coding bits," he said.
"I was aware that this was going to be a useful resource for people beyond my lab," MacArthur said. "We thought about building this originally mostly to facilitate our own rare-disease research, but as we started putting it together, we just had this enormous interest from people who just wanted to get access to it, so we made sure we made it available, as openly as possible."
There was no embargo, for example, for people studying individual variants. "People were using the data very heavily right from the very beginning," MacArthur reported. "One thing I'm really proud of in the context of ExAC and GnomAD is the way we've been able to push the data out so quickly and openly."
MacArthur's lab got the GnomAD call sets together about a month before the launch at ASHG 2016. "We just worked like crazy to clean up that data set and get it released, and we pushed it out to the world pretty much as soon as we felt it was ready to go out there, or possibly even slightly before," MacArthur said. This meant that others did not have to wait for Broad to analyze the data and publish scientific literature to take advantage of the resource.
"That is a relatively new approach to doing science," MacArthur said. "But I think it is the approach that really empowers the scientific community the most. It's not only that we believe that open science is a good way of doing things, but that we have now a community of 107 principal investigators who have allowed their data to be used as part of ExAC and GnomAD who agree with us."
The GnomAD team now is working on a new core set focused on whole genomes, with the goal of releasing data from about 65,000 whole genomes, possibly within a few weeks. "This will be the largest collection of human genomes that's ever been put together in one place," MacArthur said.
In the first half of 2018, the Genome Aggregation Database is planning to release the next exome core set, with about 250,000 exomes, or nearly double the current number of samples. "We hope that will put us in a position where we can really deeply understand the impact of variation across human protein-coding genes," MacArthur said.
The Broad will continue running the same type of analysis of the core sets. "We'll be making that data available as quickly and openly as possible," MacArthur said.
Meanwhile, he has high aspirations for the evolution of GnomAD.
As codirector of the Broad Center for Mendelian Genomics, which sequences a couple of thousand families annually, he has a particular interest in improving the diagnostic yield for rare diseases.
"At the moment, we can only diagnose somewhere between 30 and 40 percent of families, so we need to do better than that. Building larger versions of GnomAD will help, but we also need to get much better at building statistical frameworks for using them," MacArthur said.
Broad has been working with various collaborators to "develop and deploy those frameworks," he said.
MacArthur has ideas for where he wants the science to be in five years. When a patient presents with undiagnosed symptoms that could indicate a rare disease, he expects the clinician to order sequencing and the resulting report to contain three key pieces of information.
"First, has that variant ever been seen before in large population databases like GnomAD? Our goal is to build GnomAD out in such a way that it is well-validated and serves as a default clinical reference data set," MacArthur said.
"The second thing is: Has that variant ever been seen before in patients?" Databases such as ClinVar collect information on variants found in patients with specific diseases, but they can be better, MacArthur said. He noted that ExAC and GnomAD can improve the quality of such collections by identifying variants listed in ClinVar that might be too common to cause disease.
"The third bit, that we are not invested in, but that needs to happen as well, is that you need to be able to say, 'For this variant in this particular gene, if you do an assay of that gene's function, does this variant actually change that gene's function?'" MacArthur continued.
"If you have those three classes of information (population, patient frequency and specific clinical phenotypes associated with that variant, and functional evidence), once we have that data for all variants across the genome, we basically can convert this variant interpretation process away from the kind of dark art that it is now into an evidence-based science."
MacArthur envisions a formal statistical framework that lists a confidence level of disease association based on what is known about each variant. "Once we're there, then clinical genetics really becomes a true data science," he said. "Our goal is to build resources as large and as clean as possible to empower that future, to make that as easy as possible to build."