NEW YORK – The Genome Aggregation Database, known as gnomAD, has generated a large dataset of human genetic variation that it says may fuel improved diagnoses and aid in drug development.
The 1000 Genomes Project, launched in 2008, aimed to generate a catalog of common and less common genetic mutations found among people, though the effort was limited in scope and diversity. It was followed by the Exome Aggregation Consortium (ExAC) project, which focused on describing genetic variation within protein-coding regions. With gnomAD, researchers have built on those efforts to compile a collection of more than 125,000 exomes and 15,000 genomes from individuals from populations across the world.
With this boost in sample size, researchers are better able to examine rare genetic variation, including loss-of-function variants, and the inclusion of whole-genome sequencing data will enable the analysis of variation that falls outside the protein-coding region that could still influence disease. In a series of papers appearing on Wednesday in Nature, Nature Communications, and Nature Medicine, the gnomAD team showed how the dataset could be used to improve rare variant interpretations, evaluate potential drug targets, and more.
"These studies represent the first significant wave of discovery to come out of the gnomAD Consortium," the Broad Institute's Daniel MacArthur, scientific lead of the gnomAD project, said in a statement. "The power of this database comes from its sheer size and population diversity."
GnomAD encompasses 125,748 exomes and 15,708 genomes from individuals from six global and eight sub-continental ancestries. In particular, gnomAD includes genomes and exomes from more than 25,000 people of East and South Asian ancestry, about 18,000 Latinos, and about 12,000 individuals of African or African-American ancestry.
The datasets were largely obtained from case-control studies of adult-onset diseases and each underwent uniform processing. In all, the researchers uncovered 17.2 million variants within the exome dataset and 261.9 million variants within the genome dataset.
For their analysis appearing in Nature, Konrad Karczewski from the Broad and Massachusetts General Hospital and his colleagues focused on variants predicted to affect the function of protein-coding genes. In all, they identified 443,769 predicted loss-of-function variants — more than had previously been described.
Using a metric they developed called loss-of-function observed/expected upper bound fraction (LOEUF), the researchers classified genes based on how well they tolerate being inactivated. Then, with data from a separate cohort, they found that rare variants in intolerant genes were more likely to be found among people with intellectual disability than among those without.
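As a rough illustration of the idea behind an observed/expected upper bound, the following Python sketch compares the number of loss-of-function variants seen in a gene with the number expected and reports the upper end of a confidence interval on that ratio. It assumes the observed count follows a Poisson distribution and uses a one-sided 5 percent tail (the top of a 90 percent interval). Those choices, along with the function names and the example counts, are assumptions made for illustration and are not taken from the gnomAD papers, whose actual procedure may differ.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """Return P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def oe_upper_bound(observed: int, expected: float, alpha: float = 0.05) -> float:
    """Upper confidence bound on the observed/expected ratio.

    Hypothetical helper: finds the largest Poisson mean still compatible
    with seeing `observed` or fewer variants at tail probability `alpha`,
    then divides by `expected`. A low value suggests the gene is depleted
    of loss-of-function variants, i.e. intolerant to inactivation.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    lo, hi = float(observed), max(1.0, 2.0 * observed)
    # Widen the bracket until the tail probability drops below alpha.
    while poisson_cdf(observed, hi) > alpha:
        hi *= 2.0
    # Bisect: the CDF at a fixed k decreases as the mean grows.
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if poisson_cdf(observed, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi / expected


if __name__ == "__main__":
    # Made-up counts: a strongly depleted gene and an undepleted one.
    print(oe_upper_bound(observed=2, expected=50.0))
    print(oe_upper_bound(observed=40, expected=50.0))
```

Using the upper bound rather than the raw ratio means that a small gene with few expected variants is not labeled intolerant simply because none happened to be observed.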
"The gnomAD catalog gives us our best look so far at the spectrum of genes' sensitivity to variation and provides a resource to support gene discovery in common and rare disease," Karczewski said in a statement.
But loss-of-function variants in intolerant genes can also be found among individuals who have no outward effects. In another paper, the Broad's Beryl Cummings and her colleagues found that alternative mRNA splicing may account for such unexpected variants, and they further developed a metric dubbed proportion expressed across transcripts, or pext, to quantify isoform expression.
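The intuition behind a metric like pext can be sketched in a few lines of Python: if a variant only disrupts transcripts that are barely expressed, it may have little effect on the gene overall. The sketch below computes, for a single tissue, the share of a gene's total transcript expression that comes from transcripts affected by a variant. The function name, the input layout, and the expression values are hypothetical, and the published metric may involve additional steps, such as combining data across tissues, that are not shown here.

```python
def proportion_expressed(transcript_expression: dict[str, float],
                         affected_transcripts: set[str]) -> float:
    """Share of a gene's expression carried by the affected transcripts.

    transcript_expression maps transcript IDs to expression levels
    (for example TPM) in one tissue; affected_transcripts lists the
    transcripts in which the variant is predicted to be disruptive.
    """
    total = sum(transcript_expression.values())
    if total <= 0:
        raise ValueError("gene has no measurable expression")
    affected = sum(
        level
        for transcript, level in transcript_expression.items()
        if transcript in affected_transcripts
    )
    return affected / total


if __name__ == "__main__":
    # Hypothetical gene with one dominant and one minor isoform.
    expression = {"transcript_A": 90.0, "transcript_B": 10.0}
    # A variant that only hits the minor isoform scores low.
    print(proportion_expressed(expression, {"transcript_B"}))
```

A value near zero would flag a predicted loss-of-function variant that sits in a rarely used isoform, which is one way such a variant could appear in a person with no outward effects.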
Naturally occurring loss-of-function variants can also be used to help assess the effects of drugs that target those genes, the Broad's Eric Minikel and his colleagues reported in Nature. By studying individuals with two predicted loss-of-function variants affecting the same gene, scientists can gauge what the effects of inactivating that gene through a therapeutic intervention might be.
Additionally, as they reported in Nature Medicine, researchers led by Imperial College London's Nicola Whiffin identified 1,455 individuals with predicted loss-of-function variants within the Parkinson's disease risk-linked gene LRRK2 from within the gnomAD, UK Biobank, and 23andMe cohorts. They estimated that about 1 in 500 people are heterozygous for a predicted loss-of-function variant in the gene, which leads to a decrease in LRRK2 protein levels but has no as-yet noted effect on survival or health.
Meanwhile, using the whole-genome sequencing data from the gnomAD dataset, researchers led by the Broad's Ryan Collins identified more than 433,000 structural variants, DNA rearrangements that affect stretches of DNA that are more than 50 nucleotides in length. They found that about a quarter of rare, protein-truncating events are structural variants.
Additionally, they found that a portion of people carry structural variants that are expected to be harmful, but don't have the expected phenotype or clinical outcome, and that genes that were sensitive to deletion were often also sensitive to being duplicated.
At the same time, another team reported in Nature Communications that they were able to identify nearly 1.8 million multi-nucleotide variants, which are clusters of nearby variants, as well as 31,575 that fall within a codon.
Another paper, also appearing in Nature Communications, used gnomAD data to examine variants that affect or disrupt upstream open reading frames, finding that such variants may be an under-recognized source of variation contributing to disease.
"The gnomAD resource, like ExAC before it, will change how we interpret individual genomes," Inscripta's Deanna Church wrote in a related commentary. "The consortium's work has revealed how much information about human variation we had been missing and has provided tools that help us to better understand the genome at both the population and individual level."
She noted, though, that despite the size of the gnomAD dataset, the cohort is not always large enough for certain analyses.
"[W]e are very far from saturating discoveries or solving variant interpretation," MacArthur, who is now at the Garvan Institute of Medical Research and Murdoch Children's Research Institute in Australia, added. "The next steps for the consortium will be focused on increasing the size and population diversity of these resources, and linking the resulting massive-scale genetic data sets with clinical information."