NEW YORK (GenomeWeb) – Members of the Exome Aggregation Consortium (ExAC), led by scientists at the Broad Institute of MIT and Harvard, have published findings from their analysis of exome sequencing data from 60,706 individuals of diverse ancestries.
Consortium members have already described some of the findings in other forums, but the current paper, which came out this week in Nature following initial publication on the bioRxiv preprint site, offers a fuller account of their two-year analysis, including its methods and key findings.
Specifically, the researchers report being able to directly observe mutational recurrence in the dataset. They also identified some 3,200 genes with fewer loss-of-function or missense mutations than would be expected, suggesting that such variants in these genes are likely harmful and are rare or absent in the population because of their detrimental effect on human health. The paper also includes analyses that showcase the value of the ExAC dataset for paring down lists of variants identified in clinical contexts.
Furthermore, two companion papers, published this week in Nature Genetics and Genetics in Medicine, describe separate studies that examine rare copy number variants in greater depth and highlight the value of using the ExAC data to assess rare variants in the context of a particular disease.
According to the Nature paper, the consortium's assessment of the data yielded over 7.4 million high-confidence genetic variants — about one variant for every eight base pairs — the vast majority of which are "extremely rare," Daniel MacArthur, senior author of the Nature study, co-director of the Broad's Medical and Population Genetics program, and an assistant professor at Massachusetts General Hospital and Harvard Medical School, told GenomeWeb this week.
According to the paper, most of these variants are rare, low-frequency changes that do not show up in smaller datasets generated by projects such as the National Heart, Lung, and Blood Institute's Exome Sequencing Project (ESP) and the 1000 Genomes Project, nor are they present in the dbSNP repository. Specifically, 99 percent of the variants have a frequency of less than one percent, and 72 percent are absent from both the 1000 Genomes and ESP datasets, according to the paper. Furthermore, 54 percent of these variants show up only once in the 60,000-person cohort. "If all those turn out to be new to this dataset, they've never actually been previously reported," MacArthur said.
The size of the ExAC dataset also made it possible to directly observe phenomena such as mutational recurrence, which, as the paper explains, occurs when the same mutation arises multiple times independently in different sequenced populations. For example, 43 percent of validated synonymous variants identified in a separate dataset of 1,756 parent-offspring trios from the Deciphering Developmental Disorders study and elsewhere were also present in the ExAC dataset. That percentage is much higher for transition variants at CpG sites: 87 percent of previously reported de novo CpG transitions in synonymous variants were also present in the ExAC dataset.
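As a rough illustration of what that overlap measurement looks like in practice, the following Python sketch computes what fraction of a set of de novo variant calls also appears in a large reference call set. The variant keys and both sets are hypothetical stand-ins for the trio calls and the ExAC sites, not real data.

```python
# Hypothetical de novo calls from parent-offspring trios, keyed by
# (chrom, pos, ref, alt), and a hypothetical stand-in for the ExAC site list.
de_novo = {("1", 1001, "C", "T"), ("3", 2045, "G", "A"), ("X", 909, "A", "G")}
reference_sites = {("1", 1001, "C", "T"), ("X", 909, "A", "G"), ("5", 77, "T", "C")}

# A de novo variant that also appears in the reference set arose at least
# twice independently, i.e., it is a recurrent mutation.
recurrent = de_novo & reference_sites
fraction = len(recurrent) / len(de_novo)
print(f"{len(recurrent)} of {len(de_novo)} de novo variants recur ({fraction:.0%})")
```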
The researchers were also able for the first time to look for genes that are missing particular kinds of variation, MacArthur said. For this analysis, they used a model to predict, for each gene, how many loss-of-function variants would be expected by chance, and then compared that number to the number of loss-of-function variants actually observed in that gene in the ExAC dataset.
"The difference between the observed and expected numbers basically gives us some measure of how many loss-of-function variants are being removed from the gene by natural selection, so in other words how many are actually harmful and damaging and therefore removed by natural selection," he explained. "The resulting list of [around] 3,000 genes that are very heavily anticipated for loss-of-function variation, we think, is a very compelling set of genes to go after in finding new disease-causing genes in a whole range of different diseases."
According to the paper, of 3,230 loss-of-function intolerant genes identified in the population with "near-complete depletion" of predicted protein-truncating variants, 72 percent currently have no reported disease phenotype in the Online Mendelian Inheritance in Man or ClinVar databases.
The consortium members also sought to quantitatively demonstrate that the ExAC database can be used to more effectively filter lists of candidate pathogenic variants. Since the dataset was first released in October 2014, the ExAC resource has been accessed more than five million times, in large part by researchers in clinical diagnostic laboratories who want to know how common the variants they find in rare disease patients are, MacArthur told GenomeWeb.
For example, researchers in the laboratory of Heidi Rehm, medical director of the Broad Institute's Clinical Research Sequencing Platform and chief laboratory director of the Laboratory for Molecular Medicine at Partners Personalized Medicine, use frequency information from the ExAC database to quickly rule out relatively common variants and home in on true disease-causing variants. It "gives us incredible insight when evaluating a patient's genome sequence in the clinic," she said in a statement.
Matthew Hurles, a researcher at the Wellcome Trust Sanger Institute, noted in a statement that "in our own research, using the ExAC resource has allowed us to apply novel statistical methods to identify several new severe developmental disorders. Resources such as ExAC exemplify the benefits that can be achieved for families coping with rare genetic diseases, as a result of the mass altruism of many research participants who allow their data to be aggregated and shared."
The paper also provides evidence of the ExAC data's value in helping researchers identify harmful variants in clinical contexts. Specifically, the researchers compared variant filtering using the ExAC dataset to filtering using data from NHLBI's ESP. They found that using the ExAC dataset to prioritize deleterious variants based on allele frequency and functional effect reduced the number of candidate variants sevenfold compared to the NHLBI dataset. The ESP dataset, in contrast, lacked sufficient power to filter variants at less than one percent allele frequency without removing many truly rare variants, according to results reported in the paper.
Specifically, when the researchers filtered variant data from 500 randomly chosen ExAC individuals using allele frequency information from ESP or the remaining ExAC cohort, they obtained an average of 154 variants for analysis using the ExAC data compared to a list of over 1,000 variants after filtering against the ESP. "We thus expect that ExAC will provide a very substantial boost in the power and accuracy of variant filtering in Mendelian disease projects," the researchers wrote.
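The filtering step itself is conceptually simple. A minimal Python sketch might look like the following, assuming reference allele frequencies have already been loaded into a lookup table; the variant keys and frequencies are illustrative, and the one percent cutoff simply mirrors the frequency threshold discussed above rather than any specific pipeline setting from the paper.

```python
AF_CUTOFF = 0.01  # one percent allele frequency threshold, as discussed above

# Hypothetical reference allele frequencies keyed by (chrom, pos, ref, alt).
exac_af = {
    ("1", 55516888, "G", "A"): 0.0002,   # rare in the reference panel
    ("2", 179446218, "T", "C"): 0.045,   # common in the reference panel
}

def is_candidate(variant, reference_af, cutoff=AF_CUTOFF):
    """Keep variants absent from the reference panel or rarer than the cutoff."""
    return reference_af.get(variant, 0.0) < cutoff

patient_variants = [
    ("1", 55516888, "G", "A"),   # rare: kept
    ("2", 179446218, "T", "C"),  # common: filtered out
    ("7", 117199644, "C", "T"),  # never seen in the panel: kept
]

filtered = [v for v in patient_variants if is_candidate(v, exac_af)]
print(f"{len(filtered)} of {len(patient_variants)} candidates remain after filtering")
```

The larger the reference panel, the more confidently a variant's absence or rarity can be interpreted, which is why ExAC outperformed the smaller ESP dataset in this comparison.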
Furthermore, their clinic-centric analysis also showed that more than 100 mutations previously reported to be disease-causing are more likely benign. Of 192 such variants analyzed as part of the study, only nine had sufficient data to support disease association, while 163 met American College of Medical Genetics criteria for reclassification as benign or likely benign. As of December last year, 126 of those 163 variants had been reclassified, according to the paper.
Two other studies published alongside the ExAC consortium's report provide further evidence of the dataset's value. One paper, published in Nature Genetics in collaboration with researchers at Mount Sinai School of Medicine, describes efforts to characterize the rates and properties of rare copy number variants in nearly 60,000 individuals included in the ExAC database. Researchers involved in that study showed that genes' intolerance to CNVs could be used to predict the likelihood that a given CNV is harmful. As part of that study, they analyzed data from more than 4,700 schizophrenia cases and just over 6,100 controls, reporting, among other findings, that genes affected by CNVs in schizophrenia cases showed higher CNV intolerance than those affected in controls.
Meanwhile, the Genetics in Medicine paper focused on efforts to reanalyze variant data from nearly 8,000 cardiomyopathy patients in the context of the ExAC datasets. According to the paper, the researchers found that for some genes previously reported to be important in cardiomyopathy cases, "rare variation is not clinically informative because there is an unacceptably high likelihood of false-positive interpretation." In contrast, for some other genes, "we find that diagnostic laboratories may be overly conservative when assessing variant pathogenicity."
Separately, members of the consortium published a third paper earlier this year in Science Translational Medicine that used the ExAC data along with information from the 23andMe database to assess the effects of variants in the prion protein (PRNP) gene on the risk of prion disease. That study revealed, among other findings, that missense variants in the PRNP gene previously reported to be pathogenic are at least 30 times more common in the population than expected.
In his conversation with GenomeWeb, MacArthur highlighted the collaborative nature of the project as well as the importance of open data. "It really emphasizes the critical value of large-scale data sharing," he said. "We actually were heavily reliant on collaborators who had sequenced exomes for their own disease-specific studies who then agreed to donate that data to our analysis group to build a resource for the community."
"It's pretty amazing to see more than $50 million worth of sequencing data basically contributed by more than 20 collaborators from around the world," he added. "We're hopeful that ExAC can serve as a model for data sharing and aggregation in other settings as well."
The researchers also benefited from computational resources, both hardware and software, provided by the Broad Institute, Monkol Lek, a research fellow in MGH's analytic and translational genetics unit and the first author on the Nature paper, told GenomeWeb. The consortium researchers used the HaplotypeCaller pipeline from the Broad's Genome Analysis Toolkit (GATK) to call single nucleotide variants and short insertions and deletions across the more than 60,000 samples included in the ExAC dataset. The group also applied various methods and techniques to improve the quality of the data for the analysis, he said. In total, the researchers assembled and processed about a petabyte of raw sequencing data from contributing consortia. "It was very challenging because it was the first time anyone's ever looked at data from this many people [to] know how that data should behave," Lek said.
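For context, a joint-calling workflow of the kind Lek describes, with per-sample HaplotypeCaller runs in GVCF mode followed by joint genotyping across samples, might be scripted along these lines. This is a hedged sketch using GATK 3.x-era command-line conventions; the file paths, sample list, and invocation details are illustrative, not the consortium's actual production pipeline.

```python
import subprocess

REFERENCE = "ref/human_g1k_v37.fasta"  # hypothetical reference FASTA path
samples = ["sample1", "sample2"]       # ExAC's real input spans >60,000 exomes

# Step 1: per-sample variant discovery, emitting a genomic VCF (gVCF)
# that records reference confidence as well as variant sites.
for s in samples:
    subprocess.run(
        ["java", "-jar", "GenomeAnalysisTK.jar",
         "-T", "HaplotypeCaller",
         "-R", REFERENCE,
         "-I", f"{s}.bam",
         "--emitRefConfidence", "GVCF",
         "-o", f"{s}.g.vcf.gz"],
        check=True)

# Step 2: joint genotyping across all gVCFs, producing a single
# project-wide call set of SNVs and short indels.
cmd = ["java", "-jar", "GenomeAnalysisTK.jar",
       "-T", "GenotypeGVCFs",
       "-R", REFERENCE,
       "-o", "joint_calls.vcf.gz"]
for s in samples:
    cmd += ["--variant", f"{s}.g.vcf.gz"]
subprocess.run(cmd, check=True)
```

The gVCF design is what makes a cohort of this size tractable: each sample is processed once, and only the lightweight per-sample summaries need to be combined for joint genotyping.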
Moving forward, the consortium researchers hope to double the sample size of the ExAC cohort. They plan to release a new call set that will include over 120,000 exomes, MacArthur told GenomeWeb. The current 60,000-person cohort has made "a huge difference in our ability to filter variants in particular diseases but it's not enough and … it's not diverse enough," he said. "We do have many populations represented but we are also missing huge swathes of human genetic variation and that means that there are still genetic variants that pop up in our rare disease patients that we still can't make sense of because we don't have those sample sizes yet."
For example, although the ExAC data represent a more ethnically diverse pool than either the 1000 Genomes or ESP efforts, Middle Eastern and African populations are underrepresented in the dataset, according to the paper. Furthermore, "although we have attempted to exclude severe pediatric diseases, the inclusion of both cases and controls for several polygenic disorders means that ExAC certainly contains disease-associated variants," the researchers wrote. Moreover, most of the ExAC samples lack the detailed phenotype data that are crucial for understanding the biological and clinical context of these variants, the researchers wrote.
Also, with 60,000 samples, "we can look at loss-of-function intolerance for most of the genes in the human genome," MacArthur said. That is true for about two-thirds of genes, but the remaining third are too small, he explained, so "we would need larger sample sizes to actually make any kind of statistical assessment."
The consortium members also plan to release results from an analysis of variant distribution in non-coding regions, MacArthur said. That project will aggregate and analyze data from over 15,000 whole genomes, and the genome data will be released as well. "That will be [a] first look at very big frequency variation within non-coding regions as well as protein-coding regions," he said.
Also, consortium members will soon begin running pilot projects focused on connecting clinical and genetic data. "As part of building a resource for genomic medicine, [we] need to find those individuals that carry a particular variant in these large datasets and recall them to figure out what type of phenotypes they actually carry," he said. "That linkage between the genetic data and the clinical data is something that because of the way that ExAC was built … [is] very difficult to do." Moreover, as large-scale projects like the US Precision Medicine Initiative begin to move forward, "we'll start to see a lot more projects in that general space of pulling together both genetic and clinical data from large sets of individuals," he added.