NEW YORK (GenomeWeb) – In 2014, researchers from the Broad Institute of MIT and Harvard and Massachusetts General Hospital published a list of roughly 10 million genetic variants in a database that they named the Exome Aggregation Consortium, or ExAC, database. The study, published today in Nature, describes their procedures for collecting data for ExAC and identifies 3,200 genes that the researchers believe are likely involved in the development of genetic disease in humans.
"As my lab discovered four years ago when we first started sequencing rare disease patients, and as many labs around the world have found, a key challenge in analyzing the exome sequencing data from patients is that everyone carries tens of thousands of genetic changes," senior author Daniel MacArthur said in a press briefing. MacArthur is the co-director of medical and population genetics at the Broad Institute, and an assistant professor at Massachusetts General Hospital and Harvard Medical School. Researchers and clinicians need databases that can tell them which genetic changes found in a patient are also seen in healthy people and how common those changes are so that researchers can identify genetic changes that are actually causal for a patient's disease, he added.
MacArthur also noted that the work highlights the importance of data sharing, since this project would not have been possible without the data contributions from more than 20 different research groups and three dozen principal investigators.
The researchers collected exome sequencing data from European, African American, East Asian, South Asian, and Latino individuals. They ran the raw data through a new version of the Genome Analysis Toolkit HaplotypeCaller pipeline, a processing pipeline developed by the Broad Institute, and then produced a single set of variant calls that was consistent across all 60,702 samples.
From these variant calls, the researchers produced the summary file that was made publicly available in 2014 through an open-access website. They also reported that since its publication, the resource has been used more than five million times by researchers around the world. "Its primary use is in the interpretation of genetic changes found in rare disease patients, and now virtually all clinical diagnostic labs now use the ExAC resource as their standard resource database for diagnosis of rare disease patients," MacArthur said in the press briefing.
In their new study, the researchers filtered and analyzed the data in the ExAC database to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation.
MacArthur explained that the "healthy" individuals in the database help researchers identify genes that are intolerant to variation, to more easily narrow down the pool of genes that are more likely to cause genetic diseases such as muscular dystrophy or epilepsy.
"In total, we were able to use this resource to zoom in on a set of just over 3,000 genes that are extremely likely to be involved in diseases," MacArthur said, adding, however, that there is no clear link to specific diseases for more than two thirds of those genes.
Additionally, MacArthur and his team found that nearly 200 of the reported genetic variants that had previously been labelled as disease causing are too common in the ExAC database to be linked to disease. "We show that they must actually be harmless variants that have ended up in these databases through error," he said. "It basically allows us to use this resource to correct some of the errors that have crept into these databases."
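The reasoning behind that correction can be illustrated with a short sketch: a variant that is claimed to cause a rare disease should not be common among the reference individuals. The Python below is only an illustration of that idea, not the method used in the study. The `Variant` record, the variant names, the counts, and the 1 percent cutoff (`max_af`) are all hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # All fields are hypothetical and exist only for this illustration.
    variant_id: str     # made-up label, not a real variant identifier
    allele_count: int   # copies of the variant allele seen in the reference panel
    allele_number: int  # total chromosomes with a genotype call at this site


def allele_frequency(v: Variant) -> float:
    """Return the fraction of called chromosomes that carry the variant."""
    if v.allele_number == 0:
        # No calls at this site, so no evidence that the variant is common.
        return 0.0
    return v.allele_count / v.allele_number


def too_common_for_rare_disease(variants, max_af=0.01):
    """Return IDs of variants whose frequency exceeds max_af.

    max_af is an assumed cutoff for illustration, not a value from the study.
    """
    return [v.variant_id for v in variants if allele_frequency(v) > max_af]


# Made-up example data.
variants = [
    Variant("var_A", allele_count=3, allele_number=120_000),
    Variant("var_B", allele_count=2_400, allele_number=120_000),
    Variant("var_C", allele_count=0, allele_number=0),
]

flagged = too_common_for_rare_disease(variants)
print(flagged)
```

In this made-up data, only `var_B` is carried on 2 percent of chromosomes, so it is the only one flagged as too common to plausibly cause a rare disease. In practice the appropriate cutoff would depend on how common the disease is and how it is inherited.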
The researchers did note that while the ExAC database is nearly ten times larger than previous resources and contains quite a lot of diversity, it is not yet representative of the global population. However, MacArthur noted, the list of genes that they identified as likely contributing to genetic disease can now be prioritized in downstream studies.