VANCOUVER, British Columbia (GenomeWeb) – Building on the success of the Exome Aggregation Consortium (ExAC) dataset, members of the same research team have established a collection that contains roughly twice as many exomes as the version of ExAC released to the public two years ago, analyzed alongside more than 15,000 whole-genome sequences.
The Broad Institute's Daniel MacArthur introduced the resource, known as gnomAD, at the American Society of Human Genetics meeting today. MacArthur noted that more than 5,000 principal investigators provided exome and genome data for gnomAD, which has now been released publicly. The dataset currently includes information on 126,216 exomes and 15,136 whole-genome sequences.
ExAC was established to help overcome some of the challenges that researchers have faced when trying to tap into variant data from the massive amounts of genome and exome sequence generated around the world, MacArthur explained. Those challenges range from issues related to informed consent or regulatory constraints to subtle differences in the pipelines used to call variants. Since its launch in October 2014, the ExAC site has been viewed nearly 6 million times. Variant data gleaned from the collection has been used by investigators focused on understanding features found in protein-coding regions of the human genome, as well as by those filtering variants to home in on disease- or trait-related variants. Details on the ExAC resource and its applications were published earlier this year in Nature.
For the new gnomAD collection, MacArthur and his colleagues called variants in the available exomes and genomes separately, using consistent variant calling processes, but ultimately analyzed the sequences together. So far, they have identified nearly 18 million variants in the expanded set of exome sequences, including 7.5 million variants not described previously. The whole-genome sequence data has yielded more than 254 million variants, almost 160 million of which are novel.
Along with the variant coverage available in gnomAD, MacArthur touted the diversity of the dataset, which represents individuals from a wide range of ancestry groups and includes sequences for some 5,000 individuals of Ashkenazi Jewish descent.
MacArthur cautioned, however, that the gnomAD website is currently in beta, and he urged users to report any unusual variant calls or bugs to the team as it continues to improve the site. The group plans to finalize quality control and variant filtering for the dataset shortly and will release non-coding variant information from gnomAD in the coming weeks. Protein-coding variants from the collection are already available.
There are no restrictions on the use of the gnomAD data or on publications stemming from analyses of it, MacArthur said. But he urged those intent on doing large-scale analyses with the data to contact the team beforehand to avoid duplicating the efforts of other groups.