NEW YORK – Researchers from the Trans-Omics for Precision Medicine (TOPMed) program have detected hundreds of millions of variants, many of them novel, in the genomes of more than 50,000 individuals, creating a resource to help elucidate the genetic architecture of heart, lung, blood, and sleep disorders in order to improve diagnosis, treatment, and prevention.
The initial phase of the program, which was started by the National Heart, Lung, and Blood Institute in 2014 and included almost 54,000 samples from participants in 33 NHLBI-funded research projects, focused on sequencing the genomes of individuals from diverse backgrounds. Its first dataset, posted as a preprint a year ago and published in Nature on Wednesday, includes more than 410 million variants, 78.7 percent of which had not been described before.
Some of these novel variants were detected through the assembly of unmapped reads and customized analysis in highly variable loci, the researchers said. Ninety-seven percent of the 410 million detected variants had frequencies of less than 1 percent, and 46 percent were singletons that were present in only one individual.
"These rare variants provide insights into mutational processes and recent human evolutionary history," the authors wrote. "The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation."
Among their findings, the researchers noted that the fraction of singletons in each region or class of sites closely tracked functional constraints. For example, among all 4,651,453 protein-coding variants in unrelated individuals, the proportion of singletons was the highest for frameshift variants, still high among putative splice and truncation variants, intermediate among nonsynonymous variants, and lowest among synonymous variants. Beyond protein-coding sequences, they found increased proportions of singletons in promoters, 5' untranslated regions, regions of open chromatin, and 3' untranslated regions, and lower proportions of singletons in intergenic regions.
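As a rough illustration of the singleton comparison described above, the proportion of singletons in each functional class can be computed by counting the variants whose alternate allele appears exactly once in the sample. The following is a minimal sketch only: the function name, the input layout, and the toy allele counts are all invented for illustration and are not taken from the TOPMed pipeline or data.

```python
from collections import defaultdict


def singleton_fraction_by_class(variants):
    """Return the fraction of singletons per functional class.

    `variants` is an iterable of (functional_class, alt_allele_count) pairs.
    This input layout is a hypothetical simplification, not a TOPMed format.
    """
    totals = defaultdict(int)
    singletons = defaultdict(int)
    for functional_class, allele_count in variants:
        if allele_count < 1:
            continue  # the site is not variant in this sample, so skip it
        totals[functional_class] += 1
        if allele_count == 1:
            # a singleton: the alternate allele was observed exactly once
            singletons[functional_class] += 1
    return {cls: singletons[cls] / totals[cls] for cls in totals}


# Toy allele counts, invented purely to show the calculation.
toy_variants = [
    ("frameshift", 1), ("frameshift", 1), ("frameshift", 1), ("frameshift", 4),
    ("nonsynonymous", 1), ("nonsynonymous", 1), ("nonsynonymous", 7), ("nonsynonymous", 12),
    ("synonymous", 1), ("synonymous", 5), ("synonymous", 30), ("synonymous", 250),
]

fractions = singleton_fraction_by_class(toy_variants)
for cls, frac in sorted(fractions.items(), key=lambda item: -item[1]):
    print(f"{cls}: {frac:.2f}")
```

In this toy input the ordering mirrors the pattern the researchers reported, with the most disruptive class carrying the largest share of singletons, but the numbers themselves carry no biological meaning.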
To evaluate whether their data could be used to generate more comprehensive variation datasets, the researchers developed a method based on de novo assembly of unmapped and mismapped read pairs, enabling them to assemble sequences that were present in a sample but absent from, or improperly represented in, the reference. In total, they placed 1,017 ancestral sequences and were able to fully resolve 713 of them. The sequences ranged in length from 100 base pairs to 39 kilobases and accounted for a total of 528 kilobases. Of these 1,017 events, 551 occurred within GENCODE v.29 genes, they added.
The investigators also ordered TOPMed participants by population group and calculated genetically determined ancestry components, heterozygosity, number of singletons, and rare variant sharing. They found that African American and Caribbean population groups had the greatest heterozygosity, followed by Hispanic/Latino, European American, Amish, East Asian, and Samoan groups. This was consistent with a gradual loss of heterozygosity tracking the recent African origin of modern humans and subsequent migrations from Africa to the rest of the globe, the researchers said.
Finally, the team noted that in addition to enabling detailed analysis of the TOPMed samples, the program can enhance the analysis of any genotyped samples. To that end, the researchers constructed a TOPMed-based imputation reference panel that now includes 97,256 individuals, as well as more than 308 million SNVs and indels.
"This is, to our knowledge, the first imputation reference panel that is based exclusively on deep [whole-genome sequencing] data in diverse samples and greatly exceeds previously published alternatives," the authors wrote.
Overall, they concluded, the TOPMed sequencing data provides a resource for developing and testing methods for analyzing human variation, for inference of human demography, or for exploring genome function. Further, they added, TOPMed data can improve nearly all ongoing studies of common and rare disorders by providing a deep catalogue of variation in healthy individuals and an imputation resource that enables array-based studies to "achieve a completeness that was previously attainable only through direct sequencing."