NEW YORK (GenomeWeb News) – In a paper scheduled to appear online this week in the Proceedings of the National Academy of Sciences, researchers from the Harvard School of Public Health used a mathematical method to estimate how many undetected variants remain in the human genome — and how many individuals would have to be sequenced to detect specific fractions of those variants in different human populations.
The team applied a parametric beta-binomial model to data from three large-scale studies: ENCODE, SeattleSNPs, and the National Institute of Environmental Health Sciences (NIEHS) SNPs project. As previously reported, the researchers found that the African genomes tested had the greatest SNP diversity and the Asian genomes the least, with European genomes in between.
And based on the abundance of SNPs detected in the SeattleSNPs and NIEHS SNPs projects compared with the ENCODE project, the researchers concluded that parts of the genome involved in environmental and inflammatory responses tend to harbor enhanced genetic diversity.
Human genome sequencing projects and related research efforts have given scientists an increasing appreciation of the level of variation in the human genome. And the 1000 Genomes Project, an effort to sequence the genomes of 1,000 individuals, is currently underway to catalogue both common genetic variants — those found in at least one percent of individuals — and copy number variants.
But researchers are trying to determine how many undetected variants remain in the human genome and how many individuals need to be sequenced to see all or most of these variants.
In the PNAS paper, lead author Iuliana Ionita-Laza, a biostatistician at Harvard School of Public Health's Department of Biostatistics, and her colleagues applied a mathematical approach called a parametric beta-binomial model to sequence data from the ENCODE, SeattleSNPs, and NIEHS SNPs datasets and compared genetic diversity estimates both within and between the different projects.
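To give a sense of what a beta-binomial calculation of this kind looks like, the following is a minimal sketch and not the authors' implementation. It assumes variant frequencies follow a Beta(a, b) distribution and asks how likely a variant is to go unseen in a sample. The function names and the shape parameters `a` and `b` are hypothetical and are not taken from the paper.

```python
import math


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def prob_unseen(n_chromosomes, a, b):
    """Probability that a variant appears zero times in n sampled chromosomes.

    Assumes (for illustration only) that the variant's population frequency
    is drawn from Beta(a, b), so the count in the sample is beta-binomial
    and P(count = 0) = B(a, b + n) / B(a, b).
    """
    return math.exp(log_beta(a, b + n_chromosomes) - log_beta(a, b))


def prob_seen(n_individuals, a, b, ploidy=2):
    """Probability that a variant is observed at least once in the sample.

    Each individual contributes `ploidy` chromosomes (2 for diploid humans).
    """
    return 1.0 - prob_unseen(ploidy * n_individuals, a, b)
```

With a = b = 1, that is, uniformly distributed frequencies, the unseen probability reduces to 1/(n + 1), which makes a convenient sanity check. Real frequency spectra are heavily skewed toward rare variants, so any fitted value of `a` would be expected to be well below 1.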
For instance, the ENCODE dataset includes 500,000 bases of sequence data from ten regions of the genome in 16 Yoruban, 16 European, eight Han Chinese, and eight Japanese individuals and is intended to represent the genome as a whole. In that dataset, the researchers found that African individuals had the most diverse genomes.
In contrast, Asian individuals appear to have the lowest genomic diversity, with Japanese individuals having lower sequence diversity than Chinese individuals. European individuals, meanwhile, had intermediate genomic diversity.
Based on ENCODE data, the team predicted that sequencing 154 European individuals would capture about 80 percent of the rare variants (those found in at least 0.1 percent of individuals) in this population, while sequencing 1,008 individuals is expected to uncover 99 percent of them. But, the researchers found, it would likely take sequencing more than 3,500 Europeans to find all of the rare variants in that population.
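The relationship between sample size and the share of variants captured can be sketched numerically. The code below is an illustrative approximation under stated assumptions, not the method used in the paper. It assumes variant frequencies follow a Beta(a, b) density, considers only variants at or above a minimum frequency, and searches for the smallest number of diploid individuals whose expected detected share reaches a target. The function names and any parameter values passed to them are hypothetical, so the output should not be expected to reproduce the figures reported by the team.

```python
import math


def detected_fraction(n_individuals, a, b, min_freq, steps=20000):
    """Expected share of variants with frequency >= min_freq that appear
    at least once among 2 * n_individuals chromosomes.

    Assumes variant frequencies follow a Beta(a, b) density (hypothetical
    parameters). Integrates with the midpoint rule on a log-frequency grid.
    """
    n_chrom = 2 * n_individuals
    lo = math.log(min_freq)
    du = (0.0 - lo) / steps
    seen = 0.0
    total = 0.0
    for i in range(steps):
        p = math.exp(lo + (i + 0.5) * du)          # midpoint frequency
        weight = p ** a * (1.0 - p) ** (b - 1.0)   # Beta density times dp/du
        total += weight
        seen += weight * (1.0 - (1.0 - p) ** n_chrom)
    return seen / total


def individuals_needed(target, a, b, min_freq, max_n=100000):
    """Smallest sample size whose expected detected share reaches `target`."""
    lo, hi = 0, 1
    # Double the upper bound until it meets the target.
    while detected_fraction(hi, a, b, min_freq) < target:
        lo, hi = hi, hi * 2
        if hi > max_n:
            raise ValueError("target not reached within max_n individuals")
    # Binary search: the detected share rises monotonically with sample size.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if detected_fraction(mid, a, b, min_freq) >= target:
            hi = mid
        else:
            lo = mid
    return hi
```

A call such as `individuals_needed(0.8, 0.1, 1.0, 0.001)` would return the sample size at which 80 percent of variants with frequency of at least 0.1 percent are expected to be seen under those made-up shape parameters. The sketch also shows why the last few percent are expensive: the variants still missing are the rarest ones, and each additional individual adds less than the one before.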
Next, they looked at data from the SeattleSNPs project, which evaluated inflammatory response genes by assessing roughly 1.6 million bases of reference sequence from 76 genes in 24 Yoruban and 23 European individuals. The researchers found that using ENCODE data to predict SNPs leads to an underestimate of genetic variation in the SeattleSNPs dataset.
Similarly, SNP estimates based on ENCODE data are expected to underestimate genetic diversity in data from the NIEHS SNPs project, which sequenced 293 environmental response genes in 27 individuals of African descent, 22 individuals each of European and Hispanic descent, and 24 individuals of Asian descent, including a dozen Han Chinese and a dozen Japanese.
The team suggested that such results were attributable to the differences in regions being evaluated. Whereas ENCODE data contains information representing the genome at large, the other two projects looked specifically at inflammatory (SeattleSNPs) and environmental (NIEHS SNPs) response genes.
That, in turn, led the team to conclude that environmental response genes harbor much higher diversity than the genome as a whole. And, they noted, African samples showed even greater genetic diversity in these regions than Asian or European samples.
Based on their analyses, the team concluded that the 1000 Genomes Project will likely uncover most common genetic variants as well as many of the rare genetic variants. To find most or all of the rare variants, though, the researchers predicted that more than 3,000 individuals would have to be sequenced.
"The number of individuals necessary to capture all common variation (frequency at least one percent) is small and the 1000 Genomes Project is likely to find most of them, subject to sequence accuracy," Ionita-Laza and her colleagues wrote. "Even for the rarer variants (frequency at least 0.1 percent), a large proportion of them can be found with small samples (in the low hundreds), but to find all of them, thousands of individuals are necessary."
The team noted that it should be possible to extend their mathematical approach to estimate not only the number of SNPs in the human genome but also the number of other types of genetic variants.
"Although we applied the approach to SNP data, the method applies equally well to counting other types of variants, including copy-number variants," Ionita-Laza and her colleagues concluded. "This is particularly useful because currently much less is known about copy-number variants than about SNPs."