NEW YORK (GenomeWeb News) – A Beijing Genomics Institute-led team has moved a bit closer to the goal of creating a "pan-genome" representing genome sequence from diverse human populations.
The team used short-read assembly approaches to put together new Asian and African genomes, which they then compared with the current human reference sequence. The result: about five million bases of sequence not found in the reference.
Based on the findings, which appeared online last night in Nature Biotechnology, the team speculated that future efforts to develop a human pan-genome could turn up between 19 and 40 million bases of human sequence beyond that in the existing reference.
"Our study shows that combining individual-specific sequences with shared core human genome sequences will enable the creation of a human pan-genome that will be important for better understanding personal genomes and their use in medical genomics studies," co-senior authors Jun Wang and Jian Wang, from BGI-Shenzhen, and their co-authors wrote.
The team did new short-read assemblies using data from Asian and African genomes that were sequenced last fall. Sequence reads for both the Asian genome (sequenced by researchers at BGI) and the African genome (sequenced by Illumina researchers) were generated with the Illumina Genome Analyzer.
The resulting Han Chinese genome contained 2.87 billion bases while the Yoruban genome was 2.68 billion bases in size.
When they compared the genomes to the NCBI reference genome, the researchers found that the newly assembled Chinese genome contained 5.1 million bases not found in the reference, while the African genome contained 4.8 million bases of non-reference sequence.
Additional comparisons suggested about 82 percent of the new Han Chinese sequence and more than 89 percent of the new African sequence are either present in both genomes or correspond to sequence reads from the Watson genome, Venter genome, or other human sequences in GenBank. Another 311,500 bases of the new Asian sequence and 176,500 bases of new African sequence were homologous to sequences in mammalian genomes.
In their subsequent analyses of the genomes, the team looked at the prevalence of insertions and deletions in the newly assembled genomes.
They also began applying the data to answer questions about human population and migration patterns, homing in on 164 newly detected sequences that did not overlap between the Asian and African genomes.
Using PCR, the researchers amplified these sequences from Human Genome Diversity cell line DNA representing 351 individuals from 41 populations. Phylogenetic analyses of these sequences identified clusters that corresponded geographically to where samples had been collected, the team reported.
While the overall pattern in these sequences was consistent with an out-of-Africa migration, their analyses also uncovered new patterns that couldn't be detected from mitochondrial and Y chromosome DNA studies.
For instance, they found that a sequence frequently found in the San population in southern Africa becomes progressively less frequent toward northern Africa. The frequency of this sequence apparently dwindles even more in populations outside of Africa, disappearing in European, Oceanic, and Native American populations.
In contrast, the team noted, they found a sequence that becomes more common with increasing geographic distance from Africa. Still other sequences had less straightforward patterns, decreasing in East Asian and Oceanic populations compared with African populations but turning up again in European populations.
Along with population patterns, the genomes also provided hints about how genome sequence varies from one individual to the next. For example, while the Asian and African individuals' genomes differed by about four million bases (not including SNPs), the researchers found a difference of about 1.8 million bases between the Chinese genome and a preliminary Korean genome sequence assembly.
Since this genome diversity coincides with SNP differences, the team explained, it should be possible to get a rough idea of differences between individual genomes by extrapolating from SNP data.
Taking this a step further, the team predicted that a human genome sequence representing more than six billion people (roughly the world's population) would contain somewhere between 19 and 40 million bases of sequence not found in the current human reference.
The findings also may have implications for how researchers assess genetic variation in the future, the team argued, suggesting widespread sequencing will be needed to fully understand individual and population-based differences in the human genome.
"Based on our findings here, it is also clear that establishing a complete human pan-genome will require using extensive sequencing data rather than relying on array-based technologies that are dependent on the current reference genome," they wrote.