NEW YORK (GenomeWeb) – Using genome sequence data from more than 900 individuals of African ancestry, a team led by researchers at Johns Hopkins University has identified hundreds of millions of bases that are not represented in the human reference genome.
The researchers analyzed deep whole-genome sequence data from 910 previously sequenced African individuals, uncovering more than 296 million bases of sequence that did not align to version GRCh38 of the human reference genome. As they reported today in Nature Genetics, the researchers then assembled those non-reference sequences into more than 125,700 "African pan-genome contigs," including hundreds of contigs that overlapped with and altered protein-coding genes.
"Overall, these results suggest that a single reference genome is not adequate for population-based studies of human genetics," senior and co-corresponding author Steven Salzberg, a computational biology, biostatistics, and biomedical engineering researcher at Johns Hopkins, and his colleagues wrote. "Instead, a better approach may be to create reference genomes for all distinct human populations, which over time will eventually yield a comprehensive pan-genome capturing all of the DNA present in humans."
The GRCh38 reference genome represents more than 3 billion bases of sequence data, the team noted. After more than a decade of ongoing improvements, gaps in the human genome have been shrinking, while groups such as the Genome Reference Consortium have been attempting to capture more forms of genomic variation with alternate loci. Even so, the work to accurately represent a broader range of human populations in the human reference genome is not finished.
"Despite these efforts, the current human reference genome derives primarily from a single individual, thus limiting its usefulness for genetic studies, especially among admixed populations, such as those representing the African diaspora," the authors wrote, noting that the "lack of diversity in the reference genome poses many challenges when analyzing individuals whose genetic background does not match the reference."
In a statement, Salzberg further noted that one problem with relying on a single reference genome is that "when a particular DNA analysis doesn’t match the reference and you throw away those non-matching sequences, those discarded bits may in fact hold the answers and clues you are seeking."
The researchers brought together nearly 1.2 trillion paired-end reads for the analysis. The 910 individuals included in the study had their genomes sequenced to an average depth of 30- to 40-fold by the Consortium on Asthma among African-Ancestry Populations in the Americas (CAAPA).
The team's analysis suggested that an African pan-genome bolstered by these new sequences contained roughly 10 percent more DNA than found in the GRCh38 human reference genome. It placed the 296.5 million bases of sequences that resisted alignment with the reference genome into 125,715 contigs, including 33,599 contigs present in the genomes of two or more of the CAAPA participants and tens of thousands of contigs that closely resembled sequences in Korean or Chinese genomes.
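As a rough check on the figures above, the short Python sketch below recomputes the "roughly 10 percent" estimate, the average contig length, and the share of contigs seen in two or more participants. The contig and base counts are those quoted in this article. The reference size of exactly 3 billion bases is an assumption made here for illustration (the article says only "more than 3 billion"), and the variable and function names are hypothetical, not taken from the study.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# GRCH38_BASES is an assumed round number; the article says only
# "more than 3 billion bases".
NOVEL_BASES = 296_500_000      # bases that did not align to GRCh38
GRCH38_BASES = 3_000_000_000   # assumption: rounded reference size
NUM_CONTIGS = 125_715          # African pan-genome contigs
SHARED_CONTIGS = 33_599        # contigs present in two or more participants


def percent_of(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Extra DNA relative to the (assumed) reference size.
extra_pct = percent_of(NOVEL_BASES, GRCH38_BASES)

# Mean length of a non-reference contig, in bases.
mean_contig_len = NOVEL_BASES / NUM_CONTIGS

# Fraction of contigs observed in two or more individuals.
shared_fraction = SHARED_CONTIGS / NUM_CONTIGS

print(f"Extra DNA relative to reference: {extra_pct:.1f}%")
print(f"Mean contig length: {mean_contig_len:.0f} bases")
print(f"Contigs shared by 2+ individuals: {shared_fraction:.1%}")
```

Under these assumptions the new sequence comes out just under 10 percent of the reference size, consistent with the team's estimate. Because GRCh38 is somewhat larger than 3 billion bases, the true percentage would be slightly lower.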
Each of the CAAPA participants carried an average of 859 African pan-genome contigs, the researchers reported. Across the pan-genome, 387 of the inserted sequences overlapped with known genes, affecting 315 distinct protein-coding genes.
And when they looked at the distribution of the African pan-genome sequences in genomes from a dozen more African individuals and a dozen European individuals enrolled in the Simons Genome Diversity Project, the investigators found that while some of the sequences turned up in individuals of European ancestry, they were far more common in the African individuals profiled.
From these and other findings, the authors suggested that some of the African pan-genome contigs detected in the new study "have been lost in the small number of individuals used to create GRCh38, although some of them may reside in the few remaining gaps in the genome," and concluded that the analyses "demonstrate that the standard human reference genome lacks a substantial amount of DNA sequence compared with other human populations."