NEW YORK – Members of the Chinese Pangenome Consortium (CPC) have published new research on dozens of populations in China, assembling a draft pangenome reference that contains previously missing sequences, including sequences originating in archaic hominins.
The findings appeared in Nature on Wednesday.
"[W]e attempted to uncover missing sequences and hidden variations that have not been identified before in Chinese ethnic groups," co-senior and co-corresponding author Shuhua Xu, a human population omics group researcher affiliated with Fudan University, the Chinese Academy of Sciences, ShanghaiTech University, and Jiangsu Normal University, said in an email, adding that the CPC pangenome reference "undoubtedly provides a more comprehensive understanding of genomic variation in Asian populations, particularly those of Chinese ancestry."
Using Pacific Biosciences or Oxford Nanopore Technologies long reads, together with linked reads, Hi-C data, and Illumina short reads, the researchers put together 116 high-quality, haplotype-phased de novo genome assemblies for individuals from three dozen minority Chinese ethnic groups that have been underrepresented in prior research efforts.
In the process, they unearthed some 15.9 million small variants, including single-nucleotide variants and small insertions or deletions, along with 78,000 structural variants (SVs). Of these, around 5.9 million small variants and 34,000 SVs had not been identified in the past.
Relative to the GRCh38 reference sequence, the team's new pangenome adds 189 million bases of euchromatic polymorphic sequence, Xu explained, along with 1,367 duplications involving protein-coding gene sequences.
"[A]bout 18.4 percent of the small variants and 17.1 percent of the SVs identified were specific to the CPC assemblies compared with a recently released pangenome reference by the Human Pangenome Reference Consortium (HPRC)," he explained. "These newly identified genomic variations are more informative and thus can facilitate uncovering finer-scale population relationships, as the majority of the novel variations are population-specific."
The team achieved better alignments of Chinese genome sequences with the CPC pangenome reference than with the available reference from the Human Pangenome Reference Consortium.
"Compared with the HPRC graph reference, using the CPC graph reference improved the perfect alignment rate of short reads in East Asian samples," Xu noted, explaining that this improved alignment "would also help to improve the accuracy of profiling parts of the genome enriched with complex sequence variations (such as genes regulating the immune system)."
The new sequence data is expected to improve understanding of the sequences behind specific traits and conditions, Xu explained, including the missing heritability of complex diseases and associations that trace back to genes and genetic variants originating in archaic hominin sequences.
"Overall, such efforts would aid genomic analysis for human evolutionary and medical research," Xu said, noting that the current work "is just the first step" toward the team's goal of establishing a comprehensive high-quality genome reference for populations in China and other parts of Asia.