NEW YORK (GenomeWeb) – Some large structural variants in the human genome exhibit population-specific patterns, according to a new analysis of more than 150 genome maps.
Large structural variants — those that are bigger than 2 kilobases — are difficult to detect, especially as short-read sequencing technologies are the most commonly used tools in genomic analysis.
For their study, Pui-Yan Kwok from the University of California, San Francisco, and his colleagues analyzed optical genome maps generated for more than 150 individuals representing more than two dozen populations. A phylogenetic analysis of these maps indicated that some structural variants (SVs) and copy-number variants (CNVs) show variable population patterns. The researchers were also able to characterize SVs in typically intractable regions of the genome, including spots not covered by the human reference genome. Their results were published yesterday in Nature Communications.
"The ethnically diverse study reveals that one reference genome does not fit all, and that it is impossible for a genome analysis based on shortread sequencing alone to correctly characterize all clinically relevant genome variation at the root of human disease in individuals across different populations," said Sven Bocklandt, head of scientific affairs at Bionano Genomics, whose optical mapping approach was used in the analysis, in a statement.
Kwok and his colleagues generated optical genome maps for three men and three women from 26 different populations collected by the 1000 Genomes Project. They were able to map 93 percent of the genome, about 2.87 gigabases, with the inaccessible regions largely representing centromeric, pericentromeric, and telomeric regions.
For 13 of the most genetically diverse subpopulations, the researchers also generated linked-read sequence data using 10x Genomics' Chromium platform, garnering about 60X coverage on average.
When they placed the 154 maps they generated into a consensus assembly — which covered 99.3 percent of the accessible reference genome — the researchers found that much of the accessible genome was well covered by the individual maps, but that about 88 megabases were structurally diverse.
In all, they identified a median of 1,539 large indels in each sample that spanned 14.2 megabases. For the 144 samples they analyzed that were also studied by the 1000 Genomes Project, about a quarter of the large indels they found were also reported by that group. After comparing their findings to other databases, they reported that 34 percent of their SVs were novel.
They also performed a phylogenetic analysis based on these large indels and found that between 30 percent and 40 percent of them are shared across all five human super-populations, while about a quarter to a third are shared by some super-populations and between 22 percent and 44 percent are unique to one population.
When they then analyzed population patterns of CNVs, they noted varying copy-number levels based on population, with East Asians having the highest load. One variable CNV affects serum pepsinogen levels, the researchers noted, a predictor of gastric cancer, which is more common among East Asian populations.
The researchers compared their maps to the human reference genome to uncover about 60 megabases' worth of sequence that could not be aligned to the reference. But as some of this genomic content was present in other published sequence assemblies, the researchers suspected that these sequences were real, even if not present in the reference genome donors.
This suggested to them that "the reference genome represents only one haplotype among many," and that analysis of additional haplotypes is needed.