NEW YORK – An international team led by investigators in the US and Germany has demonstrated the extensive structural variation that can be found in haplotype-resolved human genomes assembled with the help of long-read sequence data.
"This improved understanding of the genome allows us to identify new hotspots of genetic instability that will be important for predicting where and why disease occurs — especially rare variants," co-senior and co-corresponding author Evan Eichler, a genome scientist at the University of Washington School of Medicine, said in an email.
As they reported in Science on Thursday, the investigators used continuous long-read or high-fidelity long-read Pacific Biosciences sequencing, coupled with single-cell template strand sequencing (Strand-seq) on tens of thousands of individual cells, to put together dozens of new, high-quality genome assemblies with phased haplotypes.
"The work provides fundamental new insights into the structure, variation, and mutation of the human genome," the authors wrote, "providing a framework for more systematic analyses of thousands of human genomes going forward."
With 64 haplotype assemblies — representing 32 individuals from more than two dozen populations in Africa, the Americas, East Asia, South Asia, and Europe — they uncovered a broad swath of structural variants that had been missed in the past using short-read genome sequencing methods, including almost 107,600 insertion or deletion variants, more than 300 inversions, and millions of small indels or single base changes.
The refined look at the genomes revealed structural variants that are missed by conventional whole-genome sequencing approaches, Eichler explained. And because such variants appear to be overrepresented in individuals with certain diseases, he added, "[t]here are a large number of patients with undiagnosed disease which will need to be investigated."
The team's findings also highlighted 278 apparent structural variant hotspots in the human genome, while offering a look at the mechanisms that contribute to new structural variants and some of their regulatory and functional consequences — from rare variants that alter gene function to those found in regulatory regions of the genome.
For example, the researchers tracked down more than 2,100 structural variant-based expression quantitative trait loci (eQTLs) influencing the expression of 1,526 genes by applying a genotyping method based on the haplotype-resolved sequences to available RNA sequence and short-read genome data.
Beyond insights into disease biology, the growing structural variant set is expected to provide a clearer look at the potentially beneficial variants that are overrepresented in individuals from human populations that have adapted to distinct environments.
"With these new reference data, genetic differences can be studied with unprecedented accuracy against the background of global genetic variation, which facilitates the biomedical evaluation of genetic variants carried by an individual," co-first author Peter Ebert, a researcher at Heinrich Heine University Düsseldorf, Germany, said in a statement.
Eichler noted that the Human Genome Structural Variation Consortium and Human Pan Genome Project are planning to apply similar approaches to generate some 500 more haplotype-resolved human genomes.
Investigators involved in the current study are also continuing to pursue human genome assemblies at even higher resolution, Eichler said, with the goal of completing every base pair and assigning its parental origin, from one telomere of a chromosome to the other.
"As costs go down and the technology improves, I believe this approach will ultimately replace commercial whole-genome sequencing with short reads," Eichler said, explaining that assemblies phased with long reads "provide access to variants that we have never seen before."