NEW YORK (GenomeWeb) – Using a suite of tools and technologies, researchers from the Human Genome Structural Variation Consortium have catalogued the haplotype-resolved structural variation within three parent-child trios, including variants missed by standard approaches.
Structural variants can be tricky to uncover using short-read sequencing and common detection algorithms. But by combining short- and long-read sequencing with strand-specific sequencing and optical mapping, and by applying multiple algorithms, an international team of researchers uncovered an average of 818,054 indels and 27,622 structural variants per genome in three parent-child trios. As they reported in Nature Communications today, their findings represented a three- to sevenfold increase in SV detection over conventional sequencing approaches.
"We employ multiple state-of-the-art sequencing technologies and methods to capture the full spectrum of genetic variation down to the single-nucleotide level, in a haplotype-aware manner," senior author Charles Lee, a researcher at the Jackson Laboratory for Genomic Medicine, and his colleagues wrote in their paper. "Our results indicate that with current methods, using multiple algorithms and data types maximizes SV discovery."
In this study, the researchers analyzed three parent-child trios of Han Chinese, Puerto Rican, and Yoruban ancestry, focusing primarily on the children. They sequenced each child's genome to an average of 223-fold coverage using a combination of Illumina short-read sequencing with different library types, Pacific Biosciences' long-read sequencing, and Bionano Genomics' optical mapping. They also generated Oxford Nanopore long-read sequencing data to validate some of the PacBio structural variant calls.
They further analyzed long-range phasing and haplotype structure using 10x Genomics' Chromium, Illumina's synthetic long reads, Hi-C, and Strand-seq, along with algorithms such as WhatsHap, StrandPhaseR, and LongRanger.
Overall, the researchers detected an average of 818,054 indels and 27,622 structural variants per genome, including 156 inversions. They noted that combining different algorithms and data types appeared to maximize structural variant detection, and estimated that doing so enabled them to detect up to sevenfold more structural variants than conventional sequencing approaches.
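To illustrate why pooling callers and data types raises the tally, here is a minimal Python sketch of one common way to combine call sets: take the union and collapse calls of the same type that share at least 50 percent reciprocal overlap. It is not the consortium's pipeline. The caller names, coordinates, and the `SVCall`, `reciprocal_overlap`, and `merge_callsets` helpers are all hypothetical, and the interval-based matching suits deletions and inversions rather than insertions, which would need breakpoint-distance matching.

```python
from dataclasses import dataclass, field


@dataclass
class SVCall:
    chrom: str
    start: int
    end: int
    svtype: str
    sources: set = field(default_factory=set)  # callers supporting this call


def reciprocal_overlap(a, b):
    """Overlap length divided by the longer interval.

    A value >= t means the overlap covers at least t of BOTH intervals.
    """
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return overlap / max(a.end - a.start, b.end - b.start)


def merge_callsets(callsets, min_overlap=0.5):
    """Union of all call sets, collapsing matching calls into one record."""
    merged = []
    for caller, calls in callsets.items():
        for call in calls:
            for existing in merged:
                if (existing.chrom == call.chrom
                        and existing.svtype == call.svtype
                        and reciprocal_overlap(existing, call) >= min_overlap):
                    existing.sources.add(caller)
                    break
            else:
                # No match found: this caller contributes a new variant.
                merged.append(SVCall(call.chrom, call.start, call.end,
                                     call.svtype, {caller}))
    return merged


# Toy, made-up calls from three hypothetical data types.
callsets = {
    "short_read": [SVCall("chr1", 1000, 2000, "DEL")],
    "long_read": [SVCall("chr1", 1050, 2050, "DEL"),
                  SVCall("chr1", 5000, 5400, "INV")],
    "optical_map": [SVCall("chr1", 5010, 5390, "INV"),
                    SVCall("chr2", 100, 900, "DEL")],
}

merged = merge_callsets(callsets)
for sv in merged:
    print(sv.chrom, sv.start, sv.end, sv.svtype, sorted(sv.sources))
```

In this toy input the short-read caller alone reports one variant, while the merged set holds three, two of them supported by more than one data type.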
They similarly calculated that common short-read calling algorithms missed about 83 percent of insertions, particularly insertions within tandem repeats and retrotransposon insertions between 50 base pairs and 2 kilobases in size. Likewise, they identified 181 inversions missed by Phase 3 of the 1000 Genomes Project.
Fifty-eight of the inversions the researchers detected overlapped at least in part with regions associated with recurrent microdeletion and microduplication syndromes. The researchers postulated that these inversions could predispose the loci to undergo pathogenic microdeletions or microduplications.
While the researchers presented their methods and results as a gold standard, they noted that in most cases it is not practical for investigators to use this many technologies and algorithms in their work. They suggested that their report instead serve as a guide for balancing the cost of sequencing against sensitivity to detect structural variants.
When they assessed which technologies and algorithms, or combinations thereof, were best suited to detect indels and SVs, they reported that with high-coverage Illumina sequencing and most of the algorithms they used, about 52 percent of total deletion structural variants and about 18 percent of insertion structural variants could be detected.
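Percentages like these are per-type sensitivities: the share of variants in a reference call set that a given platform and algorithm combination recovers. The short Python sketch below shows the calculation on made-up identifiers. The `sensitivity_by_type` helper and the toy data are hypothetical and do not reproduce the study's figures.

```python
from collections import Counter


def sensitivity_by_type(reference, detected_ids):
    """Fraction of reference variants of each type that were detected.

    reference: dict mapping variant ID -> SV type (e.g. "DEL", "INS")
    detected_ids: set of variant IDs recovered by the method under test
    """
    totals = Counter(reference.values())
    hits = Counter(svtype for sv_id, svtype in reference.items()
                   if sv_id in detected_ids)
    return {svtype: hits[svtype] / totals[svtype] for svtype in totals}


# Toy reference set: two deletions and four insertions (invented IDs).
reference = {
    "sv1": "DEL", "sv2": "DEL",
    "sv3": "INS", "sv4": "INS", "sv5": "INS", "sv6": "INS",
}
# Variants recovered by a hypothetical short-read-only analysis.
short_read_hits = {"sv1", "sv3"}

rates = sensitivity_by_type(reference, short_read_hits)
print(rates)
```

Here the hypothetical short-read analysis recovers one of two deletions and one of four insertions, mirroring the pattern in which insertions are harder to detect than deletions.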
Through a series of downsampling analyses to assess how typical large-scale studies would fare, the researchers found that relying on 30X coverage with the algorithms' default parameters would reduce calls by 11 percent because of the lower coverage and by 23 percent because of the default parameters.
"Our analyses indicate that the contribution of SVs to human disease has not been comprehensively quantified based upon studies that have relied upon short-read sequencing," Lee and his colleagues added.