NEW YORK – Long-read sequencing technologies are not a catch-all solution for identifying structural variants with whole-genome sequencing, and methods using long reads currently have "blind spots" in important variant classes, according to a new study.
"Long reads have considerable added value, but they're not a panacea," said Michael Talkowski, a researcher at the Broad Institute and Massachusetts General Hospital and corresponding author of the study comparing structural variant (SV) detection using alignment-based methods, primarily using short reads, and assembly-based methods, primarily using long reads.
In collaboration with researchers from the Human Genome Structural Variation Consortium, his lab analyzed SVs in three matched trio families from the 1000 Genomes Project. They published their study results last week in the American Journal of Human Genetics.
Assembly-based methods detected, on average, nearly 25,000 SVs per genome, more than twice as many haplotype-resolved SVs as the most sensitive alignment-based methods. However, genomic context played an important role in the ability to detect SVs. The methods showed high concordance for deletions outside of simple repeat and segmental duplication regions, but overall concordance was low.
The study found an average of 167 large copy number variants (CNVs) per genome that were identified only with methods that rely on sequencing depth, 88 percent of which were not detected by assembly-based methods. CNVs are strongly enriched for pathogenic variation, the researchers said, and "[appear] to be a significant blind spot for long-read assembly technologies."
Long reads provide an advantage over short reads mostly in detecting variants within regions characterized by segmental duplications and simple repeats. Over 90 percent of SVs detected by long reads alone localize to those regions, which make up just under 10 percent of the GRCh38 human reference genome. Long reads were also superior for detecting insertions regardless of genomic context, the study said.
"There is a bit of a cliche that you need long reads for SVs and that long reads will solve all SVs," Marcin Imielinski, a researcher at the New York Genome Center and Weill Cornell Medicine, who has studied the role of structural variants in cancer using both short- and long-read technologies and was not involved in the study, said in an email. "I think this study approaches this question critically, and actually shows that for certain classes of SVs (e.g., deletions) short reads do very well, and for others they give a precise footprint of what's missing."
Imielinski added that these "'known unknowns' are actually very important pieces of data. They tell us how far we have to go, either with bench technology — longer read or molecule lengths — or analytic methods" for long reads.
The study comes as long-read methods are catching up in terms of accuracy to short-read sequencing, primarily from Illumina, prompting some diagnostics providers to explore their use in whole-genome sequencing. Since introducing HiFi sequencing, which can reach accuracy greater than 99.9 percent, Pacific Biosciences has inked diagnostic partnerships with Invitae and Children's Mercy around WGS. Oxford Nanopore Technologies has released a new basecalling algorithm that it says offers accuracy of 98.3 percent and is planning a line of instruments for regulated applications, including clinical sequencing.
Long-read sequencing provides the ability to sequence along much larger stretches of the genome, including regions intractable to short-read technologies, suggesting to some that SVs are better analyzed that way. But long-read sequencing costs more and has lower throughput than short-read sequencing.
"Our goal here was to tell everybody what to expect" when detecting SVs using various methods, Talkowski said. The analysis, led by postdoc Xuefang Zhao, centered on insertion and deletion data from previous studies by the Human Genome Structural Variation Consortium.
The authors noted that their study did not consider some variant types, including inversions, translocations, and balanced and complex SVs, because they were not uniformly called by long-read whole-genome sequencing assembly algorithms. "It was a choice because the data weren't there to do so" when the project started, Talkowski said. He noted that more recent studies may contain those data but that, overall, indels are the predominant form of SVs.
Using CNVs, another clinically relevant SV, as a benchmark was also important, Imielinski said, "since most SVs create a change in copy number and genomic connectivity." Long reads may not be able to map copy changes to rearrangements, thus missing them, while short reads might detect such variants through a change in read depth.
"By challenging long-read methods to explain these changes, this study is really taking long-read methods to task," he said. The blind spot in CNVs "shows that the analytic methods have a long way to go."
Talkowski stressed that technologies and methods are separate considerations. "I would not go as far as to say long-read technologies missed these [copy number] variants," he said.
Read depth-based methods for variant calling are technically feasible with long reads, "but there aren't great methods to take advantage of that, yet," Imielinski said, adding that the breakpoints that explain these CNV calls should exist somewhere in the long-read data, but current mapping or assembly methods aren't finding them. "That points to another methodological blind spot that could be improved with better long read assembly and mapping," he said. Those breakpoints could also exist in regions so repetitive that even longer reads or optical mapping data would be needed to identify them, he noted.
"We need to apply new methods and improve [long-read] technology, and hopefully we can capture a much larger fraction of the genome than we can right now," Talkowski said. Variant interpretation will also play an important role as these technologies head towards clinical use. "Most of the unique SVs observed from long reads are within regions where we don't have great ways to interpret the impact of their variation," he noted.
"I suspect that long-read assemblies will eventually make significant added value both in genome biology and disease association as well as clinical interpretation, in time," Talkowski said. "But it's early days and there are many activities we need to pursue to bring [long-read technology] to the fore of our diagnostic approaches."