SAN FRANCISCO (GenomeWeb) – Longer sequencing reads are better at detecting structural variants than shorter reads, and a new study illustrates why that is and identifies thousands of novel variants from well-characterized samples.
In addition, the study, published this week in Nature Methods, identifies differences between Pacific Biosciences' and Oxford Nanopore Technologies' sequencing platforms.
In the study, a research team led by Michael Schatz, an associate professor of computer science at Johns Hopkins University, described two open-source tools his group developed specifically to call structural variants from data generated by long-read sequencing instruments, and applied them to well-characterized genomes.
Schatz, who has been working with both Pacific Biosciences' and Oxford Nanopore Technologies' sequencing instruments for genome assembly, said, "It's clear that these technologies are really transformative for certain problems" such as sequencing through repetitive regions to enable "better connectivity." He said that about two years ago he wanted to see whether longer reads would offer benefits for "increasingly complicated samples, like a cancer genome with lots of variability."
However, when the team began using available tools for calling structural variants, "we weren't very successful," Schatz said. He attributed that to the fact that the technologies are so different. Illumina instruments rely on paired-end reads that are around 100 to 1,000 times shorter than long reads. "That changes the data characteristics a lot," Schatz said, "so we decided to start from scratch" in designing tools.
The tools include an aligner, NGMLR, as well as a structural variant caller, Sniffles. The NGMLR alignment tool first partitions reads into subsegments and aligns those to the reference. It then groups those subsegments into longer reads and creates possible alignment graphs. The tool then scores the possibilities, choosing the best one.
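As a rough illustration of the partition-and-group idea described above, the following Python sketch splits a read into fixed-size subsegments, places each one on a reference by exact matching, and merges neighboring subsegments that share the same offset. It is a toy under stated assumptions, not NGMLR's implementation: the function names, the exact-match placement, the subsegment size, and the sequences are all hypothetical, and the step of scoring competing alignment candidates is left out.

```python
def split_read(read, size):
    """Partition a read into consecutive subsegments of at most `size` bases."""
    return [(i, read[i:i + size]) for i in range(0, len(read), size)]


def unique_hit(segment, reference):
    """Return the reference position of an exact, unique match, else None."""
    first = reference.find(segment)
    if first == -1 or reference.find(segment, first + 1) != -1:
        return None
    return first


def group_segments(read, reference, size=4):
    """Merge consecutive subsegments that share one offset (ref_pos - read_pos).

    Each group stands for one linear stretch of alignment; a change of
    offset between groups hints at an insertion or deletion in between.
    """
    groups = []
    for read_pos, segment in split_read(read, size):
        ref_pos = unique_hit(segment, reference)
        if ref_pos is None:
            continue  # toy simplification: skip unplaced or ambiguous pieces
        offset = ref_pos - read_pos
        if groups and groups[-1]["offset"] == offset:
            groups[-1]["read_end"] = read_pos + len(segment)
        else:
            groups.append({"offset": offset,
                           "read_start": read_pos,
                           "read_end": read_pos + len(segment)})
    return groups


# Hypothetical sequences: the read lacks the middle 8 bases of the reference.
reference = "ACGTTGCA" + "GGGGCCCC" + "TTAGGCAT"
read = "ACGTTGCA" + "TTAGGCAT"
groups = group_segments(read, reference)
for group in groups:
    print(group)
```

In this made-up example the two halves of the read land in separate groups whose offsets differ by eight, the length of the stretch missing from the read.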
Schatz said that the NGMLR aligner, primarily developed by the study's co-first authors, Fritz Sedlazeck and Philipp Rescheneder, starts from seed and extend, an approach used in short-read aligners. The key innovation, he said, is that it takes into account the types of data generated by the PacBio and Oxford Nanopore platforms.
"Both have errors that are largely in gaps: insertions and deletions," Schatz said. "So, the aligner needed to be set up so that it knew those indels would be present." The aligner also needed to take into account the fact that the data would contain real insertions and deletions. "We needed to balance these two factors," he said. To do this, the group developed a scoring system for the gaps.
Sniffles, the structural variant caller, scans the alignment to look for structural variants. With clean alignments, those variants are much easier to pick out, Schatz said.
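To make the gap-scoring trade-off concrete, here is a minimal Python sketch that contrasts a standard affine gap penalty with one whose per-base cost shrinks as the gap grows. The article does not detail the scoring system the group built, so the decaying scheme, the function names, and every parameter value below are assumptions chosen for illustration only.

```python
def affine_gap_penalty(length, gap_open=5.0, gap_extend=1.0):
    """Classic affine cost: every base after the first costs the same."""
    return gap_open + gap_extend * (length - 1)


def decaying_gap_penalty(length, gap_open=5.0, gap_extend=1.0,
                         decay=0.15, floor=0.1):
    """Hypothetical alternative: each further gap base costs a little less
    than the previous one, down to a minimum of `floor` per base."""
    penalty = gap_open
    for k in range(length - 1):
        penalty += max(floor, gap_extend - decay * k)
    return penalty


# Compare the two schemes for a short gap and for progressively longer ones.
for gap_length in (1, 5, 50):
    print(gap_length,
          affine_gap_penalty(gap_length),
          decaying_gap_penalty(gap_length))
```

Under these assumed parameters, one- and two-base gaps cost the same in both schemes, while a long gap costs far less under the decaying scheme. That is one way an aligner could keep penalizing short, error-like indels while still extending through a long, real one.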
In the study, the team first benchmarked their tools against other approaches on simulated reads with known structural variants of different sizes and types. They also tested them on Arabidopsis thaliana samples and Ashkenazi human trio sequencing data from Genome-in-a-Bottle. When comparing PacBio with Illumina data, the group found Mendelian discordance rates of 5.6 percent for PacBio on the human trio and 21 percent for Illumina, noting that translocations were particularly problematic for Illumina data.
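For readers unfamiliar with the metric, a Mendelian discordance rate is the share of variant calls in a trio where the child's genotype cannot be explained by one allele inherited from each parent. The Python sketch below shows one common way to compute it, assuming genotypes are coded as the number of variant alleles (0, 1, or 2). The genotype data are invented for illustration, and the study's exact definition may differ.

```python
def transmissible(genotype):
    """Variant-allele counts (0 or 1) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian(father, mother, child):
    """True if the child's genotype is one allele from each parent."""
    return any(f + m == child
               for f in transmissible(father)
               for m in transmissible(mother))


def discordance_rate(trio_calls):
    """Fraction of (father, mother, child) calls that violate inheritance."""
    discordant = sum(not is_mendelian(f, m, c) for f, m, c in trio_calls)
    return discordant / len(trio_calls)


# Invented genotypes: (father, mother, child) variant-allele counts per site.
calls = [(0, 1, 1), (1, 1, 2), (0, 0, 1), (2, 2, 1), (1, 0, 0)]
rate = discordance_rate(calls)
print(f"{rate:.0%} of calls are discordant")
```

In the invented data, the third and fourth sites are discordant: a child cannot carry a variant absent from both parents, and two homozygous-variant parents cannot have a heterozygous child.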
Next, the team compared PacBio and Oxford Nanopore sequencing to each other and also to Illumina, testing the technologies with the new tools on the well-characterized genome NA12878.
Sniffles called 15,499 structural variants from the PacBio data and 26,657 SVs from Oxford Nanopore data, while the short-read structural variant caller SURVIVOR called 7,275 SVs from Illumina sequencing data.
Of the PacBio calls, nearly 95 percent were confirmed by Oxford Nanopore, Illumina, or other existing data sets. The Oxford Nanopore data had much lower concordance, with 11,433 of the called SVs, or 43 percent, unique to that dataset. Of those, the vast majority were found within homopolymers or repeats.
Schatz noted that most of the errors in the Oxford Nanopore data were deletions. "A lot of those are related to the base caller," he said, and the difficulties associated with distinguishing between bases in a long homopolymer run.
Interestingly, Oxford Nanopore and PacBio produced entirely different types of errors. Whereas the vast majority of Oxford Nanopore errors were deletions in homopolymer regions, most of the 773 structural variants unique to the PacBio data, or 5 percent of its SV calls, were small insertions. That is related to the biophysics of the system, Schatz said: fluorescently tagged nucleotides travel into the zero-mode waveguides and are imaged even though they are not being incorporated during DNA synthesis.
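The shares quoted for calls unique to each long-read platform follow directly from the reported counts, as this short Python check shows. The helper name is hypothetical; the counts are the ones given in the article.

```python
def share(part, total):
    """Return `part` as a percentage of `total`."""
    return 100 * part / total


# Calls unique to each platform, out of all Sniffles calls on that platform.
ont_unique = share(11_433, 26_657)     # Oxford Nanopore
pacbio_unique = share(773, 15_499)     # PacBio

print(f"Oxford Nanopore unique: {ont_unique:.1f}%")
print(f"PacBio unique: {pacbio_unique:.1f}%")
```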
Aaron Wenger, a principal investigator on PacBio's bioinformatics team, said that the study and the tools developed could help move the field toward a "convergence of community standards" on par with standards that have been developed for short-read sequencing. The study demonstrates a "maturation" in the long-read sequencing field, he said.
These "third-party benchmarks," like the Genome-in-a-Bottle datasets, "help define gold standards and truth sets" so that others can "determine whether they are doing a good job of detecting SVs and whether the SVs they're detecting are real or artifact," Wenger added.
He noted that PacBio's internal aligner uses NGMLR, and that the firm has its equivalent of the Sniffles SV caller that it recommends.
Wenger also highlighted the fact that 95 percent of the SVs called from the PacBio data were confirmed, which is important because it helps reduce the costs and time associated with false discoveries.
Oxford Nanopore declined to comment on the study.
Nick Loman, a professor at the University of Birmingham who was not affiliated with the study, said that the "software and refinements to long-read alignment seem good and important." He noted that the large number of false-positive indels from the nanopore data should be improved upon with newer versions of the base caller that have since been released, an assessment Schatz agreed with.
Critically, the researchers also compared the long-read data with the short-read data. For the NA12878 genome, they were able to rule out the vast majority, 83 percent, of translocations called from the Illumina data. They found that most of those translocations overlapped with insertions that were detected in one or both of the long-read platforms' data, concluding that the false Illumina translocation calls were likely due to mismapped reads across insertions.
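The kind of cross-check described above can be sketched as a simple proximity test: flag any short-read translocation call whose breakpoint lies close to an insertion seen in long-read data. The Python example below is a minimal illustration, not the study's method. The function names, the 100-base window, and all coordinates are hypothetical.

```python
def near_insertion(breakpoint, insertions, window=100):
    """True if a (chrom, pos) breakpoint lies within `window` bases of an
    insertion position on the same chromosome."""
    chrom, pos = breakpoint
    return any(ins_chrom == chrom and abs(pos - ins_pos) <= window
               for ins_chrom, ins_pos in insertions)


def suspect_translocations(translocations, insertions, window=100):
    """Return translocation calls with either breakpoint near an insertion."""
    return [pair for pair in translocations
            if near_insertion(pair[0], insertions, window)
            or near_insertion(pair[1], insertions, window)]


# Invented calls: each translocation is a pair of (chrom, pos) breakpoints.
translocations = [
    (("chr1", 10_500), ("chr7", 88_000)),
    (("chr2", 40_000), ("chr9", 12_345)),
]
long_read_insertions = [("chr1", 10_450), ("chr5", 70_000)]

suspects = suspect_translocations(translocations, long_read_insertions)
print(suspects)
```

In the invented data, only the first translocation is flagged, because its chr1 breakpoint sits 50 bases from a long-read insertion.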
Next, the researchers tested the tools on a more complex sample, a breast cancer cell line. For this sample, they tested NGMLR and Sniffles only with PacBio data, identifying 15 gene fusions, all of which were validated with PCR.
Schatz said he is continuing to apply these tools with both PacBio and Oxford Nanopore sequencing technology. In collaborations with Dick McCombie and David Spector at Cold Spring Harbor Laboratory and Northwell Health, as well as Winston Timp's group at Johns Hopkins, the researchers are continuing to use long-read technology to sequence the breast cancer cell line as well as patient tumor samples, Schatz said.
"We're finding tens of thousands of variants that are missed using short-read sequencing," Schatz said, including in "really important genes like BRCA1." The researchers are currently trying to understand these novel variants and whether they have a functional role.
Outside of cancer, Johns Hopkins has a pilot project that uses long-read sequencing to evaluate patients who have a suspected genetic disease but have not been diagnosed by other means, including exome sequencing.
In addition, he noted that his group is collaborating with the National Human Genome Research Institute's ENCODE project on basic research to evaluate structural variants that have been identified through long-read sequencing to determine whether and how those variants impact gene expression and regulation.
Schatz and his colleagues are not the only group that is turning toward long-read sequencing technology to study structural variants. For instance, a Jackson Laboratory team led by Chia-Lin Wei also developed a bioinformatics tool specifically for long-read sequencing technology that it is using in conjunction with Oxford Nanopore technology to sequence cancer samples. That work, previously described on the bioRxiv preprint server, was also published in Nature Methods this week. Loman also noted that his group did SV calling on a nanopore-sequenced human genome using SVTyper, a tool originally described in 2015 by researchers from Washington University, the University of Utah, and the University of Virginia.
Schatz said that the advances in long-read sequencing technology are "really exciting" for the field and that both PacBio and Oxford Nanopore have promising roadmaps to reduce the costs of sequencing on their platforms, including Oxford's PromethION and PacBio's new version of its chip. He thinks the technology holds a lot of promise for characterizing structural variants, noting that the amount of novel variation being discovered exceeds what has previously been reported for SNVs. "We can finally see things that we've never seen before," he said.