BALTIMORE – Researchers at the Dana-Farber Cancer Institute and Harvard Medical School have found a way to remedy frequent basecalling errors in nanopore sequencing of telomere regions by tuning the basecalling algorithm.
In a study published last month in Genome Biology, they demonstrated that nanopore sequencing "frequently miscalled" telomeric repeats in various organisms, with "extensive" basecalling errors found across datasets, sequencing platforms, basecallers, and basecalling models. By training the nanopore basecaller with accurate telomeric sequences, the researchers were able to correct these errors while preserving the accuracy of the sequence in other parts of the genome.
"I think this study really highlights the issue when you are trying to study highly repetitive regions with the nanopore technology," said Kar-Tong Tan, a graduate student at Dana-Farber Cancer Institute and Harvard Medical School and the first author of the paper. "It's something you have to be very aware of when you're looking at the data."
Tan, who is working in the labs of Heng Li and Matthew Meyerson, the senior authors of the paper, said the study stemmed from the team's initial efforts to investigate telomeres in cancers using long-read sequencing technologies. Typically made up of TTAGGG repeats of various lengths in many organisms, telomeres play an important role in a variety of biological processes, including cancer biology, but are hard to analyze using short-read sequencing technologies due to their highly repetitive nature.
Currently, Pacific Biosciences high-fidelity (HiFi) sequencing and nanopore sequencing are the two major long-read sequencing platforms used to tackle long repetitive elements in the genome. "Both platforms have their own pros and cons," Tan said, adding that while PacBio sequencing tends to generate more accurate data, the flexibility, rapid turnaround time, and high accessibility afforded by nanopore sequencing also make the technology "very attractive."
When applying nanopore sequencing to the telomeric regions of the CHM13 sample, a human reference genome that was recently sequenced and assembled by the Telomere-to-Telomere (T2T) Consortium, Tan's team "surprisingly observed" that instead of the TTAGGG repeats, the telomere sequences were frequently represented by TTAAAA repeats.
"My first reaction was that it could be one possibility, [which is] a new telomere sequence that we have not seen before," said Li, a biomedical informatics professor at Harvard and Dana-Farber and Tan's mentor. However, because these sequences were not observed in the CHM13 reference genome or in PacBio HiFi sequences from the same site, the discrepancies were more likely to be artifacts of nanopore sequencing than real biological telomere variations, the researchers said.
After further characterizing these repeat-calling errors in various nanopore sequencing datasets, the team concluded that they occur broadly across different basecallers and basecalling models, such as Guppy5 and Bonito, both developed by Oxford Nanopore Technologies, as well as across different sequencing platforms, including the Oxford Nanopore MinION, GridION, and PromethION. The analysis also revealed that these basecalling errors may be observed in other repetitive regions within the genome beyond telomeres, or in parts of the genome that harbor telomere-like repeat sequences.
Additionally, the authors analyzed nanopore sequencing data from eight model organisms, including Caenorhabditis elegans, chicken, mouse, and zebrafish, and found that repeat calling errors occurred in their telomeres, too.
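The kind of check that brings such errors to light can be illustrated with a short Python sketch. It is not the authors' pipeline: the function names, the toy reads, and the 20 percent threshold are assumptions made for illustration. It measures how much of a read is covered by the expected TTAGGG repeat versus the TTAAAA pattern described above, and it looks at one strand only.

```python
def motif_fraction(seq: str, motif: str) -> float:
    """Fraction of the read covered by non-overlapping copies of the motif."""
    if not seq:
        return 0.0
    # str.count counts non-overlapping occurrences.
    copies = seq.upper().count(motif.upper())
    return copies * len(motif) / len(seq)


def looks_miscalled(seq: str, artifact: str = "TTAAAA",
                    min_fraction: float = 0.2) -> bool:
    """Flag a read when the artifact repeat covers a large share of it.

    The 0.2 cutoff is an arbitrary illustrative value, not one from the study.
    """
    return motif_fraction(seq, artifact) >= min_fraction


# Toy reads (hypothetical, not real sequencing data).
clean_read = "TTAGGG" * 10
suspect_read = "TTAGGG" * 4 + "TTAAAA" * 6

for name, read in [("clean", clean_read), ("suspect", suspect_read)]:
    print(name,
          round(motif_fraction(read, "TTAGGG"), 2),
          round(motif_fraction(read, "TTAAAA"), 2),
          looks_miscalled(read))
```

In practice, a flagged read would still need to be compared against a trusted reference or an orthogonal technology, as the researchers did with the CHM13 assembly and PacBio HiFi data, before the unusual repeat is attributed to basecalling rather than biology.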
To help remedy these errors, the researchers sought to tune the deep neural network model underlying the nanopore basecaller with more telomere training data. "What we basically did was that we took that initial deep learning model that they had, gave it some extra examples of the correct telomere repeat sequences, and modified the deep learning model slightly such that it then basecalls a telomere repeat correctly," Tan explained.
Specifically, the researchers trained the Bonito basecaller with the ground truth telomeric sequences of the CHM13 reference genome obtained by the T2T Consortium. They also performed two PromethION runs on the CHM13 sample, using the data from one run to tune the basecaller and the data from the other to evaluate the tuned basecaller. To avoid over-tuning, which might negatively impact other parts of the genome, the team applied a low learning rate during the tuning process.
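Why a low learning rate guards against over-tuning can be seen in a toy Python sketch. This is not Bonito and not the study's training code: the two-weight linear model, the example data, and the learning rates are all assumptions chosen to keep the arithmetic simple. A "pretrained" model already fits one general example, and it is then nudged toward a new example it gets wrong.

```python
def predict(weights, x):
    """Linear model: dot product of weights and features."""
    return sum(w * xi for w, xi in zip(weights, x))


def fine_tune(pretrained, examples, learning_rate, steps):
    """Gradient descent on squared error, starting from pretrained weights."""
    weights = list(pretrained)
    for _ in range(steps):
        for x, y in examples:
            err = predict(weights, x) - y
            weights = [w - learning_rate * err * xi
                       for w, xi in zip(weights, x)]
    return weights


def abs_error(weights, example):
    x, y = example
    return abs(predict(weights, x) - y)


def drift(a, b):
    """Euclidean distance between two weight vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


# Hypothetical toy data, standing in for "rest of genome" and "telomere" examples.
pretrained = [1.0, 0.0]
general_example = ((1.0, 0.0), 1.0)   # already predicted correctly
new_examples = [((1.0, 1.0), 2.0)]    # mispredicted before tuning

cautious = fine_tune(pretrained, new_examples, learning_rate=0.01, steps=20)
aggressive = fine_tune(pretrained, new_examples, learning_rate=0.4, steps=20)

for name, w in [("cautious", cautious), ("aggressive", aggressive)]:
    print(name,
          "new-example error:", round(abs_error(w, new_examples[0]), 3),
          "general error:", round(abs_error(w, general_example), 3),
          "drift:", round(drift(w, pretrained), 3))
```

With the same number of steps, the smaller learning rate moves the weights less from their pretrained values, so the previously correct example is disturbed less, at the cost of a slower fix for the new example. A real basecaller has far more parameters and far more data, so the toy shows only the direction of the trade-off, not its size.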
The results showed that, when compared with the CHM13 reference genome, nanopore sequencing data processed by the tuned basecaller showed "a drastic reduction" of the repeat-calling errors in the telomeric regions while having minimal negative impact on other genomic regions, Tan said.
"I think it's a lovely little paper," said Winston Timp, a biomedical engineering professor at Johns Hopkins University. "The application of this is something that I have wanted to see for some time, and I'm really happy that somebody did it, and it's published."
Timp, who has longstanding experience with nanopore sequencing, said he and his collaborators have previously worked to measure the length of the telomeres in yeast using nanopore sequencing. "One of the things we wanted to do was to look for errors in the repeats that are being introduced, especially in indels," he said. "But we found that was really challenging to do with conventional nanopore tools, in part because of the issue that [this paper] reported here."
In addition to resolving the repeat-calling errors in the telomeric regions, Timp said, the basecaller-tuning approach presented in this study could offer a blueprint to help improve the accuracy of other parts of the genome with repetitive sequences that might be error-prone for nanopore sequencing.
However, he cautioned against overtraining the basecaller, which could erode overall basecalling accuracy and cause real biological changes in the genome to be missed.
Mirroring Timp's point, Tan said that although the team has only tested the method in human telomere samples for this study, the approach would presumably work in other organisms that also have TTAGGG telomere repeats. "Theoretically, if you use the exact same model that we have trained for human [DNA], it should more or less work for these species," he said.
That said, Tan noted that moving forward, the team is also hoping to develop other tuned basecaller models for organisms that do not have the TTAGGG telomere repeats. Additionally, as the study was done using the Oxford Nanopore R9 chemistry, another future direction for the team is to test out the new R10 chemistry and update the tuning model accordingly.
Eventually, with nanopore sequencing's accuracy continuing to improve, Tan said he hopes the technology will allow researchers to obtain accurate results on repetitive regions without the need for his method.
"In a way, what we have is kind of like plastering over the issue right now," he said. "I do hope for the day when people don't have to apply our pipeline, [and] they can just directly apply what [Oxford] Nanopore provides."