CHICAGO – Researchers from NuProbe, Rice University, and Microsoft Research UK have developed a novel deep-learning method to predict DNA sequencing depth from the sequence of DNA probes with up to 99 percent accuracy.
The researchers built the predictive computational method, described in a paper published this week in Nature Communications, on a recurrent neural network and tested it with next-generation sequencing (NGS) panels containing nearly 40,000 probes. They then validated their technique and determined that it can predict sequencing depth with 93 percent accuracy for a single-nucleotide polymorphism (SNP) panel, 99 percent accuracy for a panel targeting nonhuman sequences for DNA information storage, and 89 percent accuracy for a long noncoding RNA (lncRNA) panel when the model was trained on the SNP panel.
According to the paper, the same model can also predict the measured single-plex kinetic rate constants of DNA hybridization and strand displacement.
David Yu Zhang, cofounder and head of innovation for Houston- and Shanghai-based molecular diagnostics and technology company NuProbe, and one of two corresponding authors of the article, said he was not aware of any previous deep-learning model for predicting NGS depth.
"The whole point of the deep-learning model that we developed is to predict when certain sequences would have low binding yields," Zhang said. "We can compensate for it by either designing more probes or increasing titrations."
Siyuan Chen, chief technology officer of Twist Bioscience, said via email that his company has multiple proprietary processes to analyze sequencing data, including one machine learning-based method that helps customize probes. That, according to Chen, "yielded unparalleled uniformity in sequencing coverage in the industry." However, Twist has not publicly detailed such processes.
Twist was not involved in the research described in the paper, though the researchers used probes synthesized by that company.
In their article, the authors wrote that their computational model "could inform the selection of probe sets with higher uniformity and modulation of probe concentrations to achieve higher uniformity."
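As a rough illustration of what "modulation of probe concentrations" could look like in practice, the following Python sketch rescales each probe's concentration so that probes with low predicted depth get proportionally more material. It is not the authors' method: the function name, the probe IDs, the median target, and above all the assumption that depth scales linearly with probe concentration are hypothetical simplifications made for the example.

```python
from statistics import median


def rebalance_concentrations(predicted_depth, base_conc=1.0, max_fold=10.0):
    """Suggest per-probe concentrations that push predicted depth toward the median.

    ASSUMPTION (not from the paper): sequencing depth is proportional to
    probe concentration, so a probe predicted at half the median depth
    gets twice the base concentration. Fold changes are clamped to
    [1/max_fold, max_fold] to avoid extreme adjustments.
    """
    if not predicted_depth:
        return {}
    target = median(predicted_depth.values())
    adjusted = {}
    for probe, depth in predicted_depth.items():
        # A probe predicted to yield no reads gets the maximum boost.
        fold = max_fold if depth <= 0 else target / depth
        fold = min(max(fold, 1.0 / max_fold), max_fold)
        adjusted[probe] = base_conc * fold
    return adjusted


# Hypothetical predicted depths for three probes.
suggested = rebalance_concentrations({"probe_1": 100, "probe_2": 50, "probe_3": 200})
```

In this toy case the median predicted depth is 100, so the sketch leaves probe_1 unchanged, doubles probe_2, and halves probe_3. A real workflow would need to calibrate the depth-to-concentration relationship experimentally.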
They also said that they have patents pending for X-probes and their model of rate-constant prediction developed for this research.
Targeted sequencing panels are often used to detect somatic mutations, but each DNA hybridization probe will have slightly different binding kinetics, causing bias relative to a whole-genome sequence. "The enrichment of the genes of interest ends up having non-uniformity due to the properties of the DNA binding," Zhang said.
In building their deep-learning model, the researchers chose a recurrent neural network, like those widely used in commercial speech recognition and natural-language processing software, because they determined that this type of network is well suited to capturing both short-range and long-range interactions within DNA probes. They wrote that conventional feed-forward neural networks and convolutional neural networks, which have a fixed number of input points, are "not well-suited for DNA sequence inputs."
For example, DNA can form what Zhang called a "massive hairpin," in which the first base binds to the last base, the second base binds to the penultimate base, and so on. "You can have a very long if/then statement or something that can be recognized by recurrent neural networks that really cannot be recognized by convolutional neural networks because of the pooling and layers," he said.
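The two ideas in this passage can be sketched in a few lines of Python. The first function counts how many bases pair end to end in the hairpin pattern Zhang describes (first with last, second with penultimate), which is the kind of long-range dependency at issue. The second runs a bare-bones recurrent pass over a one-hot-encoded sequence to show that a recurrent network accepts probes of any length and still returns a fixed-size summary. Neither is the published model: the function names, the hidden size, and the random, untrained weights are all assumptions chosen for illustration.

```python
import math
import random

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}


def hairpin_stem_length(seq):
    """Count consecutive Watson-Crick pairs formed from the two ends inward.

    Toy check only: it ignores loop-size limits and thermodynamics.
    """
    seq = seq.upper()
    i, j, stem = 0, len(seq) - 1, 0
    while i < j and COMPLEMENT.get(seq[i]) == seq[j]:
        stem += 1
        i += 1
        j -= 1
    return stem


def rnn_encode(seq, hidden_size=8, seed=0):
    """Run a minimal Elman-style recurrent pass over a DNA sequence.

    Weights are random and UNTRAINED (an assumption for illustration);
    the point is that the same weights are reused at every base, so the
    input length is not fixed in advance.
    """
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(hidden_size)]
    w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
             for _ in range(hidden_size)]
    h = [0.0] * hidden_size
    for base in seq.upper():
        x = ONE_HOT[base]
        # New hidden state depends on the current base and the previous state.
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[k], x))
                + sum(w * hj for w, hj in zip(w_rec[k], h))
            )
            for k in range(hidden_size)
        ]
    return h


stem = hairpin_stem_length("GCGCAAAAGCGC")  # four G-C pairs close the stem
short_state = rnn_encode("ACGT")            # 4-base input
long_state = rnn_encode("ACGT" * 20)        # 80-base input, same output size
```

Because the hidden state is carried from the first base to the last, information about the start of the probe is in principle still available when the network reads the end, which is what lets a trained recurrent model pick up pairings between distant positions.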
Zhang started a two-year leave of absence on July 1 from his faculty appointment in bioengineering at Rice University to help steer NuProbe through a growth phase with an eye toward an eventual initial public offering.
NuProbe, founded in 2016, originated at Rice and Harvard's Wyss Institute; another cofounder, Peng Yin, is a systems biologist at Harvard Medical School. The firm, which announced a $42 million funding round earlier this year, offers target enrichment assays and reagents for NGS-based cancer diagnostic tests.
Once they built the recurrent neural network, the research team trained the deep-learning model with a 39,145-plex SNP panel and, independently, with a 7,373-probe synthetic panel. A 2,000-probe lncRNA panel helped the researchers validate their computing method.
"We treated each prediction as a separate labeled instance, but what we want to do eventually is to have lots of independent NGS libraries to look at," Zhang said. He noted that making an NGS library is both computationally and labor intensive; it currently takes about three days to make one library.
Zhang said that NuProbe is now working with Microsoft Research to make this technique generalizable to longer DNA sequences. He said the research team has not yet explored whether the deep-learning model can be applied to targeted sequencing with other sequencing technologies, but noted that NuProbe does have a separate relationship with Oxford Nanopore Technologies.
"Later on, we would like to apply machine learning to many aspects of genomics, but we're taking baby steps," Zhang said.
Twist Bioscience's Chen said that the research represents a "viable hypothesis," adding that he believes that the NuProbe-Microsoft method can indeed be applied to DNA sequences of varying lengths. "It will be interesting to watch the progress and see what evolves," he said.
Zhang said that the architecture of this first iteration of the neural network and the deep-learning model might not be optimal yet. "We are certainly exploring many other kinds of neural network architectures," he said. "I do think that there's a lot of room for additional improvement in terms of either transferring learning to other genomics problems [or making] a more complex network that can further improve the accuracy of our predictions."
The software code is available on GitHub, but Zhang does not expect many members of the open-source programming community to download and add to the technology just yet. "It's kind of clunky, so probably it makes more sense for the Microsoft side to make it into something that's more accessible."