SAN FRANCISCO (GenomeWeb) – A Peking University team has developed a fluorogenic sequencing method and demonstrated that it has the potential to be highly accurate, generating perfect reads up to 200 bases.
The researchers have filed for patents on the technology and licensed it to Cygnus Biosciences, a Beijing-based startup, and are collaborating with the company to develop a higher throughput instrument that they hope to commercialize in about two years.
The work, which was published this week in Nature Biotechnology, builds on a proof-of-concept study published in 2011, which introduced fluorogenic pyrosequencing, a method that sought to combine the long-read benefits of pyrosequencing with fluorescence-based detection.
Yanyi Huang, senior author of the study and professor of materials science and engineering at Peking University, said that the major advances from the proof-of-concept study include the use of improved fluorophore substrates that produce stronger signals as well as a technique the researchers developed called error-correction code sequencing, which essentially involves three rounds of sequencing.
The researchers have developed a prototype instrument, and Huang said the next step is to increase its throughput. He said the technique would be particularly useful in applications such as "fetal genetic mutation detection in maternal blood and rare mutation identification in circulating tumor DNA or in highly heterogeneous tumor tissues."
Robert Sebra, an associate professor of genetics and genomic sciences at the Icahn School of Medicine at Mount Sinai, who was not involved with the study and who wrote an accompanying editorial in Nature Biotechnology, said that the team "essentially designed a sequencing strategy from nuts to bolts."
In the study, the researchers build on the fluorogenic sequencing approach described in 2011, which involves using terminally labeled dNTPs whose label becomes fluorescent in the presence of a phosphatase.
Huang said the fluorogenic sequencing approach aims to combine the advantages of cyclic reversible termination, like that used in Illumina's instruments, with those of single-nucleotide addition techniques like those used by Thermo Fisher's Ion Torrent. Cyclic reversible termination approaches use fluorescence to signal nucleotide incorporation, which is "stable and highly efficient," Huang said, but "the chemistry is complex and leaves scars on the nascent strand after cleaving the fluorescent tags." That causes errors and limits read length. Single-nucleotide addition has the potential to generate longer reads because it does not damage the DNA, but its detection method has a low signal-to-noise ratio, he said.
In the current study, the researchers improved on the previous fluorogenic sequencing technique by refining the sequencing chemistry and designing new methods for error correction and bioinformatics.
To improve on the chemistry, the team used a fluorophore known as Tokyo Green, which essentially has a much higher signal-to-noise ratio. The Tokyo Green fluorophore is "much brighter, which produces a stronger signal, and has a narrower fluorescence spectrum for the detector to gather more light," Huang said. In addition, "it is dark enough before the reaction to generate sufficient contrast."
The second major advance is what the researchers call error-correction code sequencing. In the sequencing process, which the researchers call degenerate-base fluorogenic sequencing, single-stranded DNA templates are grafted onto the surface of a glass flow cell and annealed to a sequencing primer, whose 3' end serves as the starting point for the sequencing-by-synthesis reaction.
In each sequencing cycle, a mix of polymerase, phosphatase, and fluorogenic nucleotides reacts with the DNA template. When the polymerase incorporates the correct nucleotide, it releases a non-fluorescent label, which the phosphatase then converts into a brightly fluorescent molecule through dephosphorylation. In the previous study, the researchers introduced one of the four substrates into the reaction in each cycle; here, the team instead uses mixtures of two bases. "There are three different ways to mix the four bases," Huang said. So, they essentially "sequence the same DNA molecule using the three different mixing ways," which "provides extra information to identify and rectify the sequencing errors," he said.
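The idea of reading a template through two-base mixtures can be illustrated with a short sketch. The three pairings used below (A/C vs. G/T, A/G vs. C/T, A/T vs. C/G) are the three possible ways to split the four bases into two pairs; whether these are the exact mixtures the team used is an assumption here. Each round reduces the template to a binary string that records only which mixture produced a signal at each position:

```python
# Illustrative sketch: the three ways to split {A, C, G, T} into two two-base
# mixtures. In each round, every incorporated base reports only which of that
# round's two mixtures it belonged to, so one round yields one bit per base.
# (The specific pairings are assumed for illustration.)
PARTITIONS = [
    ({"A", "C"}, {"G", "T"}),   # round 1
    ({"A", "G"}, {"C", "T"}),   # round 2
    ({"A", "T"}, {"C", "G"}),   # round 3
]

def degenerate_read(template, partition):
    """Reduce a DNA template to the binary string one round would observe."""
    first, _ = partition
    return "".join("0" if base in first else "1" for base in template)

template = "GATTACA"
for partition in PARTITIONS:
    print(degenerate_read(template, partition))
# Any two of the three strings are enough to identify each base; the third
# supplies the redundancy the error-correction code exploits.
```

Note that this sketch ignores a real complication of flow-based chemistries: consecutive bases from the same mixture merge into one signal, which is part of what the team's decoding algorithms have to untangle.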
Sebra said that the error-correction code technique was "what makes their technology the most novel." Essentially, he said, the method involves doing three orthogonal sequencing runs. "It's a physical way to do error correction as opposed to algorithmically," he said.
The resulting so-called "degenerate sequences," which record only which two-base mixture was incorporated at each position rather than the base itself, then have to be decoded into the actual sequence. To do this, the researchers designed algorithms, similar to string graph assembly methods for single-molecule sequencing, that decode the two-base fluorescent signals and construct the most likely sequence. "The string graph approach is similar to network analysis and string-graph assembly methods, but in this case, the researchers are doing it base-by-base," Sebra said.
The algorithms rely on principles of information theory, Sebra added, essentially translating the degenerate sequences into binary strings that are decoded into error-corrected sequences.
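A minimal sketch can show where the redundancy comes from, assuming three hypothetical two-base pairings (A/C vs. G/T, A/G vs. C/T, A/T vs. C/G): each base maps to three bits, one per round, and the third bit always equals the XOR of the first two. Any two rounds identify the base, and the third acts as a parity check that flags a position where one of the reads erred. The bit convention below is illustrative, not taken from the paper, and the paper's actual decoder does more than this (it corrects errors using a string-graph search rather than merely detecting them):

```python
# Minimal sketch: combine three per-base binary reads into a base call, with
# a parity check that flags discordant positions. Bit convention (assumed):
# A=(0,0,0), C=(0,1,1), G=(1,0,1), T=(1,1,0), so bit3 = bit1 XOR bit2.
DECODE = {(0, 0): "A", (0, 1): "C", (1, 0): "G", (1, 1): "T"}

def decode(bits1, bits2, bits3):
    """Return (sequence, positions where the parity check failed)."""
    seq, errors = [], []
    for i, (b1, b2, b3) in enumerate(zip(bits1, bits2, bits3)):
        b1, b2, b3 = int(b1), int(b2), int(b3)
        if b3 != b1 ^ b2:      # parity violated: one of the reads erred here
            errors.append(i)
        seq.append(DECODE[(b1, b2)])
    return "".join(seq), errors

# Error-free reads of GATTACA under the three assumed pairings decode cleanly:
print(decode("1011000", "0011010", "1000010"))   # ('GATTACA', [])
# Flip one bit in the third read and the parity check flags position 2:
print(decode("1011000", "0011010", "1010010"))   # ('GATTACA', [2])
```

A lone parity bit can only detect a single-read error, not say which read was wrong; the published method resolves that ambiguity with additional context from the signals and the graph-based search, which this sketch omits.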
In the study, the team demonstrated the approach on the lambda phage genome. The researchers sequenced three lambda phage DNA templates, showing that they could sequence up to 200 bases error-free. They calculated raw accuracy at 99.82 percent over the first 100 bases and 99.45 percent over the first 200 bases, with error-correction code decoding reducing the cumulative error rate from 0.96 percent to 0.33 percent over 250 bases.
Sebra said that although the work was novel and very interesting, the team has a lot more work to do to develop a sequencing system for commercial use. "It's not yet high throughput, which makes it hard to estimate error rates," he said. To really understand a sequencing technology's systematic errors, many more sample types would have to be sequenced, he said.
Nonetheless, he anticipated that the study would generate a lot of interest. The study tackles the issue of sequencing accuracy and quality, an area in which there has only been incremental improvement in recent years, with sequencing technology companies instead focusing primarily on read length and throughput.
In addition, he anticipated that the error-correction approach would become more difficult at longer read lengths. As with other sequencing methods such as Illumina's or Thermo Fisher's Ion Torrent, "the accuracy tends to trail off as you go further and further," he said. That is a known issue, often due to changes in enzyme kinetics or the likelihood of encountering a homopolymer or another DNA structure, and one he said would likely apply to this technology as well.
Huang said that the team is continuing to develop the technology and is especially working to increase the throughput by incorporating "novel approaches in microfluidics, instrumentation, and bioinformatics." He said that the goal is to have a commercial instrument within a couple of years.