A team from the University of Texas at Austin and the University of California, San Francisco has come up with DNA circularization-based sample preparation and bioinformatics methods for bolstering the accuracy of high-throughput DNA sequencing.
As they reported online last week in the Proceedings of the National Academy of Sciences, the researchers introduced DNA circularization and rolling circle amplification steps prior to standard sequencing sample prep — an approach that lets them assess each stretch of targeted sequence several times before computationally deciphering a consensus sequence from these repeats.
"Our method is basically a way of creating copies of the information from each molecule, so that there are multiple chances to look at these things," co-first author Jeffrey Hussmann told In Sequence.
"Hopefully, it won't get them wrong every time, so you can use those copies of the information to subtract some errors," said Hussmann, a graduate student supervised by senior author Sara Sawyer and co-author William Press, both at the University of Texas at Austin.
When paired with the Illumina MiSeq instrument for the team's proof-of-principle sequencing experiments on the yeast model organism Saccharomyces cerevisiae, for instance, the circle sequencing method diminished base call errors by several orders of magnitude.
That made it possible to bring the high-throughput platform's accuracy to a level approaching that obtained using traditional-but-pricey Sanger sequencing methods, the researchers reported, while offering an efficiency edge over barcode-based error-correction approaches that have been coupled with next-generation sequencing in the past.
Even so, those involved in the study noted that error-correction methods, in general, remain best suited to situations where especially high accuracy is required. Given the added reads needed to achieve a given depth of coverage, they explained, such approaches are typically reserved for experiments where accuracy is imperative, such as finding rare variants and/or characterizing highly heterogeneous samples.
"Conventional sequencing is probably still the best method for doing something like whole-genome sequencing," co-first author Dianne Lou, a graduate student in senior author Sara Sawyer's molecular biosciences lab at the University of Texas at Austin, told IS. "But if you want to look at variants within a population, you're not going to use conventional sequencing for that. You want a more accurate method."
For their part, Hussmann, Lou, and their colleagues plan to work with other labs on the University of Texas at Austin campus to use the circle sequencing approach to look at genetic variation in heterogeneous sample sets.
The method, in general, shares some similarities with the circular consensus sequencing strategy already being used to pare down the error rate associated with Pacific Biosciences' single-molecule, real-time sequencing methods.
That approach also centers on a consensus sequence developed using data from a polymerase working its way around a topologically closed DNA circle, Hussmann said. But there are differences as well.
For one, circular consensus reads produced with the PacBio protocol contain adaptor sequences introduced during library preparation. In contrast, the new circle sequencing method uses template DNA alone that has been circularized with the help of an enzyme called circle ligase.
Rather than reading around the DNA circle itself during sequencing, as in the production of PacBio circle consensus reads, the circle sequencing method uses rolling circle amplification to produce repeats from the starting template.
Each trip around the circle by the Phi29 polymerase enzyme used for amplification produces an independent copy of that template, resulting in a molecule with replicas of the initial sequence placed back-to-back. Collections of those repeat-containing molecules can then be fed into a typical library prep protocol suited to the sequencing instrument at hand.
As such, stretches of starting material are essentially amplified and sequenced several times, making it possible to produce read families that represent each original template. And because each copy is produced independently, errors introduced during amplification aren't propagated.
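The amplify-then-vote idea can be illustrated with a toy simulation (this is an illustrative sketch, not the authors' pipeline; the 150-base template, 1 percent per-base error rate, three repeats, and simple majority vote are all assumptions chosen for demonstration):

```python
import random

BASES = "ACGT"

def copy_with_errors(template, error_rate, rng):
    """One polymerase pass around the circle: each base is independently
    miscopied with probability error_rate."""
    return "".join(
        rng.choice([x for x in BASES if x != b]) if rng.random() < error_rate else b
        for b in template
    )

def rolling_circle_product(template, n_repeats, error_rate, rng):
    """Concatenate independent copies of the circularized template,
    mimicking a rolling circle amplification product: because each copy
    is made separately, an error in one copy is not propagated into the next."""
    return "".join(copy_with_errors(template, error_rate, rng)
                   for _ in range(n_repeats))

rng = random.Random(0)
template = "".join(rng.choice(BASES) for _ in range(150))
read = rolling_circle_product(template, 3, 0.01, rng)

# Split the read back into its repeats and take a per-position majority vote.
repeats = [read[i:i + len(template)] for i in range(0, len(read), len(template))]
consensus = "".join(max(BASES, key=col.count) for col in zip(*repeats))
```

Because the simulated errors are independent across copies, an error in one repeat is usually outvoted by the matching positions in the other repeats.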
"Each read that comes out of the sequencing machine is reading a molecule that is a rolling circle amplification product with multiple copies that are physically linked to each other," Hussmann said.
By pairing the circle sequencing protocol with high-throughput sequencing platforms that already have relatively modest error rates, he explained, it becomes possible to get error-corrected reads with even higher accuracy.
"No individual piece of the process is particularly exotic," Hussmann said. But as read lengths on high-throughput platforms have stretched out, he added, it has become feasible to put these steps together into a circle sequencing pipeline.
For their current proof-of-principle study, the researchers focused on the 12-million-base S. cerevisiae genome, using 51-fold conventional sequencing to first get a sense of how the strain on hand, called S288C, differs from the yeast reference genome.
"Once we had done that, we felt confident that we had a modified reference genome that was accurate for the strain we were looking at," Hussmann said.
For their circle sequencing experiments, the researchers diced up yeast genomic DNA and denatured it to form single-stranded pieces. After circularizing those molecules with circle ligase, they then amplified the DNA using the Phi29 polymerase enzyme and random primers, which also help to re-form double-stranded DNA during the rolling-circle amplification.
From there, the study's authors did standard sequencing library prep and paired-end sequencing on Illumina's MiSeq platform, though they noted that "rolling circle products generated in circle sequencing can theoretically be sequenced on any high-throughput sequencing platform that offers read lengths long enough to observe multiple repeats within the same product."
When they scoured the resulting data to search for remaining experimental errors or computational artifacts, the investigators initially saw signs of DNA damage, particularly deamination that appeared to be introducing errors into the circle sequencing data.
Once they tackled that potential source of genetic glitches by adding DNA repair enzymes to the circle sequencing sample prep protocol, the team saw a pronounced increase in per-base accuracy with the circularization-based error-correction method.
Because the circle sequencing approach relies on ligation between naturally occurring DNA sequences without an adaptor sequence, the group also had to devise computational methods that both defined the boundaries between repeats in a given read and spit out a consensus sequence from those repeats.
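One way to attack that problem is to treat the read as a periodic signal: score each candidate repeat length by how well the read agrees with a shifted copy of itself, then vote across positions that fall on the same spot of the circle. The sketch below takes that approach as an illustration; it is not the authors' published algorithm, and the function names and parameters are assumptions:

```python
def best_period(read, min_p, max_p):
    """Score each candidate repeat length p by the fraction of bases that
    agree between the read and a copy of itself shifted by p; the true
    repeat length should give the highest agreement."""
    def agreement(p):
        pairs = list(zip(read, read[p:]))
        return sum(a == b for a, b in pairs) / len(pairs)
    return max(range(min_p, max_p + 1), key=agreement)

def consensus_mod_p(read, p):
    """Collapse the read onto one circle's worth of sequence: bases at
    positions congruent mod p came from the same spot on the original
    circle, so a majority vote across them corrects isolated errors."""
    columns = [[] for _ in range(p)]
    for i, base in enumerate(read):
        columns[i % p].append(base)
    return "".join(max(set(col), key=col.count) for col in columns)
```

For a read built from a few tandem copies of a short unit, `best_period` recovers the unit length, and `consensus_mod_p` corrects an isolated substitution by outvoting it with the other repeats.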
The number of repeats needed for error correction is expected to vary depending on factors such as the actual heterogeneity present within a given sample, Hussmann noted, as well as a sequencing platform's typical error rate and error sources.
The group's analysis indicated that much of the error correction could be carried out with just two copies of each template, though there were additional accuracy benefits to producing even more repeats per read.
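A back-of-envelope model shows why the second copy does most of the work: a wrong consensus base requires independent copies to agree on the same wrong base. The brute-force calculation below is my illustrative assumption, not the paper's analysis; it assumes substitution errors are uniform over the three alternative bases and treats ties between bases as no-calls rather than errors:

```python
from itertools import product

def consensus_error(e, k):
    """Per-base probability that a k-copy consensus is wrong, given each
    copy independently substitutes the base with probability e (uniform
    over the 3 alternatives). A call counts as wrong only when one wrong
    base strictly outvotes the correct base and every other wrong base."""
    wrong = 0.0
    # outcome 0 = correct base; outcomes 1-3 = the three possible wrong bases
    for combo in product(range(4), repeat=k):
        p = 1.0
        for c in combo:
            p *= (1 - e) if c == 0 else e / 3
        counts = [combo.count(v) for v in range(4)]
        top_wrong = max(counts[1:])
        if top_wrong > counts[0] and counts[1:].count(top_wrong) == 1:
            wrong += p
    return wrong
```

Under these assumptions, a raw per-base error rate of 1 percent drops to e²/3, roughly 3.3 × 10⁻⁵, once two copies must agree, since disagreements become no-calls rather than miscalls; additional repeats allow stricter filters on top of that.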
In the current study, the researchers aimed for three repeats per read, or overall circle lengths of around 150 bases, though the DNA shearing process typically produces a size distribution around that target.
The read lengths that can be achieved using the MiSeq have increased since the study was performed, though, meaning longer starting templates could be compatible with circle sequence reads on that instrument.
The increased sequence data needed to represent each stretch of targeted DNA with multiple repeats is expected to dial up the cost of circle sequencing compared to conventional sequencing, though that overall price varies with the instrument used.
Nevertheless, the study authors concluded that circle sequencing could offer both cost and efficiency advantages over barcoding-based alternatives for doing error-correction on high-throughput platforms.
"It's certainly more expensive than not doing it," Hussmann said. "You're spending money to get higher quality data."