This article has been updated to clarify the name of the new correction algorithm and the authorship of the PBcR paper as well as previously reported statements about the accuracy of PacBio reads.
Pacific Biosciences is working on several fronts to improve the accuracy of de novo assemblies generated on its single-molecule sequencer and could release some informatics updates to enable this capability before the end of the year, BioInform learned this week.
Jonathan Bingham, product manager for software and informatics at PacBio, told BioInform that the company is working to integrate with its SMRT software a version of the Celera Assembler that includes a correction algorithm developed by researchers at Cold Spring Harbor Laboratory and elsewhere.
The above method was the subject of one of two Nature Biotechnology articles published this week describing methods for improving the accuracy of PacBio RS assemblies. The first author on the paper was Sergey Koren, a bioinformatics scientist at the National Biodefense Analysis and Countermeasures Center.
The paper describes the PacBio Corrected Reads, or PBcR, correction algorithm, which combines PacBio reads (median length of more than 2,000 bases, but average nucleotide accuracy of less than 85 percent) with shorter, more accurate reads from the Illumina and 454 platforms. The authors claim in the paper that PBcR improved read accuracy from as low as 80 percent to more than 99.9 percent.
The second Nature Biotech paper, by PacBio researchers and their colleagues, used a hybrid assembly approach similar to the one described by Koren et al.
The PacBio researchers used a combination of "scaffolding, overlap-layout-consensus, and error-correction methods" to assemble the genome of a cholera strain using reads generated by PacBio, Illumina, and Roche 454 sequencers. That approach, which used PacBio's A Hybrid Assembler, or AHA, scaffolding algorithm, achieved overall accuracy of more than 99 percent.
Meanwhile, the PBcR method developed by Koren et al. "trims and corrects individual long-read sequences by first mapping short-read sequences to them and computing a highly accurate hybrid consensus sequence," the paper states. These corrected reads can then be assembled by themselves or in combination with sequence data from other instruments.
"The idea is [that] we have one of these long reads with an error about every five or six bases on average [and] because the errors are actually randomly distributed along the read, there'll be regions where there won't be a mistake for ten or fifteen or even twenty bases," Michael Schatz, an assistant professor at CSHL and an author on the paper, explained to BioInform. "We can use that as a way to guide which short reads align there," he said.
Using the modified version of the Celera Assembler, the team mapped multiple short reads to each long read, generating a "mini assembly ... that computes the consensus sequence from all those data, [which] becomes our error-corrected long read," he explained.
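The "mini assembly" idea can be illustrated with a toy consensus step. The sketch below is not the PBcR implementation or the Celera Assembler's method: it assumes the short reads have already been placed on the long read at known offsets, with no gaps, and it handles substitutions only. The function name and the example sequences are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def correct_long_read(long_read: str, placements: List[Tuple[int, str]]) -> str:
    """Majority-vote consensus over short reads placed at fixed offsets.

    Toy model: ungapped placements, substitutions only. Positions with no
    short-read coverage keep the long read's original base.
    """
    # One vote counter per position of the long read.
    columns = [Counter() for _ in long_read]
    for offset, short_read in placements:
        for i, base in enumerate(short_read):
            pos = offset + i
            if 0 <= pos < len(long_read):
                columns[pos][base] += 1

    corrected = []
    for original, votes in zip(long_read, columns):
        if votes:
            # Most frequent short-read base wins; ties go to the base seen first.
            corrected.append(votes.most_common(1)[0][0])
        else:
            corrected.append(original)
    return "".join(corrected)


if __name__ == "__main__":
    noisy = "ACGTTCGATA"  # hypothetical long read with one wrong base at index 4
    shorts = [(0, "ACGTAC"), (2, "GTACGA"), (4, "ACGATA")]
    print(correct_long_read(noisy, shorts))  # prints ACGTACGATA
```

A production tool also has to find the placements in the first place, trim uncovered ends, and resolve insertions and deletions, none of which this sketch attempts.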
Speaking with BioInform this week, Eric Schadt, PacBio's chief scientific officer and chair of the department of genetics and genomic sciences at Mount Sinai School of Medicine, said the firm has long advocated the adoption of the hybrid long read/short read approach as a way to generate complete assemblies. He added that PacBio has used components developed by some of the authors of the CSHL paper in its internal error-correction pipelines.
He described the Koren et al. paper as a "good formalization of things we had done and a great comparison of how [hybrid assemblies] really make a difference."
In addition to incorporating the PBcR algorithm, which the company is calling pacBioToCA, PacBio is improving some of its current de novo assembly algorithms, including its consensus-calling algorithms, so that users can assemble new genomes without performing error-correction steps, Bingham said.
"We are working on how to improve the accuracy with that type of assembly so that ... once you pile up, say, 30x coverage in your de novo assembly, that the consensus sequence that you take for your assembly is as close to the truth as possible," he explained.
The company is also working on improving its algorithms for resequencing and SNP validation, he said.
"We're working to make variant calling even better, which we believe we can do using an approach similar to the Broad's haplotype caller, which performs a local realignment around candidate variable sites," he explained.
In addition to informatics efforts, PacBio is also making improvements to its sequencing instrumentation that will allow it to generate long reads with accuracy that is comparable to reads generated by Sanger sequencing, Bingham said.
PacBio offers a circular consensus mode that can provide 99 percent accuracy on templates up to 1,000 bases, but beyond that "we are now getting to a single pass on a single molecule," so the accuracy in that range is 85 percent to 87 percent, Bingham said. "We are working to push that out so that you'll be able to get, say, 2,000 bases at Sanger-level accuracy and even beyond that."
He stressed that it is "only for these very long reads that are especially useful for de novo assembly where there is this tradeoff between accuracy and length."
Bingham said the company is addressing this issue by improving the way its SMRT cells are loaded so that more of the larger fragments are loaded onto the instrument.
That "increases the average read lengths that you get and that helps you to get more and longer of these very accurate reads," he said. In the context of de novo assemblies, "what that would allow you to do is get more of the very long reads on the order of 3,000 or 5,000 or 8,000 bases."
PacBio is also working on informatics tools for detecting compound mutations and rare variants, as well as tools for 16S and metagenomics data analysis, Bingham said.
Although he could not give a definite timeline, he said that the company expects most of its planned improvements to be available in a matter of months.
However, he noted that researchers can begin using the PBcR and AHA approaches described in the Nature Biotech papers right away.
Schatz and Schadt both said that the hybrid assembly-based approach for PacBio reads is quite straightforward in principle, but conceded that the variety of data required and the additional steps in the assembly process make it a bit more complicated than assembling reads from a single system.
Schadt noted that although the methods discussed in both papers require more steps than some current assembly pipelines, "it's nicely packaged into these different tools that any person skilled in rudimentary bioinformatics would be able to put together" and apply to their data.
Schatz acknowledged that hybrid assemblies might require a larger initial investment than an assembly generated on a short-read instrument alone, but noted that in the long run the hybrid approach may be more cost-effective if it produces a better-quality assembly.
He and his co-authors note in their paper that for the S. cerevisiae genome, an assembly using 13x PacBio data and 50x Illumina data was "comparable" to an assembly generated from 100x paired-end Illumina data. "The corrected PacBio sequences also generated a more accurate assembly" than the 100x Illumina data, they said. The same was true in the case of 454 data, where a 25x PBcR assembly for E. coli "tripled the N50 of the 50x 454 assembly."
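N50, the contiguity metric quoted above, is the contig length at which contigs of that size or longer account for at least half of the assembled bases. The following minimal sketch computes it; the contig lengths in the example are invented for illustration and are not from the paper.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of an assembly: walking contigs from longest to shortest,
    the length of the contig at which the running total first reaches
    half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return lengths[-1]  # not reached for positive lengths


if __name__ == "__main__":
    fragmented = [100, 80, 60, 40, 20]  # hypothetical contig lengths
    connected = [180, 100, 20]          # same total bases, fewer and longer contigs
    print(n50(fragmented))  # prints 80
    print(n50(connected))   # prints 180
```

The two example assemblies contain the same number of bases, but joining contigs raises the N50, which is why the metric is used as a shorthand for how connected an assembly is.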
"You have to look at the whole picture," Schatz said. "Yes, the sequencing costs more expensive with PacBio upfront but if that enables you to save labor in terms of doing directed finishing, or if the assembly is much more connected such that the genes or pathways that you are trying to study are in a single contig or in a single scaffold instead of scattered into lots of little pieces ... I don’t know what the dollar figure that result is worth but I think that part is often lost in the equation."