CSHL-Led Team Develops De Novo Genome Assembly Method with Error-Corrected PacBio Reads


By Andrea Anderson

Using a tailored error-correction approach and a hybrid assembly method that incorporates reads from multiple sequencing platforms, researchers have developed a pipeline for doing de novo genome assembly using long reads generated on the Pacific Biosciences RS instrument.

At the Plant and Animal Genomes meeting in San Diego last week, Cold Spring Harbor Laboratory researcher Michael Schatz presented data on de novo assemblies for several microbial species sequenced with the PacBio RS and other platforms, showing that hybrid assemblies containing error-corrected PacBio long reads assembled with a modified version of the Celera Assembler can produce contigs of a million bases or more.

Similarly, when the researchers compared de novo assemblies of the parrot genome, sequenced using Illumina, Roche 454, and PacBio platforms, they found that they got more continuous assemblies, longer contigs, and greater contig N50 sizes for assemblies done using a combination of PacBio and other read types than they did for de novo genome assemblies done without the PacBio long reads.

With Illumina sequence data alone, the parrot genome assembly "came out pretty good," Schatz told In Sequence. "But then we demonstrated that adding in even modest coverage with PacBio doubled the continuity of the assembly."

In general, genomes are assembled by stitching together sequence reads at sites where these reads overlap one another, Schatz explained during a presentation at PAG. Once a basic assembly graph has been cobbled together, it can then be simplified and detangled using additional data from long reads, mate pairs, genetic markers, and more.
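The overlap step Schatz described can be illustrated with a minimal sketch. The code below finds the longest suffix of one read that exactly matches a prefix of another, then merges the two reads. The function names, the `min_length` parameter, and the example reads are illustrative assumptions, not part of the Celera Assembler, and real assemblers tolerate mismatches rather than requiring exact matches.

```python
from typing import Optional


def longest_overlap(a: str, b: str, min_length: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_length` bases exists.
    """
    max_len = min(len(a), len(b))
    # Try the longest candidate overlap first, then shrink.
    for length in range(max_len, min_length - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def merge_reads(a: str, b: str, min_length: int = 3) -> Optional[str]:
    """Stitch `b` onto the end of `a` if they overlap, else return None."""
    length = longest_overlap(a, b, min_length)
    if length == 0:
        return None
    return a + b[length:]


# Hypothetical reads that share the four bases "TGCA".
read_1 = "ACGTTGCA"
read_2 = "TGCAGGTC"
print(longest_overlap(read_1, read_2))
print(merge_reads(read_1, read_2))
```

The `min_length` threshold matters because very short matches between reads occur by chance, which is one reason short reads struggle in repetitive regions.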

But a number of instrumentation, computational, and accuracy factors — not to mention potentially confounding biological features such as ploidy, heterozygosity, or the presence of repeat sequences — can affect the quality of genomes assembled de novo in this manner, Schatz said.

For example, low coverage produces a fragmented assembly, since some parts of the genome are not covered enough to prevent gaps between reads. Sequence errors, in turn, can obscure the read overlaps needed for assembly.
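The link between coverage and gaps can be made concrete with the idealized Lander-Waterman model, which was not mentioned in the talk and is used here only as an assumption. If reads land uniformly at random, the expected fraction of the genome that no read touches is roughly e^(-c), where c is the average coverage. The function names and the read counts below are hypothetical.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def expected_uncovered_fraction(c: float) -> float:
    """Expected fraction of bases hit by no read, assuming uniformly
    random read placement (idealized Lander-Waterman model)."""
    return math.exp(-c)


# Hypothetical run: 100,000 reads of 1,000 bases on a 10-million-base genome.
c = coverage(100_000, 1_000, 10_000_000)
print(f"coverage: {c:.1f}x")
print(f"expected uncovered fraction: {expected_uncovered_fraction(c):.2e}")

# Lower coverage leaves far more of the genome untouched.
for depth in (1.0, 3.75, 15.4):
    print(depth, expected_uncovered_fraction(depth))
```

This simple model ignores the minimum overlap an assembler needs to join two reads, so real assemblies fragment at somewhat higher coverage than the formula alone suggests.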

Getting over stretches of long repeats in the genome, meanwhile, requires long reads. Reads that are too short to span an entire repeat can produce false overlaps between reads that come from different copies of that repeat.

Because the sequencing instruments that are currently available vary in their read length, throughput, sequence accuracy, and error profiles, the platform used for sequencing can have a profound influence on the quality of de novo assembly that's possible for a given genome.

So far, no single sequencing platform excels on all of the fronts needed for an ideal assembly, Schatz explained. For example, while Illumina short reads offer high throughput and sufficient coverage to ward off fragmented genome assemblies, they do not generate reads long enough to span some repeats.

On the other hand, platforms such as PacBio can rapidly generate reads that are thousands of base pairs long and have uniform sequence coverage, with no apparent GC-bias.

But with raw read error rates hovering around 15 percent, PacBio can produce muddled assemblies owing to errors that obscure authentic sequence overlaps. The problem is further complicated by the fact that many of the errors in PacBio reads are insertions and deletions rather than relatively straightforward substitution errors.
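A small sketch shows why insertions and deletions are harder to handle than substitutions. A single inserted base shifts every downstream position, so a position-by-position comparison reports many mismatches even though only one edit separates the sequences. The sequences and function names below are made up for illustration.

```python
def positional_mismatches(a: str, b: str) -> int:
    """Count mismatches when comparing position by position (no gaps)."""
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x from a
                cur[j - 1] + 1,          # insert y into a
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = cur
    return prev[-1]


truth = "ACGTACGTAC"
with_substitution = "ACGAACGTAC"  # one base changed
with_insertion = "ACGGTACGTAC"    # one extra G inserted

for label, read in [("substitution", with_substitution),
                    ("insertion", with_insertion)]:
    print(label,
          "positional mismatches:", positional_mismatches(truth, read),
          "edit distance:", edit_distance(truth, read))
```

Both reads are one edit away from the true sequence, but only a gap-aware comparison recognizes that for the read with the insertion.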

At PAG, for instance, Schatz presented results for a 12 million base pair yeast genome that he and his colleagues sequenced using the PacBio RS instrument. Based on data for 100,000 PacBio reads, they found that the reads were 83.7 percent accurate overall, with 11.5 percent insertion errors, 3.4 percent deletion errors, and 1.4 percent mismatch errors.
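The yeast figures are internally consistent if overall accuracy is taken as 100 percent minus the summed error rates. That relationship is an assumption about how the number was derived, not something stated at PAG. The short calculation below checks it and also works out what share of the errors are insertions or deletions.

```python
# Error rates reported for the yeast PacBio reads, in percent.
error_rates = {"insertion": 11.5, "deletion": 3.4, "mismatch": 1.4}

# Assumption: accuracy = 100 - total error rate.
total_error = sum(error_rates.values())
accuracy = 100.0 - total_error

# Share of all errors that are insertions or deletions.
indel_share = (error_rates["insertion"] + error_rates["deletion"]) / total_error

print(f"total error: {total_error:.1f}%")
print(f"accuracy: {accuracy:.1f}%")
print(f"indel share of errors: {indel_share:.0%}")
```

Under that assumption, insertions and deletions together account for roughly nine in ten of the reported errors.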

Given the advantages and disadvantages associated with different sequencing platforms, Schatz explained, the best strategy for doing de novo genome assembly with PacBio long read data, at least for the time being, is to combine error-corrected PacBio reads with reads generated on other platforms, with circular consensus reads from the PacBio instrument itself, or with both (IS 7/12/2011).

Schatz credits Sergey Koren from the University of Maryland and the National Biodefense Analysis and Countermeasures Center with coming up with the error-correction method at the heart of the team's PacBio assembly pipeline.

The "pacBioToCA" script helps correct PacBio errors by mapping shorter reads to long reads, trimming long reads at coverage gap sites, and computing the consensus sequence for each long read.
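The three steps in that description can be sketched in a heavily simplified form. The code below is a toy illustration, not the pacBioToCA implementation: it assumes the short reads have already been mapped and are supplied as hypothetical (start position, sequence) pairs in long-read coordinates, it places them without gaps, and so it can only fix substitutions. The real pipeline relies on alignments that also handle insertions and deletions. The function name and the `min_coverage` parameter are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def correct_long_read(long_read: str,
                      placements: List[Tuple[int, str]],
                      min_coverage: int = 2) -> List[str]:
    """Toy correction: majority vote of short reads at each position.

    The long read's own bases are ignored in the vote; only its length
    is used. Positions with fewer than `min_coverage` short-read bases
    are treated as coverage gaps, and the read is split there.
    """
    # One tally of observed short-read bases per long-read position.
    columns = [Counter() for _ in long_read]
    for start, short_read in placements:
        for offset, base in enumerate(short_read):
            pos = start + offset
            if 0 <= pos < len(long_read):
                columns[pos][base] += 1

    segments: List[str] = []
    current: List[str] = []
    for column in columns:
        if sum(column.values()) >= min_coverage:
            # Consensus base: the most frequent short-read base.
            current.append(column.most_common(1)[0][0])
        elif current:
            # Coverage gap: close the current corrected segment.
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return segments


# Hypothetical long read with a wrong base at index 4 (T instead of A).
noisy = "ACGTTCGATTGACC"
short_reads = [(0, "ACGTACG"), (1, "CGTACGA"), (2, "GTACGAT"),
               (9, "TGACC"), (9, "TGACC")]
print(correct_long_read(noisy, short_reads))
```

In this example the poorly covered position between the two groups of short reads splits the long read into two corrected pieces, mirroring the idea of trimming at coverage gaps.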

"The pacBioToCA script is a correction pipeline to enable the use of long-read sequences produced by the PacBio RS instrument," developers wrote on a SourceForge site describing the method. "To algorithmically deal with the error, we require alternate high-identity sequences (454, Illumina, or PacBio circularized sequences)."

In contrast to an error-correction method such as Quake, which is specialized for dealing with the sort of substitution errors that predominate in Illumina reads, the new method also corrects for insertions and deletions, which make up the majority of errors in PacBio sequences.

As Schatz reported at PAG, for example, researchers saw a sharp jump in both sequence coverage and identity when they applied the error-correction method to sequence data for the K12 Escherichia coli strain that was sequenced to 20 times coverage with the PacBio instrument and to 50 times coverage using the Illumina HiSeq2000.

For the same E. coli K12 strain, the team found that it could get much longer contigs — up to or beyond one million bases each — by doing assemblies with a combination of 50x error-corrected PacBio reads and 50x Illumina reads.

A hybrid assembly approach also proved useful for doing de novo assembly of the parrot genome, Schatz reported at PAG.

From 3.75x error-corrected PacBio coverage and 15.4x Roche 454 GS FLX and GS FLX+ coverage of the parrot genome, the team assembled a de novo parrot genome that is nearly 1.1 billion bases long, with a maximum contig length of more than 1.1 million bases and a contig N50 size of 99,573 nucleotides.

In contrast, the contig N50 for a parrot assembly based exclusively on Illumina reads covering the genome at around 194x was just shy of 50,000 nucleotides. From 15.4x Roche 454 coverage, meanwhile, researchers got an assembly with a contig N50 size of around 75,000 nucleotides.
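For readers unfamiliar with the metric, contig N50 is the length such that contigs of that length or longer contain at least half of all assembled bases. A minimal sketch of the calculation follows. The function name and the contig lengths are invented for illustration and are not taken from the parrot assemblies.

```python
from typing import Iterable


def contig_n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (0 for an empty set)."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    # Walk from the longest contig down until half the bases are covered.
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return 0


# Hypothetical assembly: 300 bases total, so the halfway mark is 150.
print(contig_n50([100, 80, 60, 40, 20]))  # 100 + 80 = 180 >= 150, so N50 is 80
```

Because the metric weights long contigs heavily, joining a few mid-sized contigs with long reads can raise N50 substantially without changing the total assembly size.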

To do their hybrid de novo PacBio assemblies, the team developed a pipeline that included a version of the open source Celera Assembler software — first developed more than a decade ago to assemble reads generated using the Sanger approach — that was retrofitted to work hand in hand with the new PacBio error correction approach.

"We started with Sanger sequences and there was a lot of theory built around that," Schatz noted. "With the introduction of short reads, brand new theory had to be developed. But now we can kind of go back to some of the old methods and it works really well for these types of data."

The team wrote a new front-end, error-correction module for the Celera Assembler and is also borrowing from a suite of open source assembly tools known as AMOS. Within the internal workings of the assembler itself, though, the researchers had to make only fairly minor tweaks to accommodate the very long reads generated on the PacBio platform.

"It was already quite good for 1,000 base pairs reads and a lot of work had already been put in to support 454 and Illumina reads," Schatz explained. "So just adding in these very, very long reads was relatively straightforward."

Though such de novo assemblies may not be necessary for resequencing studies of organisms for which reference genomes are already available, the long reads offer an advantage for assembling the genomes of organisms without an existing reference or in situations where a finished, high-quality assembly is needed, Schatz explained.

"In practice, a lot of the sequencing today is human resequencing and things of that nature, where you can get really far using an Illumina shorter read technology," he said.

"The time that you really need the long reads is if it's a genome where there's no reference sequence, if you're interested in large-scale structural variations, or if you're interested in finishing — for instance, in a forensics type environment where you need to know every base."

Have topics you'd like to see covered in In Sequence? Contact the editor at anderson [at] genomeweb [.] com.