In Comparison of De Novo Assembly of Bacterial Genomes, PacBio System Comes Out On Top


NEW YORK (GenomeWeb) – Comparing four next-generation sequencing platforms, researchers from Osaka University have found that Pacific Biosciences' RS provides the most suitable platform to generate finished-grade genome assemblies of bacteria.

Reporting in the journal BMC Genomics in August, the team compared the abilities of the PacBio RS, Illumina's MiSeq, Thermo Fisher's Ion Torrent PGM, and Roche's 454 GS Junior to sequence de novo the bacterium Vibrio parahaemolyticus, which has a 5 Mb genome comprising two circular chromosomes. Using the PacBio long reads, they were able to assemble the genome into two contigs, each of which represented one chromosome, while the other NGS systems produced much more fragmented assemblies.

Shota Nakamura, an assistant professor at Osaka University's laboratory of genome informatics, told In Sequence that his lab is now using the PacBio RS for bacterial sequencing and assembly, particularly for species without a reference genome. For bacteria with a reference, he said the Illumina MiSeq is more "appropriate in terms of cost and accuracy."

He added that the lab is now also working on using PacBio "to get the finished-grade genome of higher organisms, which have multiple chromosomes and highly repetitive sequences."

In the study, the researchers compared the four platforms in their ability to sequence and assemble a 5 Mb bacterial genome. For the GS Junior and PGM sequence data, they used the Newbler assembler. They used CLC to assemble reads from the MiSeq system and Sprai, an algorithm developed in-house, for the PacBio reads.

Using PacBio, the team generated 31 contigs from reads representing 73-fold coverage of the genome. However, they were able to cover the genome with just two of those contigs, each of which represented a single chromosome: one of approximately 3.3 Mb and another of 1.9 Mb. The mean read length was over 3 kb.

The optimized results from the PGM using the 318 chip produced an assembly consisting of 61 contigs, assembled from reads representing 77-fold coverage. The contig N50 was 392 kb and the longest contig was 895 kb. With MiSeq, the researchers assembled the genome into 34 contigs from 58-fold coverage. The contig N50 was 431 kb and the longest contig was 733 kb. The GS Junior assembly had 309 contigs from 9-fold coverage. The contig N50 was 30 kb and the longest contig was 165 kb.
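For readers unfamiliar with the contig N50 statistic quoted above, the following is a minimal Python sketch of how it is commonly computed: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. The function name `n50` and the example lengths are illustrative assumptions, not values or code from the study.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length of the contig at which the cumulative sum of
    lengths, sorted from longest to shortest, first reaches at least
    half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")

    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in kb (not taken from the paper).
example_lengths_kb = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths_kb))
```

A higher N50 indicates that more of the assembly sits in long contigs, which is why the fragmented GS Junior assembly scores far lower than the others.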

Nick Loman, a bioinformatician at the Centre for Systems Biology at the University of Birmingham, who has also compared NGS platforms, said that the paper is a "useful comparison of de novo assemblies," and that the "results are consistent with what we know about the platform capabilities."

For instance, a study published in BMC Genomics last year similarly found that the PacBio RS produced a more complete assembly of the Escherichia coli genome than the MiSeq, the PGM, or even a hybrid approach.

Loman noted, however, that another factor to consider is flexibility of the instrument, such as the ability to multiplex many samples. In addition, "instrument cost and practicality is important. A PacBio [system] is a large-scale investment for most labs, whereas a MiSeq or an Ion Torrent are cheaper options," he said.

The researchers estimated the cost per gigabase at $1,800 for the PacBio RS, $437 for the PGM, and $93 for the MiSeq. However, PacBio's Chief Scientific Officer Jonas Korlach noted that since the study was done, several improvements to the sequencing chemistry have increased the throughput of each SMRT cell. "Now, using the P5/C3 chemistry, we can get even longer reads and higher throughput," Korlach said, meaning fewer SMRT cells would be needed, reducing the cost.
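To put the per-gigabase figures in context, the rough sketch below scales them to the amount of sequence each platform used for the roughly 5 Mb genome. It assumes cost scales linearly with the bases sequenced and that the fold coverage reported in the study reflects the total data generated; the resulting dollar figures are back-of-the-envelope arithmetic, not numbers reported by the researchers. The names `GENOME_SIZE_BP` and `cost_for_coverage` are illustrative.

```python
GENOME_SIZE_BP = 5_000_000  # approximate genome size cited in the article

# (cost per gigabase in USD, fold coverage used in the study)
platforms = {
    "PacBio RS": (1800, 73),
    "PGM": (437, 77),
    "MiSeq": (93, 58),
}


def cost_for_coverage(cost_per_gb, fold_coverage, genome_size_bp=GENOME_SIZE_BP):
    """Estimate sequencing cost, assuming cost is linear in bases sequenced."""
    gigabases = fold_coverage * genome_size_bp / 1e9
    return cost_per_gb * gigabases


for name, (cost_per_gb, coverage) in platforms.items():
    estimate = cost_for_coverage(cost_per_gb, coverage)
    print(f"{name}: about ${estimate:,.0f} for {coverage}-fold coverage")
```

Under these assumptions the PacBio data would cost several hundred dollars per genome, against tens of dollars for the MiSeq, which matches the direction of the comments from Loman and Lelivelt.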

Mike Lelivelt, director of bioinformatics and software products at Thermo Fisher, also noted that PGM metrics have improved, including the availability of 400 bp reads. He also said that using the SPAdes assembly algorithm, rather than Newbler, would have resulted in a better de novo assembly. That is the assembler the company currently recommends for de novo sequencing, he said, because it has good error correction and can take into account both fragment read libraries and long mate pair libraries.

Like Loman, he said that while it is "clear that long reads are helpful, the PacBio is expensive both to buy and to run."

Illumina declined to comment on the study.

Error counts differed among the assemblies. The PacBio assembly composed of just the two contigs had 157 mismatches and 798 indels. The assembly using all 31 contigs had more errors overall: 389 mismatches and 715 indels.

The MiSeq assembly had 230 mismatches and 184 indels, the PGM assembly had 108 mismatches and 2,853 indels, and the GS Junior had 133 mismatches and 824 indels.
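The mismatch and indel counts above are easier to compare when totaled and normalized. The sketch below sums them per assembly and expresses the total as errors per 100 kb. It assumes a flat 5 Mb genome as the denominator for every assembly, which is a simplification; the paper's own normalization may differ, and the names used here are illustrative.

```python
GENOME_SIZE_BP = 5_000_000  # approximate genome size; a simplifying assumption

# (mismatches, indels) per assembly, as reported in the article
error_counts = {
    "PacBio (2 contigs)": (157, 798),
    "PacBio (31 contigs)": (389, 715),
    "MiSeq": (230, 184),
    "PGM": (108, 2853),
    "GS Junior": (133, 824),
}


def errors_per_100kb(mismatches, indels, genome_size_bp=GENOME_SIZE_BP):
    """Total mismatches plus indels, scaled to errors per 100 kb of genome."""
    return (mismatches + indels) * 100_000 / genome_size_bp


for name, (mismatches, indels) in error_counts.items():
    total = mismatches + indels
    rate = errors_per_100kb(mismatches, indels)
    print(f"{name}: {total} errors, {rate:.2f} per 100 kb")
```

On this rough measure the MiSeq assembly has the fewest errors and the PGM the most, driven almost entirely by indels.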

The proportion of the genome covered by each assembly was 99.848 percent and 99.999 percent for the two-contig and 31-contig PacBio assemblies, respectively; 98.499 percent for the MiSeq; 98.290 percent for the PGM; and 97.844 percent for the GS Junior. In addition, the PacBio assembly had 10 misassemblies, defined as the total number of relocations, inversions, and translocations. The GS Junior and PGM both had zero, while the MiSeq had one misassembly.

Nakamura said that with regard to accuracy, a key metric is the percentage of bases that align to the reference, and the PacBio assembly had the highest aligned proportion. In addition, "further parameter tuning may improve the accuracy of PacBio sequencing," he said.

Korlach said that the paper further validates the importance of long reads for generating finished genomes. "There is a perception that you can overcome shortcomings of short reads with higher coverage," he said, but this study shows that "in some cases, the assemblies got worse as you added more data." By contrast, "with PacBio, you have such long reads, that you need less coverage."

Other researchers have previously reported on their use of a hybrid approach to genome assembly, using the long reads of PacBio to generate long contigs and Illumina data to increase coverage and accuracy of the reads.

Nakamura said that his lab has looked at this option, but that for bacterial sequencing, PacBio-only assembly is suitable and there is no need for a hybrid assembly. "In a case where we do not have enough DNA or the size of the genome is huge, hybrid assembly would be an option," he said.