In a study testing the ability of sequencing platforms to sequence and assemble genomes of Escherichia coli, researchers from Quintiles found that assemblies generated solely from Pacific Biosciences RS data were more complete than those generated from Illumina's MiSeq, Life Technologies' Ion Torrent PGM, or a combination of PacBio data with data from one of the two short read platforms.
Writing in BMC Genomics earlier this month, the researchers said that not only was the PacBio RS able to generate the most complete and accurate de novo assemblies of E. coli strains, but that "the addition of other sequencing technology data offered no improvements over use of PacBio data alone." Additionally, PacBio sequence data enables base modification detection without having to do a separate experiment, as is required with other next-gen systems.
Lead author Jason Powers, an associate manager of bioinformatics client services at Quintiles' Expression Analysis, told In Sequence that the goal of the study was to do a "broad evaluation" of sequencing platforms and assembly techniques. "When thinking about a project like genome assembly, there are a lot of choices and investments that have to be made before beginning — different sequencing machines, run modes, software, assembly strategies," he said.
The firm chose three common sequencing platforms for microbial sequencing and four different assembly strategies. "We couldn't look at all permutations available," Powers said, "but these are the primary options."
The researchers evaluated the MiSeq and PGM platforms for sequencing and de novo assembly individually with three different assemblers: Velvet, Ray, and MIRA. They evaluated PacBio RS de novo assembly using the Celera assembler.
They then evaluated hybrid methods — de novo assembly with either the MiSeq or PGM, followed by scaffolding with PacBio RS reads, and de novo assembly with PacBio RS reads followed by error correction with either the MiSeq or PGM.
PacBio-only sequencing and assembly produced the most complete and most accurate genomes, the authors reported. For instance, in one case, using reads only from PacBio sequencing resulted in 21 contigs, compared to 31 contigs and 49 contigs from hybrid assemblies with PGM and MiSeq, respectively.
Additionally, "PacBio-only assemblies were not only more complete than the hybrid-assemblies, but largely more accurate," the authors wrote. The PacBio-only assembly contained 14 SNPs, compared to 90 or more SNPs in the hybrid methods.
Powers said that he was somewhat surprised that the hybrid assembly approaches did not offer any improvement. "You would think that combining multiple sources of data would be beneficial, that using more data would produce more complete assemblies," he said. Each of the platforms has its own advantages and disadvantages, so he said he was surprised that combining them was not able to take advantage of the platforms' unique abilities.
For instance, while the RS has the longest read lengths, the MiSeq and PGM simply produce a lot more data.
However, Powers said that PacBio's new bioinformatics software, which enables self-correction of its reads, appeared to be as good as, if not better than, error correction with short read sequencing from the PGM or MiSeq. Additionally, he said that the random error profile of the PacBio machine likely also contributed to the system's accuracy. "The error mode is random, which is important because if you have enough sequence, the error profile gets washed away." For instance, he said, if there are 10 different reads, each with errors, those errors will all be in different places, canceling each other out.
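Powers' point about random errors washing out can be illustrated with a toy simulation. The Python sketch below is not the study's pipeline or PacBio's software. It assumes reads that are already aligned and the same length, substitution errors only (real reads also contain insertions and deletions and must be aligned first), and an illustrative 15 percent per-base error rate chosen for the example, not taken from the study. All function names are hypothetical.

```python
import random
from collections import Counter

BASES = "ACGT"


def noisy_copy(reference, error_rate, rng):
    """Copy `reference`, substituting each base independently with
    probability `error_rate` (random, position-independent errors)."""
    out = []
    for base in reference:
        if rng.random() < error_rate:
            # Replace with one of the three other bases, chosen at random.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def consensus(reads):
    """Majority vote at each position across equal-length, aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )


def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def simulate(depth=10, length=2000, error_rate=0.15, seed=0):
    """Return (mean errors per single read, errors in the consensus).

    depth=10 mirrors the "10 different reads" in Powers' example; the
    length and error rate are illustrative assumptions.
    """
    rng = random.Random(seed)
    reference = "".join(rng.choice(BASES) for _ in range(length))
    reads = [noisy_copy(reference, error_rate, rng) for _ in range(depth)]
    per_read = sum(mismatches(reference, r) for r in reads) / depth
    return per_read, mismatches(reference, consensus(reads))


if __name__ == "__main__":
    per_read, in_consensus = simulate()
    print(f"mean errors per read: {per_read:.1f}")
    print(f"errors in consensus:  {in_consensus}")
```

Because each read's errors land in different places, the correct base wins the vote at nearly every position, so the consensus carries far fewer errors than any single read. A systematic error that hit the same position in every read would not cancel this way, which is the distinction Powers draws.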
As with many platform comparison studies, the technology had already advanced by the time the study was published.
For instance, Mike Lelivelt, director of bioinformatics and software products at Ion Torrent, told IS that the study did not take into account de novo assembly using Ion Torrent mate pair sequencing, and that, since publication, the PGM's read lengths have increased to 400 bases from the 200 base reads that were used in the study.
Coupling 400-base reads with long-insert mate pair sequencing would have dramatically increased performance of the PGM de novo assembly, Lelivelt said.
Additionally, the researchers did not take into account cost, Lelivelt said. The authors "are not acknowledging the capital resources involved and the space required to run [the PacBio system]," he said. The PacBio RS sells for a list price of $695,000, while the PGM runs around $50,000 not including ancillary equipment, and the MiSeq lists at $125,000.
Illumina declined to comment on the study.
Powers acknowledged that the study did not examine cost, although a recent study in Genome Biology by researchers from the University of Maryland estimated that microbial genomes could be sequenced and assembled on the PacBio RS for around $1,000. That price, however, did not take into account initial investment costs in the platform.
Additionally, Jonas Korlach, PacBio's CSO, told IS that compared to hybrid methods, PacBio-only sequencing is more cost efficient because "you don't need to run as many sequencing technologies or make as many libraries."
Korlach agreed with Lelivelt that because technology is advancing so quickly, published studies often evaluate outdated technology. However, he said that PacBio has also improved since the study's publication. The BMC Genomics study used PacBio's C2 chemistry, "which has now been largely replaced," with a version that enables "complete assemblies at QV50 with just one SMRT cell," he said.
Powers said that even if the study were redone today using all the companies' latest improvements, he thought the overall results would stay the same, since PacBio still has an advantage over MiSeq and PGM in terms of the characteristics that enabled it to produce the more complete and accurate microbial genomes — its long read lengths and random error profile.
"Certainly all the machines produce longer sequence reads than what we evaluated here," he said. PGM's reads have jumped to 400 bases from 200 bases and MiSeq's reads have increased to 300 bases from 150 bases.
"I don't think that's going to change the results. PacBio reads are now twice as long as what we used here, so we will get even more complete assemblies," Powers said, adding that some groups have reported assembling microbial genomes into a single contig using only PacBio data.
Despite his concerns, Lelivelt said the study addresses an "important topic" — that it is not only the sequencing platform itself that plays a role in the quality and completeness of de novo assemblies. "Algorithms aren't always optimized for various platforms," he said, which can lead to the assembly process itself having a greater effect on the data than the sequencing system.
For instance, the authors acknowledged that performing de novo assembly of the PGM data with the Velvet assembler resulted in poor performance because Velvet was "originally published using Solexa data, and the poor performance with Ion Torrent data is likely due to its inability to cope with the Ion Torrent error profile." On the other hand, a different assembler, MIRA, was able to produce a more complete assembly with PGM data, but "struggled with the MiSeq data."
"You have to acknowledge that there's as much variability in the informatics as in the sequencing platforms themselves," Lelivelt said. "It's a complex issue," he added, and "the informatics really matter."