A recently published performance comparison between the Ion Torrent PGM, Illumina MiSeq, and Pacific Biosciences PacBio RS sequencers by the Wellcome Trust Sanger Institute found key differences between the platforms in terms of data quality and applications supported.
The results, however, are already outdated as several of the platforms have been updated since the data were generated last year. For example, the PacBio data described in the paper used the company's C1 chemistry, which provided shorter reads and lower throughput than the current C2 chemistry, and a new sequencing reagent kit is now available for the PGM.
"The limited yield and high cost per base currently prohibit large scale sequencing projects on the Pacific Biosciences instrument," the scientists concluded, while "the PGM and MiSeq are quite closely matched in terms of utility and ease of workflow."
In their comparison, the MiSeq excelled in terms of raw read accuracy, but the PGM called the greatest percentage of SNPs correctly.
Michael Quail, who heads the sequencing R&D group at the Sanger Institute, told In Sequence this week that his team conducted the comparison both for its own interests and "on behalf of the smaller labs that would like to choose just one of the currently available platforms." He is the lead author of the study, which was published online in BMC Genomics last week.
For their comparison, the scientists sequenced a set of four microbial genomes, with a GC content ranging from 19 percent to 68 percent, on all three sequencing platforms. The Sanger Institute uses these genomes routinely to test new sequencing technologies because they "represent the range of genomic landscapes that one might encounter," the authors noted.
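As a rough illustration of what the GC content figures above mean, the sketch below computes the fraction of G and C bases in a sequence. The function name `gc_content` and the toy sequences are assumptions made for this example; they are not taken from the paper or from any Sanger pipeline.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among the unambiguous bases A, C, G, T."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    # Ambiguous bases such as N are ignored in both numerator and denominator.
    return gc / total


# Toy sequences for illustration only, not real genome data.
at_rich_example = "ATATTTAAGATATTA"
gc_rich_example = "GCGGCCGATGCCGGC"
print(f"AT-rich toy sequence: {gc_content(at_rich_example):.0%} GC")
print(f"GC-rich toy sequence: {gc_content(gc_rich_example):.0%} GC")
```

Applied to a whole genome, the same calculation would yield figures comparable to the 19 percent to 68 percent range quoted for the four test organisms.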
All the data were generated in-house in the latter half of 2011, Quail said.
For the MiSeq, the researchers used two different library prep protocols — one including an amplification step, the other one PCR-free — because the new PCR-free approach results in more even sequence coverage. They sequenced the libraries with paired 150-base reads.
For the PGM, they also used two library prep kits: the standard library kit, which involves physical shearing of the DNA, and the Ion Xpress Fragment Library Kit, which fragments the DNA enzymatically. Ion Torrent libraries were run on the Ion 316 chip for 65 cycles, generating mean read lengths of about 120 base pairs.
For the PacBio, they used standard PacBio libraries with an average of 2-kilobase inserts, which were run over multiple SMRT cells, using the C1 chemistry.
After "downsampling" the sequencing datasets, so that they each represented an average genome coverage of 15x, they mapped the reads to the four microbial reference genomes.
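The article does not say how the downsampling was carried out. One common approach, sketched here under that assumption, is to keep each read with a probability equal to the target coverage divided by the current coverage. The function names and the fixed random seed are hypothetical choices for this example.

```python
import random


def downsample_fraction(total_read_bases: int, genome_size: int,
                        target_coverage: float) -> float:
    """Fraction of reads to keep so that mean coverage is about target_coverage."""
    if total_read_bases <= 0 or genome_size <= 0:
        raise ValueError("total_read_bases and genome_size must be positive")
    current_coverage = total_read_bases / genome_size
    # Never keep more than all reads if the dataset is already below the target.
    return min(1.0, target_coverage / current_coverage)


def downsample_reads(reads, genome_size: int, target_coverage: float, seed: int = 0):
    """Keep each read independently with the probability computed above."""
    total_bases = sum(len(read) for read in reads)
    fraction = downsample_fraction(total_bases, genome_size, target_coverage)
    rng = random.Random(seed)  # seeded so the subsample is reproducible
    return [read for read in reads if rng.random() < fraction]


# A 3-megabase genome sequenced to 60x needs a quarter of its reads for 15x.
print(downsample_fraction(3_000_000 * 60, 3_000_000, 15))
```

For paired-end data such as the MiSeq runs, both reads of a pair would need to be kept or dropped together, which this sketch does not handle.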
To determine how accurately the platforms call SNPs, they aligned the 15x datasets from one of the genomes, Staphylococcus aureus, to the genome of a closely related strain. They then compared the SNPs called from the reads with those obtained by aligning the S. aureus genome to the related strain.
Quail said they were aware that data processing might influence the results, so to be as fair as possible, they used a generic SAMtools pipeline for the Illumina data, even though more optimized tools for Illumina were available. For PGM data, they used both SAMtools and the Ion Torrent variant calling pipeline and obtained essentially the same results. For PacBio data, they used the platform's SMRT portal pipeline.
Judged by a variety of criteria, the three platforms showed a number of performance differences.
Genome Coverage, GC-Bias, Error Rate, SNP Calling, Cost
In terms of genome coverage, the "most dramatic" observation was that the PGM failed to cover about 30 percent of the extremely AT-rich Plasmodium falciparum genome, having trouble in particular with introns and AT-rich exons.
In addition, the PGM gave "slightly more uneven coverage" than the other platforms for the GC-rich Bordetella pertussis genome, according to the paper, but Quail said it provided unbiased coverage on the other test genomes.
The scientists also reported they were able to "profoundly" reduce the bias of the PGM for sequencing P. falciparum by using a different enzyme — Kapa HiFi — for the library prep amplification step.
The PacBio platform provided a dataset with "quite even coverage" for GC- and extremely AT-rich contexts but demonstrated "slight but noticeable" unevenness of coverage and bias toward GC-rich sequences for the S. aureus genome, they said.
All three platforms gave "equal coverage with unbiased GC representation" for the GC-neutral Salmonella pullorum genome.
In terms of accuracy, the MiSeq performed best, with a raw error rate below 0.4 percent. The PGM had a raw error rate of 1.78 percent, and the PacBio RS of 13 percent.
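Raw error rates like these are often expressed on the Phred quality scale, where Q = -10 × log10(error probability). The sketch below applies that standard conversion to the three figures quoted above. Treating the MiSeq's "below 0.4 percent" as exactly 0.4 percent is an assumption made here, so its value is a lower bound on quality; the function name is also hypothetical.

```python
import math


def phred_from_error_rate(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in the interval (0, 1]")
    return -10 * math.log10(error_rate)


# Raw error rates as reported in the article (MiSeq figure is an upper bound).
raw_error_rates = {"MiSeq": 0.004, "PGM": 0.0178, "PacBio RS": 0.13}

for platform, rate in raw_error_rates.items():
    print(f"{platform}: about Q{phred_from_error_rate(rate):.1f}")
```

On this scale the platforms come out at roughly Q24, Q17.5, and Q9, respectively.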
The MiSeq tended to produce errors after long homopolymer tracts of more than 20 bases but made very few errors in short homopolymers. It also made strand-specific errors near the GGC motif, especially in the second read. Quail said that these motif-specific errors could possibly be resolved by using a new polymerase or sequencing buffer.
The PGM did not generate any reads for homopolymers longer than 14 bases and made mistakes in homopolymers longer than 8 bases. It also made strand-specific errors, which the researchers could not associate with any motifs.
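To see which regions of a reference would be affected by homopolymer limits like these, one could scan it for runs of a single base. The sketch below is a minimal illustration of that idea; the function name and the toy sequence are assumptions, and the thresholds simply mirror the lengths mentioned in the article.

```python
from itertools import groupby


def homopolymer_runs(sequence: str, min_length: int):
    """Return (start, base, length) for every single-base run of at least min_length."""
    runs = []
    position = 0  # 0-based start of the current run
    for base, group in groupby(sequence.upper()):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((position, base, length))
        position += length
    return runs


# Toy sequence with one run of ten T bases, for illustration only.
toy_reference = "ACG" + "T" * 10 + "AC"

# Runs longer than 8 bases, where the PGM was reported to make mistakes.
print(homopolymer_runs(toy_reference, min_length=9))
# Runs longer than 14 bases, for which the PGM generated no reads.
print(homopolymer_runs(toy_reference, min_length=15))
```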
Quail said he expects the error rate for the PacBio to decrease in the future as the technology matures.
With regard to SNPs, the researchers found that the overall SNP calling rate was slightly higher for the PGM than the MiSeq, with 82 percent of true SNPs called compared with about 76 percent, but the rate of false positive SNPs was also higher for the Ion Torrent than the MiSeq.
SNP calling from PacBio data "proved more problematic," the authors noted, as the existing tools are optimized for short-read data and not PacBio's error-prone long-read data. The 15x coverage was also not sufficient for SNP calling with the RS, so they used 190x coverage instead. Given those challenges, PacBio SNP calling was "not as accurate" as for the other platforms.
The authors cautioned that variant calling is "a highly subjective process" that depends on the software and parameters used, so the SNP calling rates they reported "are purely indicative and results obtained with each sequencing platform will vary." Optimizing the SNP- and indel-calling algorithm for each application and platform "would always be recommended."
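The two SNP metrics discussed above, the share of true SNPs that were called and the number of false positives, can be computed from a set of called variants and a truth set. The sketch below shows one way to do so. It assumes SNPs are represented as (position, alternate base) pairs and reports false positives as a fraction of all calls; the article does not state how the authors defined their false positive rate, so that choice is an assumption, as are the function name and the toy data.

```python
def evaluate_snp_calls(called, truth):
    """Compare called SNPs with a truth set of (position, alt_base) pairs."""
    called, truth = set(called), set(truth)
    true_positives = called & truth
    false_positives = called - truth
    return {
        # Fraction of true SNPs that were recovered (the "calling rate").
        "call_rate": len(true_positives) / len(truth) if truth else float("nan"),
        "false_positives": len(false_positives),
        # Fraction of all calls that are not in the truth set.
        "false_positive_fraction": (
            len(false_positives) / len(called) if called else float("nan")
        ),
    }


# Toy data for illustration only.
truth_snps = {(10, "A"), (20, "C"), (30, "G"), (40, "T")}
called_snps = {(10, "A"), (20, "C"), (30, "G"), (55, "T")}
print(evaluate_snp_calls(called_snps, truth_snps))
```

As the authors' caution implies, the numbers such a comparison produces depend heavily on the variant caller and the filtering thresholds applied before it.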
They also compared criteria like instrument cost, sequence yield per run, cost per gigabase, run time, reported accuracy, read length, availability of paired reads, and insert size.
Of note are the differences in sequencing cost, based on list prices. The MiSeq came out cheapest, at $502 per gigabase, followed by the PGM, at $1,000 per gigabase using the Ion 318 chip, and the PacBio, at $2,000 per gigabase. All three platforms produce data at a greater cost than the Illumina GAIIx, at $148 per gigabase, and the HiSeq 2000, at $41 per gigabase.
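To put the list-price figures above in proportion, the short sketch below expresses each platform's cost per gigabase as a multiple of the HiSeq 2000's. The prices are those quoted in the article; the dictionary and function names are assumptions made for this example.

```python
# Cost per gigabase in US dollars, as quoted in the article (list prices).
cost_per_gb_usd = {
    "MiSeq": 502,
    "PGM (Ion 318 chip)": 1000,
    "PacBio RS": 2000,
    "GAIIx": 148,
    "HiSeq 2000": 41,
}


def relative_cost(costs: dict, baseline: str) -> dict:
    """Express every cost as a multiple of the baseline platform's cost."""
    base = costs[baseline]
    return {name: cost / base for name, cost in costs.items()}


for name, ratio in relative_cost(cost_per_gb_usd, "HiSeq 2000").items():
    print(f"{name}: {ratio:.1f}x the HiSeq 2000 cost per gigabase")
```

By this measure the MiSeq costs about 12 times as much per gigabase as the HiSeq 2000, the PGM about 24 times, and the PacBio RS about 49 times.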
Regarding the PacBio, the researchers commented that the DNA input requirements "can be prohibitory" compared to the two other platforms. While the PacBio needs on the order of a microgram of DNA, the MiSeq and PGM can get away with 100 nanograms of DNA or less.
They also noted that PacBio protocols favor the preferential loading of smaller library constructs, "resulting in average subread lengths that are significantly shorter than the often quoted average read lengths."
Overall, the researchers concluded that many of the applications developed for Illumina sequencing "should translate well and be equally practicable" for the Ion Torrent PGM, with the exception of techniques that involve steps on the Illumina flow cell, such as FRT-seq and OS-seq.
The current size of the DNA fragments for the PGM library, up to 250 bases, is probably too short for the kind of accurate de novo assembly of mammalian genomes that has been shown using Illumina data, they noted.
On the other hand, the Illumina platforms have trouble sequencing monotemplates, in which most fragments have exactly the same sequence, whereas the Ion Torrent can sequence them without problems, they said.
The long reads of the PacBio should prove useful for de novo sequencing and for the analysis of linkage of alternative splicing and variants across long amplicons, they said. It also has potential for detecting epigenetic modifications. While they found the long reads useful for scaffolding de novo assemblies, "our experience suggests that this is currently not fully optimized and extensive method development is still required," they said.
"The yield, sample-input requirements and amplification-free library prep of PacBio potentially make it unsuitable for counting applications and for applications involving significant prior enrichment such as exome sequencing and ChIP-seq," they added.
Sejal Sheth, vice president of strategic marketing for PacBio, told In Sequence that the comparison focused on standard metrics used to evaluate short-read technologies, "which are not necessarily as relevant for long read technologies." As such, the study "does not assess many of the applications that PacBio customers are pursuing, such as phasing SNPs over long ranges, reading complete repeat stretches, identifying large structural variation, or directly reading base modifications."
Sheth also noted that the data for the comparison were generated using the C1 chemistry, not the current C2 chemistry, which provides longer reads and higher throughput, delivers higher consensus accuracy at lower coverage, and requires less input DNA.
In addition, new algorithms from PacBio and others are available now that make SNP calling more effective and accurate, for example the GATK software from the Broad Institute, Sheth said.
Sheth also said that "for the purpose of comparing platforms, it is consensus accuracy, not raw read accuracy, that is the relevant metric," which the study did not measure.
The company is also working on a magnetic bead system that will enable longer DNA fragments to be preferentially loaded into the SMRT cells, among other improvements for the system (see story, this issue).
According to an Ion Torrent spokesperson, the study was a fair comparison at the time the experiments were performed. "This is a fine paper recounting where we were in 2011," he said. However, the error rate of the PGM has since improved and is only 0.4 percent in Ion Torrent's internal datasets from June.
Rob Tarbox, Illumina's market manager for sequencing, said the company believes the comparison is fair and validates the performance of the MiSeq platform seen in another recent comparison, by researchers at the University of Birmingham (IS 4/24/2012).
Quail said his team has rerun the test genomes whenever there were "significant updates" for any of the platforms, and "essentially the results are still relevant."
He noted a number of improvements: a reduction of the PGM bias for sequencing P. falciparum was "immediately evident" with the new 200-base pair kits, for example.
The PacBio C2 chemistry has also improved the SMRT cell yield, and the best observed accuracy has been 88 percent, whereas accuracy was typically 85 percent with the C1 chemistry. Other improvements have reduced the data variability and increased the mean subread length, he said. In addition, "recent advances in using short read data to correct long PacBio reads have dramatically improved the utility of PacBio data for de novo assembly," he added.
Following the evaluation, the Sanger purchased and installed three MiSeq instruments, which are now in production, along with the center's 27 HiSeqs, two GAIIx, and two 454 GS FLX machines.
The main reason for going with the MiSeq instead of the PGM was that the Sanger's sequence production is Illumina-based, so it could use its existing infrastructure and pipelines to feed into the MiSeq instruments, he said.
In addition, the institute maintains two PGMs, one MiSeq, and the PacBio in its R&D facility.