Both Life Technologies' Ion Proton and Illumina's HiSeq 2000 perform well for exome sequencing when it comes to calling SNPs but have problems detecting insertions and deletions accurately, according to a recent platform comparison by researchers at the National Cancer Institute.
The study, which appeared online earlier this month in Human Genetics and is likely the first published performance comparison of the two platforms, compared variants called from whole-exome sequencing data generated on the Proton and the HiSeq for a HapMap CEPH family trio. It also compared the variants to whole-genome sequencing data from Complete Genomics, as well as to Illumina SNP microarray data for the same trio.
According to Joe Boland, director of research and development at the NCI's Cancer Genomics Research Laboratory and the lead author of the study, the goal of the project was to assess whether their lab could use the Ion Proton, which it received last September, routinely for exome sequencing as a "viable alternative" to the HiSeq, which he said is "the gold standard in research right now."
"And to our excitement, the answer was 'yes,' the Protons can stand toe to toe with the HiSeqs," he told In Sequence. "For a new platform, given our experience with PGMs, we expected it to be competitive but not as good as the data showed it to be — it was ahead of where we thought it would be."
While both platforms performed well for calling SNPs, both had trouble with indels. "Both have their pluses and both have their minuses in calling indels, and I think if you do the right legwork after [generating the data], then either platform is adequate right now," Boland said.
The NCI lab is currently equipped with six Ion Torrent PGMs, four Ion Protons, one HiSeq 2000, one HiSeq 2500 and one MiSeq.
The researchers generated the data for the study in December and January and presented preliminary results at the Advances in Genome Biology and Technology conference in February (IS 2/26/2013). The lab now performs whole-exome sequencing on HiSeqs as well as Protons, depending on the availability of the machines and how quickly results are needed. Many of its projects are family exome studies, and small families are sometimes moved to the Proton if the HiSeqs are booked up, Boland said. "Because the quality now is comparable, our investigators don’t have a problem if we switch from one to the other." The lab is conducting transcriptome sequencing studies "predominantly on the Protons," he said.
The lab's per-sample costs for exome sequencing on the two platforms are within $150 of each other, Boland said, "so price is not a consideration for running one platform versus the other."
For their comparison, the researchers sequenced the exome of a CEPH family trio on the Ion Proton and the HiSeq 2000. To capture the exome DNA, they used Life Tech's TargetSeq Exome v2 for the Proton, which comprises about 50 megabases of sequence, and the NimbleGen SeqCap EZ Exome v3 for the Illumina, which captures about 64 megabases of sequence. They restricted their analysis to the 43 megabases of sequence that overlap between the two exome capture kits.
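Restricting the analysis to the shared capture region amounts to intersecting the two kits' target intervals. The following is a minimal sketch of that step, not the researchers' actual method; the function name and the toy coordinates are hypothetical, and real kit designs are distributed as BED files with one interval list per chromosome.

```python
def intersect_intervals(a, b):
    """Intersect two sorted, non-overlapping lists of half-open (start, end) intervals."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical target intervals on a single chromosome (not real kit coordinates).
kit_a = [(100, 200), (300, 450), (600, 700)]
kit_b = [(150, 320), (400, 650)]

shared = intersect_intervals(kit_a, kit_b)
shared_bases = sum(end - start for start, end in shared)
print(shared, shared_bases)
```

Summing the lengths of the intersected intervals across all chromosomes is what would yield a figure like the 43 megabases quoted above.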
The Proton generated at least 9 gigabases of data per sample, with about 80 percent of the reads on target. To call variants, the data were run through the Ion Reporter standard pipeline.
The HiSeq generated more than 11 gigabases of data per sample, and about 66 percent of the reads were on target. Variants were called using the GATK pipeline.
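As a back-of-the-envelope check, the reported yields and on-target rates can be converted into on-target bases and a rough mean depth over each kit's own target. This sketch assumes the article's round figures (which are lower bounds), takes each kit's quoted target size as the denominator, and ignores duplicates and uneven coverage; the function name is hypothetical.

```python
def on_target_summary(total_bases, on_target_fraction, target_size):
    """Return (on-target bases, naive mean depth over the target)."""
    on_target = total_bases * on_target_fraction
    return on_target, on_target / target_size


# Figures from the article: "at least 9 gigabases" at about 80 percent on target
# against a roughly 50-megabase kit, and "more than 11 gigabases" at about
# 66 percent on target against a roughly 64-megabase kit.
proton_on_target, proton_depth = on_target_summary(9e9, 0.80, 50e6)
hiseq_on_target, hiseq_depth = on_target_summary(11e9, 0.66, 64e6)

print(f"Proton: {proton_on_target / 1e9:.2f} Gb on target, ~{proton_depth:.0f}x")
print(f"HiSeq:  {hiseq_on_target / 1e9:.2f} Gb on target, ~{hiseq_depth:.0f}x")
```

Under these assumptions the two platforms deliver a similar amount of on-target sequence per sample (a little over 7 gigabases each), even though the raw yields differ.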
In the shared exome, the Proton called about 28,000 variants per sample on average, and the HiSeq about 34,000. About three-quarters of these were shared by both platforms.
For SNPs, the overlap between the two platforms was substantially greater than for indels. In one representative sample, both platforms called about 25,700 SNPs. In addition, 1,100 SNPs were called only by the Proton, and 7,000 only by the HiSeq.
In the same sample, both platforms called about 600 indels in common, but the Proton called another 880 indels, and the HiSeq another 920. When the researchers analyzed a subset of those platform-specific indels, they found that "many of them were potentially false positives due to alignment issues and/or homopolymer sequences."
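To make the difference between SNPs and indels concrete, the counts for the representative sample can be turned into overlap fractions. This is an illustrative calculation on the article's rounded figures, not a statistic reported in the paper; the function and key names are hypothetical.

```python
def overlap_stats(shared, only_a, only_b):
    """Summarize agreement between two call sets from shared and platform-specific counts."""
    union = shared + only_a + only_b
    return {
        "jaccard": shared / union,                # shared / all calls from either platform
        "frac_of_a": shared / (shared + only_a),  # share of platform A's calls also made by B
        "frac_of_b": shared / (shared + only_b),  # share of platform B's calls also made by A
    }


# A = Proton, B = HiSeq; counts as quoted in the article (rounded).
snps = overlap_stats(25_700, 1_100, 7_000)
indels = overlap_stats(600, 880, 920)

for name, stats in (("SNPs", snps), ("indels", indels)):
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

On these numbers, roughly three-quarters of all SNP calls are shared between the platforms, while only a quarter of all indel calls are.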
The researchers also compared SNPs and indels called by the Proton, HiSeq, and Complete Genomics and found that 66 percent of SNPs, or 23,700, were detected by all three platforms, compared with only 18 percent of indels, or 530.
Illumina had the greatest number of unique SNPs, 4,600, followed by 1,850 for Complete and 825 for the Proton.
The Proton found 830 indels that were specific to that platform, followed by 540 for Complete and 440 for Illumina. More indels were called by both the HiSeq and Complete but not the Proton (480) than by both the Proton and Complete but not the HiSeq (56). Their analysis, the scientists concluded, "identifies major discrepancies in all methods in the detection of small indels, a major challenge that necessitates advances in both the technical sequencing and/or the bioinformatics algorithms."
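The three-way figures can be cross-checked with simple arithmetic: dividing the count shared by all three platforms by the quoted percentage gives the approximate size of the combined call set, and subtracting the regions the article reports leaves the calls that fall in pairwise-only overlaps it does not list. Because the percentages are rounded, the results below are rough estimates rather than figures from the paper, and the variable names are hypothetical.

```python
def implied_union(all_three, fraction):
    """Approximate total distinct calls, given the three-way count and its share of the total."""
    return all_three / fraction


# Indels: 530 calls shared by all three platforms, quoted as 18 percent.
indel_total = round(implied_union(530, 0.18))
# Regions reported in the article: three-way, the three platform-specific sets,
# HiSeq+Complete only, and Proton+Complete only.
indel_reported = 530 + 830 + 540 + 440 + 480 + 56
indel_unlisted = indel_total - indel_reported

# SNPs: 23,700 calls shared by all three platforms, quoted as 66 percent.
snp_total = round(implied_union(23_700, 0.66))
# Regions reported in the article: three-way plus the three platform-specific sets.
snp_reported = 23_700 + 4_600 + 1_850 + 825
snp_unlisted = snp_total - snp_reported

print(indel_total, indel_reported, indel_unlisted)
print(snp_total, snp_reported, snp_unlisted)
```

For indels, the reported regions account for nearly all of the implied total, so the one region the article does not list (Proton and HiSeq but not Complete) appears to be small, on the order of tens of calls.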
When comparing the SNP genotypes of the Proton and HiSeq to SNP microarray data for two of the three trio samples, they found the concordance per sample to be high — more than 99 percent — for both platforms, indicating that these calls are of high quality.
The researchers also looked at the platform-specific variants more closely by examining the underlying read alignments. Most of the 1-base indels that were specific to the Proton were "well covered" in the Illumina data and likely represent false positives of the Proton, they wrote, noting that "while we found that the performance of the newer IonReporter 3.4 represented an improvement over the 3.2 version, there is still room for further improvement in accuracy."
Many Illumina-specific SNPs were in segmental duplications or simple repeats. Those in single-copy regions had low coverage in the Proton data and were therefore probably missed by the Proton, but there were also missing SNP calls in the Illumina data that were "clear and present" in the Proton data, they wrote.
Not everyone agrees with the study. "We respectfully disagree with the conclusions drawn from the data in this paper and have concerns on some of the methodology," Joel Fellis, market manager for sequencing systems at Illumina, told In Sequence.
For one, he said, the use of exome kits with different capture content is "a serious concern" because it resulted in "very different off-target rates and very different coverage profiles for the two platforms, putting HiSeq at a disadvantage in this comparison."
Boland said the reason they used two different capture kits is that there is no "commercially approved" protocol for using the NimbleGen or Agilent SureSelect capture kits with the Proton yet. "The thought was, 'we use everything that's approved, so people can pick it off the shelf and run with it,'" he said. Because they only analyzed the overlap regions and used identical DNA samples, "we felt that it was absolutely valid to do this."
Fellis also noted that a greater percentage of Illumina-specific variants than Proton-specific variants are found in dbSNP, indicating that the HiSeq has a lower false-positive rate than the Proton. The fact that the Proton missed many variants that were detected by the HiSeq and are also contained in dbSNP "is suggestive of a high false-negative rate for the Proton system," he said.
In their paper, the researchers pointed out that the run time for the Proton is "considerably shorter" – 11.5 hours including data processing – than for the HiSeq, which typically takes six days to run. But according to Fellis, the HiSeq 2500 in rapid mode can generate exome data for 20 samples in about 20 hours, which includes cluster amplification, and the entire workflow, including sample preparation, takes 2.5 days.
The Proton has also improved since the study was conducted. According to Mike Lelivelt, director of bioinformatics and software products at Ion Torrent, customers can now sequence two exomes per PI chip instead of one because the output per chip has increased.
He said the company is "pleased with the performance of the Proton system for exome sequencing" in the study, which he said showed "market-leading SNP accuracy" though "making accurate indel calls is tough for all platforms." He also noted that indels make up a much smaller proportion of all variants than SNPs.
Boland said his group is now conducting additional platform comparisons that focus on whole-transcriptome sequencing as well as amplicon sequencing. They are also further analyzing the platform-specific SNPs and indels to find out why the other platform missed them. He plans to present initial results from this work in the fall.