In an effort to determine which second-generation sequencing platform to get for their institution, researchers at Columbia University have compared the ability of the Illumina Genome Analyzer and Applied Biosystems SOLiD sequencers to accurately detect mutations in a mutant strain of C. elegans.
In their study, which is based on data generated last spring and was published in Public Library of Science One two weeks ago, the scientists found that, with similar coverage and mapping criteria, the SOLiD platform produced fewer false-positive variant calls than the GA, while the GA produced fewer false negatives than the SOLiD.
The scientists acknowledge that their comparison is merely a “snapshot” taken during “a technological tornado,” and that the results are likely already outdated.
Based on their results, and in order to serve the needs of a variety of users, the Columbia researchers chose to acquire both an Illumina GA and an ABI SOLiD.
The study started last spring “with a basic dilemma that many institutions face: we had a budget to get a sequencer, and it spurred discussions among the faculty regarding what technology we should pursue, and whether we should pursue only one technology or both,” said Itsik Pe’er, a professor of computer science at Columbia University and an author of the study. “There are trade-offs in terms of keeping diversity versus maintaining different pipelines.”
At the time, Columbia had already ordered a 454 Genome Sequencer, so the decision was between an Illumina Genome Analyzer and an ABI SOLiD.
The scientists decided to compare the two platforms side by side, testing their ability to detect mutations in C. elegans by whole-genome sequencing. In early April, they sent identical DNA samples of a mutant C. elegans strain to Illumina’s sequencing service for analysis on the GA, and to commercial service provider Agencourt Bioscience for sequencing on the SOLiD platform.
Using the GA I — the first version of the Genome Analyzer, which has since been replaced by the GA II — Illumina provided 125 million 35-base paired reads. Agencourt produced 256 million 25-base paired reads on the SOLiD platform. It was unclear as of press time which version of the SOLiD they used.
The scientists then mapped the reads using the Maq algorithm, which was developed by the Wellcome Trust Sanger Institute. They also aligned the SOLiD data using ABI’s own alignment tool, corona-lite.
The initial idea, Pe’er explained, was to use the same analysis tools in the comparison, but it turned out that Maq “was developed with the GA in mind, so its handling of the ABI data does the platform injustice” because it lost one base of the already short 25-base reads. Using ABI’s software “made a noticeable difference” in terms of the number of reads mapped accurately, he said.
After mapping, the average coverage of “good” reads — defined as reads with no more than three mismatches that occurred either in “good” pairs or single ends — was the same for both platforms, at about 25-fold.
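The "good"-read criterion described above can be sketched in a few lines of Python. This is a minimal illustration rather than the researchers' actual pipeline: the Read record, its field names, and the pair_status labels are hypothetical, and average coverage is taken simply as the bases contributed by good reads divided by the length of the reference region.

```python
from dataclasses import dataclass

MAX_MISMATCHES = 3  # threshold quoted in the article


@dataclass
class Read:
    length: int       # mapped read length in bases
    mismatches: int   # mismatches against the reference
    pair_status: str  # hypothetical labels: "good_pair", "single_end", "bad_pair"


def is_good(read: Read) -> bool:
    """Apply the article's definition of a 'good' read."""
    return (read.mismatches <= MAX_MISMATCHES
            and read.pair_status in ("good_pair", "single_end"))


def average_coverage(reads, region_length: int) -> float:
    """Bases contributed by good reads divided by the region length."""
    good_bases = sum(r.length for r in reads if is_good(r))
    return good_bases / region_length


# Toy example with made-up values: four 35-base reads over a 35-base region.
toy_reads = [
    Read(35, 0, "good_pair"),
    Read(35, 3, "single_end"),
    Read(35, 4, "good_pair"),  # rejected: too many mismatches
    Read(35, 1, "bad_pair"),   # rejected: not a good pair or single end
]
print(average_coverage(toy_reads, region_length=35))
```

In the toy example only the first two reads pass the filter, so the region is covered twofold.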
In their analysis, the researchers focused on a 4-megabase region, an area to which they had previously mapped the mutation causing the phenotype of the mutant C. elegans strain. They published results from the Illumina GA data in an earlier study in Nature Methods last summer.
Using identical filtering criteria, the scientists found that the GA detected 31 true single-nucleotide variants in this area, which they confirmed by Sanger sequencing. However, the platform also led to four false-positive variants.
The SOLiD detected 23 true variants, among them one that the GA had not detected because the coverage was too low, and no false positives. However, the platform missed nine of the confirmed variants, mainly because of insufficient depth of coverage. The other 22 true variants detected by the SOLiD were also detected by the GA.
Both platforms detected the variant that gives rise to the mutant phenotype of the C. elegans strain.
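The variant counts reported above can be turned into rates with a short calculation. The sketch below is an illustration, not part of the study; it assumes a total of 32 confirmed variants in the interval (the SOLiD's 23 detected plus 9 missed), which implies that the GA, with 31 detected, missed one. The function and variable names are hypothetical.

```python
def summarize(true_pos: int, false_pos: int, false_neg: int) -> dict:
    """Compute basic detection rates from confirmed-variant counts."""
    confirmed = true_pos + false_neg  # variants validated by Sanger sequencing
    called = true_pos + false_pos     # variants the platform reported
    return {
        "sensitivity": true_pos / confirmed,
        "false_negative_rate": false_neg / confirmed,
        "precision": true_pos / called,
    }


# Counts from the article; the GA's single false negative is inferred
# from the assumed total of 32 confirmed variants.
ga = summarize(true_pos=31, false_pos=4, false_neg=1)
solid = summarize(true_pos=23, false_pos=0, false_neg=9)

for name, stats in (("GA", ga), ("SOLiD", solid)):
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Under these assumptions the GA recovers 31 of 32 confirmed variants while 4 of its 35 calls are false positives, and the SOLiD recovers 23 of 32 with no false positives.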
The Illumina platform also called more small insertions and deletions in the 4-megabase interval than the SOLiD, though the researchers did not validate all of these indels.
Regarding sequencing errors in mapped reads, the SOLiD seems to have a leg up. The scientists estimated an error rate of 0.036 percent for the SOLiD, and an error rate of 0.6 percent for the GA.
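To put the two error rates in perspective, the following back-of-the-envelope sketch compares them and estimates the expected number of erroneous bases per mapped read. It assumes, purely for illustration, that errors are independent and spread evenly across read positions, and it uses the nominal read lengths of 35 and 25 bases reported for the two data sets; the names are hypothetical.

```python
# Figures quoted in the article.
READ_LENGTH = {"GA": 35, "SOLiD": 25}              # bases per read
ERROR_RATE_PERCENT = {"GA": 0.6, "SOLiD": 0.036}   # errors in mapped reads


def expected_errors_per_read(platform: str) -> float:
    """Expected erroneous bases per read, assuming uniform, independent errors."""
    return READ_LENGTH[platform] * ERROR_RATE_PERCENT[platform] / 100


# How many times higher the GA estimate is than the SOLiD estimate.
fold_difference = ERROR_RATE_PERCENT["GA"] / ERROR_RATE_PERCENT["SOLiD"]

print(f"GA/SOLiD error-rate ratio: {fold_difference:.1f}")
for platform in READ_LENGTH:
    print(platform, f"{expected_errors_per_read(platform):.3f} expected errors per read")
```

By this estimate the GA's error rate is roughly 17 times that of the SOLiD, though both correspond to well under one erroneous base per read.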
With regard to sample prep, the scientists pointed out that “it has been previously noted that the emulsion PCR step required for the SOLiD platform is cumbersome and technically challenging, which contrasts the apparently straightforward library preparation step for the GA.”
They did not include a cost comparison in their study because the services they negotiated for this project may not accurately reflect the cost for other users, according to Pe’er.
Based on the results of the study, users may come to different conclusions as to which platform to choose for a specific application, according to Pe’er. “It really depends on the balance of resources and specific tasks,” he said.
For the application of identifying mutations in C. elegans mutants, the scientists write in their study, “a false negative, i.e. the missing of the one phenotype-causing mutation, is not tolerable; therefore, the GA platform appears the preferable choice for our system.”
But according to Kevin McKernan, senior director of scientific operations for SOLiD, the comparison was not entirely equal, and certain choices in experimental design and analysis may have contributed to the results. Not all details of the analysis were available from the article, he noted.
For example, the SOLiD SNP caller version “can have a big impact on false positives/false negatives,” he said in an e-mail message, and it is unclear from the article which version the scientists used. Also, the false negative rate could be a function of the SNP calling criteria, which should differ for different read types, he said.
In addition, the study employed different types of libraries — a 5-kilobase circularized library for SOLiD, and a 500-base sheared library for the GA — which require “two drastically different forms of DNA manipulation,” according to McKernan.
Finally, while the GA data was produced by Illumina itself, the SOLiD data came from a third-party provider, he said.
Illumina did not respond to a request for comment on the study before deadline.
The results of the study are also already outdated, because the technologies have since improved in read length, accuracy, and throughput. “The problem with such a paper is that it produces a snapshot of the technology, and by the time the snapshot is developed and put into print, it’s already outdated,” Pe’er acknowledged. “We had the data, so we made it public. But the race continues between these players and others, as you know.”
Based on their study, the Columbia researchers decided to acquire both an Illumina GA and an ABI SOLiD, which recently arrived. “It’s pretty fair to say that different tasks would benefit from different technologies, and it seemed worthwhile investing in having the diversity in house, and allowing different investigators that have different needs to access the technology of choice,” according to Pe’er.
The Columbia study is not the only project exploring different sequencing platforms for mutation detection. The 1000 Genomes Project, for example, has been using three different platforms — 454’s GS FLX, Illumina’s GA, and the ABI SOLiD — in its pilot phase (see In Sequence 1/22/2008).
Richard Durbin, the 1000 Genomes Project’s co-chair, told In Sequence this week that the project has not yet completed a comparison of the different platforms. “From our observations so far, both GA and SOLiD can identify sequence variants with high accuracy at relatively low cost, as this paper says,” he said in an e-mail message.
Two weeks ago, the 1000 Genomes Project released some variants from one pilot project, he said, which sequenced two trios of parents and child at high coverage. On one of the trios, SNPs were called from Illumina data; for the daughter of the other trio, SNPs and small indels were called from SOLiD data “with some confirmation from Illumina,” according to Durbin.
The 1000 Genomes Project plans to release more variant data at the end of the month, he said, including data from samples sequenced at low coverage and more complete data from both trios.
In another study published last fall, researchers from Agencourt Bioscience, Boston College, Applied Biosystems, and the Department of Energy’s Joint Genome Institute compared the 454, GA, and SOLiD sequencing platforms for resequencing the genome of a yeast strain and mapping its SNPs comprehensively (see In Sequence 9/16/2008).
That study concluded that all three platforms are equally suited to the task at sequence coverage above 10- to 15-fold.