Researchers from Agencourt Bioscience, Boston College, Applied Biosystems, and the Department of Energy’s Joint Genome Institute have published a paper comparing how well second-generation sequencing platforms made by ABI, Illumina, and Roche’s 454 Life Sciences resequence the genome of a yeast strain to map its SNPs comprehensively.
The study, which used earlier versions of the three technologies than are currently available, concluded that all three are equally suited for the task at sequence coverage above 10- to 15-fold.
Though the researchers did not provide a cost analysis in their paper, which marks the first time that a direct comparison of the three platforms has appeared in a peer-reviewed journal, prices quoted by sequencing service providers suggest that the short-read technologies have an edge over 454 in this particular application.
“This study really shows the power of the new technologies for finding very rare [mutational] events,” said Gabor Marth, an assistant professor at Boston College and an author of the study. He pointed out that the strain he and his colleagues sequenced only has one SNP per megabase, a lower SNP density than the human genome.
The study, which appears online in Genome Research this month, has been long in the making (see In Sequence 3/6/2007). Paul Richardson, former program head of R&D and head of the microbial program at JGI, and Doug Smith, director of science and technology at Agencourt Genomic Services, Agencourt’s sequencing facility, conceived of the study about two years ago, according to Richardson.
At the time, JGI had 454 and Illumina instruments on site, “and we were trying to get a better idea of which ones of those were useful for different applications, and we also wanted to compare them to the SOLiD,” he told In Sequence last week.
For their comparison, they chose to sequence a mutant strain of Pichia stipitis, a haploid yeast with a 15.4-megabase genome, which is unusually efficient in converting xylose into ethanol.
The team mapped the unpaired sequence data to the genome of a reference strain of P. stipitis that was sequenced by the Sanger method and that JGI and its collaborators published last year.
The data for the latest project was generated over a period of about a year, starting two years ago, according to Richardson, who joined workflow automation company Progentech as vice president of R&D this spring.
JGI sequenced the mutant strain in a single run on an Illumina Genome Analyzer “classic,” generating 826 megabases of filtered data, or 44.2-fold coverage from aligned reads; Agencourt provided sequence data obtained from its 454 Genome Sequencer FLX platform, generating 199 megabases of filtered data in two runs, or 10.8-fold coverage from aligned reads; while ABI produced data on the first version of its SOLiD system, generating 7.9 gigabases of unfiltered data in a single run, or 175-fold coverage from aligned reads.
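As a rough check on these figures, nominal fold coverage is the number of sequenced bases divided by the genome size. The sketch below is not from the paper. It assumes the 15.4-megabase genome size quoted above, and the function and variable names are illustrative. Note that the SOLiD yield is reported as unfiltered data while the other two are filtered, so the aligned fractions are not strictly comparable.

```python
# Genome size of P. stipitis in megabases, as quoted in the article.
GENOME_MB = 15.4


def fold_coverage(yield_mb, genome_mb=GENOME_MB):
    """Nominal fold coverage: sequenced megabases divided by genome megabases."""
    return yield_mb / genome_mb


# (yield in megabases, aligned-read coverage reported in the article)
runs = {
    "Illumina GA classic (filtered)": (826, 44.2),
    "454 GS FLX (filtered)": (199, 10.8),
    "SOLiD v1 (unfiltered)": (7900, 175),
}

for platform, (yield_mb, aligned_fold) in runs.items():
    nominal = fold_coverage(yield_mb)
    # Share of the nominal coverage that survives as aligned-read coverage.
    aligned_share = aligned_fold / nominal
    print(f"{platform}: nominal {nominal:.1f}x, "
          f"aligned {aligned_fold}x ({aligned_share:.0%} of nominal)")
```

The nominal values come out to roughly 54-fold, 13-fold, and 513-fold, respectively. The gap to the aligned-read coverage reflects reads that could not be placed uniquely on the reference and, for SOLiD, reads that were filtered out.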
Marth’s team mapped the Illumina and 454 reads to the reference genome using its Mosaik alignment program. Since Mosaik, at the time, was unable to align SOLiD data, which uses two-base encoding or “color-space,” ABI analyzed the SOLiD data using its own SOLiD alignment tool.
The scientists then screened the Illumina and 454 read alignments for SNPs using the Gigabayes program, a new version of Marth’s Polybayes software; and the SOLiD color-space alignments using ABI’s own mutation-analysis software.
In total, the three technologies discovered 17 candidate mutations that differed from the reference genome, which were all confirmed by Sanger sequencing. Three of them turned out to be mistakes in the reference sequence.
At 10-fold sequence coverage, the SOLiD data resulted in zero false-positive, or spurious, SNPs and zero false-negative, or missed, SNPs.
Illumina’s data, on the other hand, yielded two false-positive and zero false-negative calls at 13-fold coverage, and zero errors at 19.4-fold coverage.
The 454 data, at 10.8-fold coverage, generated one false-positive SNP, which “most likely” resulted from a PCR error during sequence library construction, according to the paper, and no false negatives.
Even though ABI was the only vendor who generated and analyzed its own data for the study, all three vendors were aware of the project and were able to comment on the data prior to publication of the study, Richardson said. “It was a concern, but we tried to be as even-handed as possible.”
“Of course we know from experience that typically, the machine manufacturers can sequence the best [on their platform],” said Marth. “So whenever the data comes from them directly, that’s basically the best quality.”
However, because the study is based on a single dataset from each platform, the differences in the results are not statistically significant, Marth said. Also, he pointed out, the study used a different analysis pipeline for the SOLiD data than for the other two platforms, making the results less comparable.
“There was not a clear winner,” he said. “We were able to find the same mutations with all the platforms.”
“Illumina and ABI were very close in their ability to detect mutations at the lower coverage levels,” Richardson said. “All three were equally good at finding them at the higher coverage levels. […] I think the take-home message is that you need probably 15-fold-ish data to be absolutely sure you have got most” of the mutations.
The technologies differed slightly in how well they covered the P. stipitis genome. In order to map the unpaired reads from the three technologies uniquely, the scientists had to mask repeat regions. Because of its longer reads, the 454 technology could cover a larger fraction of the genome, 96.7 percent, than the two other technologies, which covered 93.2 percent.
However, the researchers found that the distribution of sequence coverage across the genome was “similar” for the three sequencing technologies, though they all deviated from a Poisson distribution, suggesting that “there are regions of the Pichia genome that are more facile to sequence than others,” according to the paper.
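To illustrate what a deviation from a Poisson distribution means here: if reads landed uniformly at random, per-base depth would follow a Poisson distribution whose variance equals its mean. The sketch below is a generic illustration, not the paper's analysis, which the article does not describe. The depth values are made up for the example, and all function names are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """Probability of observing depth k under a Poisson model with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def prob_depth_below(threshold, mean):
    """Probability that a base is covered by fewer than `threshold` reads."""
    return sum(poisson_pmf(k, mean) for k in range(threshold))


def dispersion_index(depths):
    """Variance-to-mean ratio; close to 1 for Poisson, above 1 if coverage is uneven."""
    n = len(depths)
    mean = sum(depths) / n
    variance = sum((d - mean) ** 2 for d in depths) / n
    return variance / mean


# Under an ideal Poisson model at 10.8-fold mean coverage,
# very few bases would be expected to go completely uncovered.
print(f"P(depth = 0 at 10.8x) = {prob_depth_below(1, 10.8):.2e}")
print(f"P(depth < 3 at 10.8x) = {prob_depth_below(3, 10.8):.2e}")

# Made-up per-base depths with mean 10, for illustration only.
toy_depths = [2, 25, 9, 4, 18, 2]
print(f"dispersion index = {dispersion_index(toy_depths):.2f}")
```

A dispersion index well above 1, as in the toy data, indicates that some regions collect more reads than a uniform random model predicts while others collect fewer. That is the kind of unevenness the paper's remark about regions being “more facile to sequence” points to.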
“There did not seem to be any specific regional biases,” Richardson noted, adding, “that’s not to say that there might not be some underlying sequence-specific biases, but we did not find any.”
But according to Michael Egholm, 454’s vice president of research and development, the results show that 454 is “the clear winner” with regard to sequencing coverage bias.
Since the Agencourt/JGI team generated its data for the study, Illumina has replaced its GA “classic” with the GA II, ABI has upgraded its SOLiD platform to version 2.0, and 454 is about to roll out its Titanium upgrade for the GS FLX. All three vendors say the new versions provide greater throughput and better data quality.
As a result, the study “is not completely relevant to today’s technology,” said Agencourt’s Smith, adding that “the results we reported in this paper would be a kind of worst-case scenario for sequencing a haploid genome [today].”
“It’s rapidly changing technology, which is why it was difficult to make too many strong conclusions,” Richardson agreed. However, based on data he has seen from upgrades of the platforms, “I think by and large, the conclusions of the paper haven’t changed.”
Cost is another factor that users likely deem important in a cross-platform comparison, but the researchers decided not to include that information in the paper. “There was a lot of discussion about that, and we wanted to try to include that, but there are several reasons we did not,” Richardson said.
One reason is that the technologies are changing rapidly, causing throughput to rise and costs per base to decline.
“But I guess the bigger issue is that the costs for running these are different for everyone,” he said. “Everyone is, really, paying different prices — and quite significantly different prices — for reagents, and the instruments themselves, in some cases.”
But according to an In Sequence poll of three commercial and academic sequencing service providers that employ more than one second-generation sequencing platform, customers currently pay less on the short-read platforms from ABI and Illumina than on the 454 platform to obtain the same amount of sequence coverage. All providers asked to remain anonymous.
One provider said it charges customers $5,000 to $6,400 for about 200 megabases of unpaired sequence data — or 13-fold coverage of the P. stipitis genome — on the Illumina GA II, and $10,700 for the same amount of sequence data on the GS FLX Titanium. Both prices include SNP and indel detection.
Another provider charges $10,000 for 400 megabases of sequence data, or 27-fold coverage of the P. stipitis genome, on the SOLiD, and $25,000 for the same amount of sequence on the GS FLX Titanium.
A third provider told In Sequence that a quarter of a SOLiD plate — which he estimated generates at least 20-fold coverage but probably 40- to 60-fold coverage of the P. stipitis genome — starts at $2,500.
He said that half a plate of sequencing on the GS FLX Titanium, which he expects will provide “at least” 15-fold sequence coverage of P. stipitis, will probably cost about $6,500. A whole run, which “should deliver well over 20x coverage,” will likely cost about $12,000.
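One way to put these quotes on a common footing is to divide each price by the fold coverage it is said to buy. The sketch below uses only the prices and coverage figures quoted above. The provider labels and the function name are illustrative, and where a provider gave only a minimum coverage (“at least” or “well over”), that minimum is used, so the resulting cost per fold is an upper bound.

```python
def cost_per_fold(price_usd, fold):
    """Dollars paid per one-fold coverage of the P. stipitis genome."""
    return price_usd / fold


# (provider label, platform, quoted price in USD, quoted fold coverage)
quotes = [
    ("Provider 1", "Illumina GA II (low end)", 5000, 13),
    ("Provider 1", "Illumina GA II (high end)", 6400, 13),
    ("Provider 1", "GS FLX Titanium", 10700, 13),
    ("Provider 2", "SOLiD", 10000, 27),
    ("Provider 2", "GS FLX Titanium", 25000, 27),
    # Provider 3 gave minimum coverage estimates, so these are upper bounds.
    ("Provider 3", "SOLiD, quarter plate", 2500, 20),
    ("Provider 3", "GS FLX Titanium, half plate", 6500, 15),
    ("Provider 3", "GS FLX Titanium, whole run", 12000, 20),
]

for provider, platform, price, fold in quotes:
    print(f"{provider}, {platform}: "
          f"${cost_per_fold(price, fold):,.0f} per fold of coverage")
```

Within each provider's quotes, the 454 figure per fold of coverage is roughly two to five times that of the short-read platform, which is consistent with the poll's overall finding. Prices across providers are not directly comparable, since only the first provider's quote is stated to include SNP and indel detection.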
But a technology and cost comparison might yield different results if the goal were to discover sequence variations other than SNPs, such as copy number variations or small indels.
“What we were not able to do in this study, which we really wanted to do, is to look at small deletions, but the nature of the data we generated at the time was mostly unpaired libraries,” said Richardson. “That limited our ability.”
Also, the P. stipitis mutant did not appear to harbor many small indels, according to Marth. In a different genome, with more of this type of variation, the results of the comparison “might have looked a little different,” he said.
Whole-Genome Mutation Analysis
The study’s authors believe that whole-genome sequencing will soon become a widely used method for characterizing model organisms with mutant phenotypes, even those with more complex genomes than yeast.
“With another tenfold increase in throughput, this will be very cheap, and it could become routine” for model organisms such as C. elegans, according to Marth.
“We are going to see more and more of this in larger and larger genomes, and probably more complex genomes,” Richardson said. “It’s going to be much easier and more cost-effective to just do a whole genome scan and mutation profiling over a more targeted PCR-type approach, and we may have already reached that point in certain cases.”
But according to Egholm, the mapping approach used in this study has its limits when it comes to discovering structural variations. He said that 454 actually provided a de novo assembly of the P. stipitis genome for the study, which could reveal potential structural variations, but was not included in the paper. “We firmly believe that all resequencing will eventually move to be based — at least in part — on de novo sequencing,” he said.
Agencourt already offers whole-genome sequencing services for mutation analysis on its 454 and its SOLiD platforms, depending on a customer’s needs.
“I think that application is going to be very important, especially for bacterial and fungal organisms, where it’s very efficient,” Smith said.
The company plans to perform more platform comparisons internally, he said.