As second-generation sequencers from Roche/454, Illumina, and Life Technologies compete for targeted sequencing market share, researchers at the Scripps Research Institute and the J. Craig Venter Institute recently published an evaluation of the three platforms for this application.
The two teams, which used versions of the platforms that have since been updated, found that each platform suffered from non-uniform coverage and inherent systematic errors, making it difficult to combine data from different platforms in the same study.
The research, which appeared two weeks ago in Genome Biology, began in October 2007 when the researchers sought to compare the ability of the three platforms to perform targeted sequencing in populations.
At the time, "it was clear from the whole-genome association studies that were coming out that there were going to be intervals in the genome associated with traits, and that deeper sequencing in populations was going to be required in order to try to identify the functional, or causative, variants underlying those associations," senior author Kelly Frazer, director of genomic biology at the Scripps Genomic Medicine program, a collaboration between Scripps Health and the Scripps Research Institute, told In Sequence last week.
In their study, the researchers used unpaired reads from the Roche/454 GS FLX, Illumina Genome Analyzer I, and Applied Biosystems SOLiD 1 platforms to sequence a 260-kilobase interval in four individuals.
The work was split between the Scripps Institute, which prepared the DNA for the entire study by long-range PCR and performed the sequencing on its Illumina Genome Analyzer, and the Venter Institute, which sequenced the amplified DNA on its 454 machine and sent it to ABI for SOLiD sequencing.
The JCVI also had Sanger sequencing data available from a previous study that covered 88 kilobases within the 260-kilobase region, against which the researchers benchmarked their second-generation sequence data. The Scripps researchers analyzed the alignments and base calls from all three platforms.
Because all three platforms have been updated considerably since the study was performed, the results would not necessarily be the same with today's versions, Frazer pointed out.
Despite this caveat, the researchers said they discovered that the per-base coverage of the region was "very non-uniform" with each of the three platforms, and differed significantly between them.
Within each platform, the non-uniform coverage was not random but consistent across the four samples. It even persisted at very high, or "saturating," coverage, meaning that it cannot be overcome by sequencing more, according to the team.
These systematic errors tended to occur in regions that are difficult to sequence, such as repeat regions, homopolymer stretches, or regions with indels nearby.
At the time, the 454 platform had the most uniform coverage, probably due to its longer read length, according to Frazer.
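To make the distinction between random and systematic non-uniformity concrete, here is a minimal sketch of two simple checks: the coefficient of variation of per-base depth within a sample, and the correlation of depth profiles between samples. The depth values, variable names, and helper functions are hypothetical and chosen only for illustration; this is not the analysis the researchers performed.

```python
from statistics import mean, pstdev


def coefficient_of_variation(depths):
    """Spread of per-base depth relative to its mean (0 means perfectly uniform)."""
    avg = mean(depths)
    if avg == 0:
        raise ValueError("mean depth is zero")
    return pstdev(depths) / avg


def pearson(xs, ys):
    """Pearson correlation between two equal-length depth profiles."""
    if len(xs) != len(ys):
        raise ValueError("profiles must cover the same positions")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a flat profile has no defined correlation")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical read depths at the same six positions in two samples
# sequenced on the same platform (made-up numbers, not study data).
sample_a = [30, 5, 42, 8, 35, 4]
sample_b = [60, 11, 80, 15, 72, 9]  # roughly twice the depth of sample_a

print(f"CV, sample A: {coefficient_of_variation(sample_a):.2f}")
print(f"CV, sample B: {coefficient_of_variation(sample_b):.2f}")
print(f"Depth correlation A vs. B: {pearson(sample_a, sample_b):.2f}")
```

In this toy data, the deeper sample keeps much the same relative spread and the two profiles track each other closely, which is the pattern one would expect if the same positions are under-covered in every sample regardless of how much sequencing is done.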
The results suggest that it will be difficult to use data from different sequencing platforms for the same study. "You can't do all your cases on one platform and all your controls on another. It's not going to work," Frazer said.
This assessment differs from array-based genotyping studies that use different platforms, she pointed out, because in those studies, missing genotypes can be imputed.
The scientists also found that comparing base calls between second-generation sequencing platforms and genotyping arrays tended to overestimate the accuracy of the sequencing technologies, which is why they chose to compare the data to Sanger sequencing.
For a start, genotyping arrays "don't give you a sense of the false-positive rate of your platform," according to lead author Olivier Harismendy, a staff scientist in the Scripps Genomic Medicine program. Moreover, genotyping arrays "usually assess what we call 'well-behaved' bases, not those located in repeat regions, or homopolymer stretches," he said.
Comparing their second-generation data with the Sanger data, they found that all three second-generation platforms identified at least 95 percent of the variant sites. The two short-read platforms had greater sensitivity but lower specificity than 454's sequencer.
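For readers unfamiliar with the two measures, the sketch below shows how sensitivity and specificity are conventionally computed when one set of variant calls is benchmarked against a reference set treated as the truth, as the Sanger data were here. The function name, positions, and call sets are hypothetical, and the article does not describe the exact definitions the authors used, so this illustrates only the standard formulas.

```python
def benchmark_calls(region_positions, reference_variants, called_variants):
    """Compare variant calls with a reference call set over a benchmarked region.

    All arguments are iterables of base positions. The reference set
    (for example, Sanger calls) is treated as the truth.
    """
    region = set(region_positions)
    truth = set(reference_variants) & region
    calls = set(called_variants) & region

    true_pos = len(calls & truth)            # variant in both sets
    false_neg = len(truth - calls)           # reference variant that was missed
    false_pos = len(calls - truth)           # call not supported by the reference
    true_neg = len(region - truth - calls)   # non-variant in both sets

    return {
        "true_positives": true_pos,
        "false_negatives": false_neg,
        "false_positives": false_pos,
        "true_negatives": true_neg,
        # Fraction of reference variants that were found.
        "sensitivity": true_pos / (true_pos + false_neg) if truth else float("nan"),
        # Fraction of reference non-variant positions that were left uncalled.
        "specificity": (
            true_neg / (true_neg + false_pos) if len(region) > len(truth) else float("nan")
        ),
    }


# Hypothetical 1,000-base benchmarked region with made-up call sets.
region = range(1, 1001)
sanger_variants = {10, 250, 400, 777}
platform_calls = {10, 250, 400, 900, 950}

result = benchmark_calls(region, sanger_variants, platform_calls)
print(result)
```

Because non-variant positions vastly outnumber variant ones, specificity computed this way stays close to 1 even when false positives are numerous relative to true variants, so the false-positive count itself is worth reporting alongside it.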
Apart from comparing the sequencing platforms, the researchers learned how best to integrate them with sample preparation by long-range PCR amplification. For example, they found a way to overcome a bias of the short-read sequencing platforms, which tended to over-sequence the ends of the long-range PCR amplicons. They recently published that method in Biotechniques.
They also improved the calling of heterozygous bases, which is hampered by allelic imbalance, or the preferential amplification of one allele in the PCR reaction, by covering each base with two amplicons.
According to Harismendy, long-range PCR is still cost-competitive for sequencing long contiguous DNA regions, despite the fact that companies like Roche NimbleGen, Agilent Technologies, and RainDance Technologies are now offering alternative methods for multiplexed targeted selection.
"Maybe for up to 500 kilobases, long-range PCR is still competitive," he said. However, "if you want to capture the exome, no PCR is cost-effective."
The researchers decided not to include a cost comparison in their study because various factors, such as labor, reagent costs, and how efficiently a sequencing platform is used, differ significantly between labs, according to Frazer.
Based on the results of their comparison, the Scripps researchers concluded that all three sequencing platforms could benefit from improved uniformity and fewer systematic errors, and that such improvements could involve all aspects of the sequencing process, including sample prep, sequencing chemistry, and data analysis.
According to the Scripps researchers, who only use the Illumina GA in house, many of these improvements have already come for their platform of choice. "The protocol has become better, the chemistry has become simpler, the runs have more throughput, they are faster, the uniformity has become better, [and] they have changed some of their chemistry, which makes it more accurate," Harismendy said. Also, the availability of paired-end reads "makes a big difference when you want to call indels," he added.
Given the cost of switching platforms, both in money and time, and the fact that none of the three platforms was clearly superior to the others, the Scripps researchers decided to keep their Illumina sequencer, which they have since upgraded to a GA II with a paired-end module.
"For now, we are going to stay with the Illumina," Harismendy said. "We are happy with this platform, we have optimized all the steps, so we will keep that. And the third generation of sequencers is around the corner. It would be, maybe, a bad time to switch."
Frazer said that a number of research groups are currently using second-generation platforms to sequence intervals from genome-wide association studies in order to find low-frequency functional variants, and she predicted that results from such studies will be published within the next six months or so.
Her own team has already applied the approach, using their Illumina GA, in two studies of risk factors for coronary artery disease and morbid obesity. In one study, the researchers sequenced, in 25 carriers and 25 controls, a 200-kilobase region containing a haplotype that has been strongly associated with coronary artery disease.
In the other, a collaboration with Sanofi Aventis, they sequenced a 200-kilobase interval associated with the endocannabinoid pathway in 150 individuals with morbid obesity and 150 controls. Sequencing for both studies is completed, and the analysis is ongoing.