Not many laboratories have the luxury of comparing all three of the new sequencing platforms: 454/Roche’s Genome Sequencer, Illumina’s Genetic Analyzer, and Applied Biosystems’ SOLiD platform. The UK’s Wellcome Trust Sanger Institute is one of them. Earlier this month, Ian Goodhead and Chris Clee, two researchers from the institute, discussed the pros and cons of the three platforms, as well as results from an initial experimental comparison, in two separate talks at the Advances in Genome Biology and Technology conference in Marco Island, Fla.
The institute has had access to all three platforms for various amounts of time: in mid-2005, it received a 454 GS 20, and it recently added a GS FLX. In December, the Sanger Institute obtained a Solexa (now Illumina) 1G Genetic Analyzer. In addition, the institute recently began a collaboration with Applied Biosystems in which it sends the company samples for analysis on the new SOLiD platform, and is hoping to obtain one of ABI’s early-access placements later this year.
The Sanger researchers have had ample time to work with 454’s GS 20 and to get to know the instrument inside out. In their best run to date, they generated 52 megabases of data, exceeding 454 and Roche’s specification of 20 megabases per run.
Goodhead and Clee said they were happy with the instrument’s long reads and paired tags, and its ability to handle multiple samples on a subdivided picotiter plate. They also found that it collects and extracts data quickly. 454’s Newbler assembler is “very good” at assembling contigs from the instrument’s reads, Goodhead said. For example, in a sequencing project of Chlamydia trachomatis, the assembler detected a contaminating species. “We had a lab contamination in the Chlamydia sample,” Goodhead said. “We thought we were sequencing one genome whereas Newbler managed to pull out effectively 16 large contigs which matched a different genome.”
But the scientists also pointed out some challenges of the system. For a start, the instrument is very “hands-on,” requiring a lot of manual pipetting, Clee said. Roche, 454’s marketing partner, also recommends using a large amount of starting material – 3 to 5 micrograms of DNA. “That’s not always applicable to bacterial sequencing if you have got samples that are difficult to come by,” Goodhead said.
During the emulsion-breaking step of the sample preparation, there is also a risk of contaminating the sample with unwanted DNA. Roche recently changed its protocol for this step, which now involves centrifugation and might decrease the contamination risk, according to Goodhead, but “if you prepare more than one sample at a time, there is a risk that things get mixed,” he told In Sequence.
In addition, Roche recommends some ancillary equipment for the 454 instrument that adds significantly to the purchase cost. Among these additional instruments, many of which a sequencing lab would not already have, according to Goodhead, are an Agilent Bioanalyzer to analyze the DNA, a tissue lyser to create the emulsion, a Coulter counter to count the beads, and special centrifuge rotors to spin down the beads in the plates. Last but not least, 454’s technology has the well-known problem of homopolymer read errors.
Although the scientists have only had Illumina’s Genetic Analyzer in-house for a couple of months, they presented some points they do and don’t like about it. One advantage, Goodhead noted, is that the system gets by with far less sample material than the 454 system: between 100 nanograms and one microgram. The researchers also pointed out that the risk of contaminating the sample is low since all amplification steps take place inside the flow cell. They also like that each flow cell has eight channels, allowing eight samples to be run in parallel.
One of the shortcomings of the instrument is that the current read length, about 25 base pairs in the researchers’ hands, does not allow them to assemble the data de novo with current assemblers. That might change once Illumina releases its protocols for paired-end reads, which it presented at the meeting, Goodhead noted. Also, the large amount of data the instrument produces in every run, between 0.5 terabytes and 0.8 terabytes, takes up a lot of storage space and has to be moved off the instrument for analysis, which increases the analysis time, he said. In addition, the optical parts of the instrument need to be kept extremely clean, the scientists pointed out.
The Sanger scientists have had little experience with ABI’s SOLiD platform, and do not have a system in house, but they presented some of its features nevertheless. Like the other two platforms, ABI’s lets users run multiple samples per run. Unlike the other instruments, though, the SOLiD can run two sample slides in parallel and analyze the data in real time, the scientists pointed out. Paired-end reads are already available, enabling de novo assemblies, and the 2-base-encoding scheme that ABI has developed improves the system’s ability to call measurement errors, Goodhead said. He also mentioned that any cycle of a run can be repeated during the same run.
But ABI’s platform presents challenges, too. Instead of a one-base encoding system where each of four colors represents one base, the company’s 2-base-encoding system uses each of four colors to represent four of the 16 possible 2-base combinations. Decoding these colors requires knowledge of the very first base. The instrument records the colors, not bases, during the run and decodes them into bases afterwards. Scientists need to get used to this new data representation, and “the conversion between color space and base space is not trivial,” Goodhead said. Like Illumina’s instrument, the platform creates large data files that need to be stored.
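To make the color-space idea concrete, here is a minimal sketch of 2-base decoding. It is an illustration, not ABI’s software. It assumes the commonly described color assignment, in which each base gets a 2-bit code (A=0, C=1, G=2, T=3) and the color of a dinucleotide is the XOR of its two codes; the function names and the numeric color labels 0 to 3 are hypothetical choices made for this example.

```python
# Illustrative sketch of 2-base (color-space) encoding and decoding.
# Assumption: bases are coded A=0, C=1, G=2, T=3 and a color is the XOR
# of two adjacent base codes. Names here are hypothetical, not ABI's API.
BASES = "ACGT"


def encode(seq):
    """Return (first_base, colors) for a base sequence."""
    codes = [BASES.index(base) for base in seq]
    colors = [a ^ b for a, b in zip(codes, codes[1:])]
    return seq[0], colors


def decode(first_base, colors):
    """Rebuild the bases from the known first base and the color calls."""
    code = BASES.index(first_base)
    decoded = [first_base]
    for color in colors:
        code ^= color  # each color maps the previous base to the next one
        decoded.append(BASES[code])
    return "".join(decoded)


first, colors = encode("ATGGC")
print(first, colors)                 # first base plus one color per base pair
print(decode(first, colors))         # round trip back to base space
print(decode("A", [3, 2, 0, 3]))     # same read with one miscalled color
```

Under these assumptions, a single miscalled color changes every base downstream of it when the read is decoded on its own, which is one reason the conversion between color space and base space needs care.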
Head to Head to Head
To test the usefulness of the three platforms for whole-genome bacterial sequencing side by side, the researchers sequenced the same organism on all three: Streptococcus suis, a pig pathogen with a 2 megabase genome and average GC content. They compared their results to the finished genome of the organism that had been generated by capillary sequencing.
The researchers analyzed data from two GS 20 runs, creating 600,000 reads with an average read length of 98 base pairs; one run on the Illumina platform, generating 3.2 million reads of 26 base pairs; and one run on ABI’s system, obtaining 16.2 million reads with a 25 base pair read length. (Goodhead pointed out that Illumina’s platform can now generate more data per run.)
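For a rough sense of scale, the read counts and read lengths above can be turned into raw data volume and nominal fold coverage of the roughly 2 megabase genome. The sketch below is back-of-the-envelope arithmetic only. It assumes that the 600,000 reads are the total across both GS 20 runs, that the genome is exactly 2 megabases, and that every read counts toward coverage, none of which the researchers stated; the variable and function names are made up for this example.

```python
# Back-of-the-envelope coverage estimate from the figures quoted above.
# Assumptions: genome size is taken as exactly 2 Mb, and all reads are
# counted as if they aligned; real mapped coverage would be lower.
GENOME_SIZE_BP = 2_000_000

# platform -> (number of reads, read length in base pairs)
RUNS = {
    "454 GS 20 (two runs)": (600_000, 98),
    "Illumina (one run)": (3_200_000, 26),
    "SOLiD (one run)": (16_200_000, 25),
}


def fold_coverage(n_reads, read_len, genome_size=GENOME_SIZE_BP):
    """Nominal fold coverage: total sequenced bases over genome size."""
    return n_reads * read_len / genome_size


for platform, (n_reads, read_len) in RUNS.items():
    raw_mb = n_reads * read_len / 1e6
    depth = fold_coverage(n_reads, read_len)
    print(f"{platform}: {raw_mb:.1f} Mb raw, about {depth:.1f}x")
```

By this estimate, the three data sets differ severalfold in nominal depth, which is worth keeping in mind when comparing the error counts.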
Overall, all three platforms covered a similar fraction of the genome, between 97 percent and 98 percent. “Other than slight variations in the coverage down to idiosyncrasies of the instruments themselves, generally speaking, the only gaps are down to repeat structures” that make up about 3 percent of the genome, Goodhead said.
However, the platforms did differ in the number of errors they produced in the consensus sequence. The GS 20 generated approximately 400 errors in total, about 150 of which resulted from low coverage. The remaining errors were present in a large fraction of the data, and many of them resulted from single base pair deletions. Notably, there were no substitution errors. “We are confident that a lot of these [errors] are probably down to homopolymers,” Goodhead said.
Data from the Illumina instrument contained 56 errors in total, all of which resulted from low depth of coverage. “Everything that we covered above 3x showed no errors in the consensus [sequence],” Goodhead pointed out.
For the SOLiD platform, the researchers have not analyzed all the data yet. Goodhead showed that the system generated 1,049 measurement errors, which the 2-base encoding scheme reduced to 102. These remaining errors need to be analyzed further, he said, but he is confident that “that figure will decrease.”
In the coming months, the Sanger researchers plan to test the three platforms further in various ways. For example, they want to assess the platforms’ paired-end capabilities and evaluate de novo genome assembly with each one. In addition, they want to analyze the platforms’ usefulness for genotyping strains of microorganisms. They have already sequenced five Salmonella strains with the GS 20, identifying about 1,000 SNPs, which they are currently confirming by capillary electrophoresis sequencing, Goodhead said.
Sanger scientists are also developing a new version of the Gap4 genome assembly program, which will be available “shortly,” Goodhead said, and will allow researchers to view 454, Illumina, and ABI SOLiD data.
Finally, they plan to use the new technologies to finish bacterial genomes that have been sequenced by capillary electrophoresis sequencing.