Illumina and Life Technologies division Applied Biosystems said separately last week that by the end of the year, improvements in reagents and processes, software, and hardware will enable their second-generation sequencing technologies to approach 100-gigabase runs and to sequence a human genome at high coverage for $10,000 in reagent costs.
Both companies said they want to increase the read length, number of reads, accuracy, and speed of their systems; to improve barcoding for multiplexed sequencing; and to support targeted resequencing with Agilent’s new SureSelect target selection method as a front end.
The firms made their announcement at the Advances in Genome Biology and Technology meeting in Marco Island, Fla., last week. During a workshop held by Illumina, vice president and chief scientist David Bentley said that the company expects to approach a yield of 100 gigabases per run through a combination of improvements in various areas.
For a start, the company has managed to increase the number of reads per run on the Genome Analyzer in several ways. A redesigned manifold that holds the flow cell on the instrument, for example, now lets the objective move closer to the ends of the flow cell, increasing the imageable area by 20 percent, or 20 tiles per channel.
The new flow cell holder is part of a hardware upgrade kit that also includes a reagent cooler for larger reagent volumes, and can be ordered by customers now.
The company has also developed a new analysis algorithm that detects more raw clusters in the image and resolves more clusters uniquely. As an example, Bentley showed results from a paired 100-base read run that was analyzed with the old and the new software. The new algorithm increased the number of purity-filtered clusters per lane from 16 million to 24 million, and the yield from 26.1 gigabases to 38.6 gigabases.
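As a rough check on these figures, run yield can be approximated as clusters per lane times lanes per flow cell times bases read per cluster. The sketch below is illustrative only: it assumes eight lanes per Genome Analyzer flow cell, treats the quoted cluster counts as rounded, and uses a hypothetical function name.

```python
def estimated_yield_gb(clusters_per_lane, read_length, paired=True, lanes=8):
    """Approximate run yield in gigabases.

    Assumes every purity-filtered cluster produces full-length reads.
    lanes=8 is an assumption about the flow cell, not a figure from the talk.
    """
    reads_per_cluster = 2 if paired else 1
    total_bases = clusters_per_lane * lanes * reads_per_cluster * read_length
    return total_bases / 1e9


# Cluster counts quoted for the paired 100-base run (old vs. new algorithm).
old_yield = estimated_yield_gb(16_000_000, 100)
new_yield = estimated_yield_gb(24_000_000, 100)
print(f"old algorithm: {old_yield:.1f} Gb, new algorithm: {new_yield:.1f} Gb")
```

With the quoted cluster counts this gives about 25.6 and 38.4 gigabases, close to the 26.1 and 38.6 gigabases Bentley reported; the small gap is consistent with the cluster counts being rounded.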
This change in the principle of cluster detection, which will be part of an upcoming analysis pipeline upgrade to version 1.4, also increases the accuracy of the data, both for the first and the second read of a paired-end run, according to Bentley.
Illumina has also improved read accuracy by using a more accurate polymerase in the cluster-generation process, in which any error introduced early propagates to the entire cluster. The new, high-fidelity polymerase, which Illumina will start selling “later this year,” currently leads to sequencing quality values of up to Q42, according to Bentley.
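To put a quality value such as Q42 in context, the standard Phred scale defines Q as -10 times the base-10 logarithm of the per-base error probability. The sketch below assumes the quality values Bentley cited follow that convention; the function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(p)


for q in (20, 30, 42):
    p = phred_to_error_prob(q)
    print(f"Q{q}: about 1 error in {round(1 / p):,} base calls")
```

On this scale, Q42 corresponds to roughly one error in 16,000 base calls.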
Throughput can be further increased “for practically no additional cost” by increasing the density of the clusters, he said. So far, the GA has been using random, unordered arrays of clusters, but the company has now started to explore so-called “semi-ordered” arrays, taking a page from Illumina’s BeadArray technology, where beads fall into microwells arrayed on a substrate.
Unlike the beads on BeadArrays, the sequencing clusters on the semi-ordered arrays are “not perfectly” hexagonally packed, but “there is a level of regularity,” according to Bentley. As a result, cluster density increases, and clusters are more uniform, he said.
In a proof-of-principle experiment, Illumina scientists sequenced from a semi-ordered cluster array for 60 cycles and found that the sequencing performance improved. In particular, 97 percent of clusters passed quality criteria, the average error rate over the 60 cycles was 0.1 percent, and 96.4 percent of the reads were perfect.
Based on these results, Bentley projected that paired 125-base reads from semi-ordered arrays will yield 64 gigabases of data per run.
So far, the company has used 1-micrometer beads and its standard imaging pipeline to analyze the data from the semi-ordered arrays. In the future, with a more optimized analysis pipeline and sub-micrometer bead sizes, the number of reads will increase even further, Bentley predicted.
Read length will also go up, enabled by chemistry improvements such as a new deblock reagent and a new sequencing polymerase, and has the potential to reach 150 bases for paired reads, he said. Several 125-base paired-read runs at Illumina’s production facility have already exceeded 40 gigabases, Bentley said, adding that the company continues to optimize algorithms to analyze the longer reads.
The cycle time is getting shorter as a result of the new sequencing polymerase and “related protocol changes,” so the overall run time will stay “reasonable” despite the increase in read length.
Longer reads have also allowed the company to sequence 250-base fragments completely by reading them from both ends with overlapping 150-base and 125-base reads. The combined error rate in the overlap region decreases “significantly” from the relatively high error rates at the ends of the individual reads, Bentley pointed out.
Illumina researchers believe that they will be able to produce up to 300-base continuous reads in this manner, which Bentley said could improve the assembly of genomes.
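The overlap arithmetic is simple: a 150-base read and a 125-base read from opposite ends of a 250-base fragment share 150 + 125 - 250 = 25 bases. The sketch below shows one plausible way such a pair could be merged, keeping the higher-quality call at each overlapping position. It is not Illumina's method; it assumes the fragment length is known and that the second read comes from the opposite strand, and all function names are hypothetical.

```python
def overlap_length(fragment_len, read1_len, read2_len):
    """Number of fragment positions covered by both reads."""
    return max(0, read1_len + read2_len - fragment_len)


def reverse_complement(seq):
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def merge_pair(read1, qual1, read2, qual2, fragment_len):
    """Merge two end reads into one continuous read.

    read2 is given as sequenced (opposite strand), so it is
    reverse-complemented and its qualities reversed before merging.
    In the overlap, the base with the higher quality value wins.
    """
    r2 = reverse_complement(read2)
    q2 = qual2[::-1]
    ov = overlap_length(fragment_len, len(read1), len(r2))
    if ov == 0:
        raise ValueError("reads do not overlap; cannot merge")
    start1 = len(read1) - ov          # where the overlap begins in read1
    merged = list(read1[:start1])     # read1-only prefix
    for i in range(ov):
        b1, s1 = read1[start1 + i], qual1[start1 + i]
        b2, s2 = r2[i], q2[i]
        merged.append(b1 if s1 >= s2 else b2)
    merged.extend(r2[ov:])            # read2-only suffix
    return "".join(merged)


print(overlap_length(250, 150, 125))
```

Because each read's least reliable bases sit at its far end, the overlap pairs the weak tail of one read with better-supported calls from the other, which is the intuition behind the lower combined error rate Bentley described.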
With regard to sample barcoding, Bentley said that Illumina has already released 12-plex indexing barcodes and is currently evaluating a 96-plex system. “We clearly see the indexing as having a big impact on people who have many samples to run with limited targets, such as targeted resequencing on the GWAS regions,” he said.
Targeted resequencing, Bentley noted, “has been a challenge” but is now starting to be performed successfully by some users. Illumina, he said, is currently penning a co-marketing agreement with Agilent Technologies, under which Illumina will recommend using Agilent’s new SureSelect target-enrichment method and Agilent will promote the Genome Analyzer as a readout platform.
In terms of sample preparation, he said that Illumina is now supporting several new protocols that improve the process, including adapters for PCR-free sample prep, a gel-free preparation that allows users to automate the workflow, and template quality control via qPCR.
Like Illumina, ABI is working to increase the output of its SOLiD 3 system, which it will start shipping later this week.
“We think we can get to 100 [gigabases] by the end of the year,” said Kevin McKernan, ABI’s senior director of SOLiD scientific operations, during a company workshop at the AGBT conference last week. “And that’s not going to be the ceiling.”
The system is likely capable of churning out 150 gigabases per run eventually, McKernan said, adding that there is a “roadmap” to 250 gigabases per run and that a terabase per run might be possible.
SOLiD 3 is the third version of the instrument, which ABI originally launched in the fall of 2007, and will be considered by many as “the real production instrument,” McKernan predicted. He said that others have suggested SOLiD 1 was an alpha-version and SOLiD 2 a beta-instrument, although the company never said so.
Using paired 50-base reads, ABI has reached 40 gigabases of data per run internally, but it is working on increasing the yield through a combination of higher bead density, longer reads, bead tags, and new mapping software.
According to McKernan, a “progressive mapping tool” will allow users to extract 30 to 50 percent more data out of a run. The new scheme initially finds reads that map well, then removes these from the data, and “moves on to more aggressive mapping” for the remaining reads, he said.
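The idea of mapping in passes can be illustrated with a toy example. The sketch below is not ABI's tool, whose internals were not described; it assumes a naive ungapped scan against a short reference and a hypothetical schedule of increasing mismatch allowances, with reads that map in a strict pass removed before the next, looser pass.

```python
def best_hit(read, reference, max_mismatches):
    """Best ungapped placement within the mismatch limit, or None.

    Returns (position, mismatches); ties go to the leftmost position.
    """
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


def progressive_map(reads, reference, mismatch_schedule=(0, 1, 2)):
    """Map reads in passes, loosening the mismatch limit each pass.

    Returns (mapped, remaining): mapped is {read_index: (pos, mismatches)},
    remaining is the list of read indices that never mapped.
    """
    mapped = {}
    remaining = list(range(len(reads)))
    for limit in mismatch_schedule:
        still_unmapped = []
        for idx in remaining:
            hit = best_hit(reads[idx], reference, limit)
            if hit is None:
                still_unmapped.append(idx)
            else:
                mapped[idx] = hit
        remaining = still_unmapped  # only these go on to the looser pass
    return mapped, remaining


reference = "ACGTTGCAAGGCTTAACCGG"
reads = ["GCAAGG", "CTTAAG", "TTTTTT"]
mapped, remaining = progressive_map(reads, reference)
print(mapped, remaining)
```

In this toy version the passes differ only in mismatch tolerance; a production mapper would presumably also vary other settings, and the main benefit of the staged approach is that the expensive, permissive search runs only on the reads that the cheap, strict search could not place.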
The company has already increased the read length to 75 bases by modifying ligation conditions and is now “pushing this to 100 [bases],” he said, though reads of this length will “take more time” to optimize.
In order to increase the number of reads, ABI has started tagging the DNA-carrying beads with four different fluorescently labeled oligos instead of just one. These tags allow the imaging software to better distinguish different beads, thus allowing for more beads to be packed on the slide, “and that starts putting us into that 160 [gigabase per run] range,” McKernan said. In addition, these tags can be used as additional barcodes in multiplexed sequencing, he said.
To pack the beads even more densely, ABI, like Illumina, is considering using semi-ordered arrays that order the beads in one dimension. This improvement will allow the company to achieve 250 gigabases and 2.8 billion reads per paired-end run, McKernan projected.
To make all these improvements, ABI might install a “slightly different” slide holder on the SOLiD instrument, as well as new software, he said, but no other changes will be required.
The company is also working on a scheme that will allow it to sequence pieces of DNA from a fragment library from both ends, making use of the ability of DNA ligase to proceed in both directions. In “really early work,” company researchers have shown that they can sequence 100 bases in one direction and 25 bases in the opposite direction, McKernan said, adding that he is “really optimistic this number will go up.” He said that the researchers have seen a “pairing rate” of up to 99.5 percent in these experiments.
In terms of multiplexing, ABI is currently able to run up to 96 samples in parallel using barcodes, and expects to get up to 384 samples using the four-color beads.
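The jump from 96 to 384 is consistent with treating the bead tag as a second, independent label: 96 barcodes times four bead tags gives 384 distinct combinations. The sketch below simply enumerates those combinations; the tag names are placeholders, and the pairing scheme is an assumption rather than something ABI spelled out.

```python
from itertools import product


def sample_labels(num_barcodes=96, bead_tags=("tag1", "tag2", "tag3", "tag4")):
    """Enumerate every (barcode, bead tag) pair usable as a sample label."""
    return list(product(range(1, num_barcodes + 1), bead_tags))


labels = sample_labels()
print(len(labels))
```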
The company is also working on processes that will improve library construction, including so-called “express libraries” where 90 percent of the DNA shears to the right size range, so no size selection will be required.
For targeted resequencing, ABI has an ongoing collaboration with Agilent Technologies to use Agilent’s new SureSelect targeted enrichment technology, as well as with its sister division Invitrogen on using long-range PCR for the same purpose.
Finally, McKernan mentioned that as a result of the outcome of the recent patent lawsuit against Illumina (see In Sequence 2/3/2009), during which one patent was invalidated and ABI was found not to infringe Illumina’s IP, the company now has the freedom to incorporate single-base encoding probes into its sequencing scheme, in addition to the two-base encoding probes it is currently using.
Up until now, ABI has used two-base encoding “religiously” because “we really believe in the error correction” it provides, according to McKernan. However, combining one-base and two-base encoding “actually produces a much better product,” he said.
McKernan explained that in the absence of a reference sequence, it can be difficult to use two-base encoding data alone, requiring users to align reads against all other reads. However, one-base encoding data can help build a scaffold in “base space” that can serve as a conversion tool for the two-base encoding reads from “color space” into “base space.”
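The dependence on a known base can be seen in a small sketch of two-base encoding as it is commonly described for SOLiD: each color reports the transition between two adjacent bases, so a read in color space can be translated to bases only once one base is known. The mapping below (A=0, C=1, G=2, T=3, color = XOR of neighboring codes) is that commonly cited scheme, not something stated in McKernan's talk, and the function names are illustrative.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colors(seq):
    """Base space -> color space: one color per adjacent pair of bases."""
    return [BASE_TO_CODE[a] ^ BASE_TO_CODE[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Color space -> base space, anchored on a known first base."""
    bases = [first_base]
    current = BASE_TO_CODE[first_base]
    for color in colors:
        current ^= color              # each color gives the step to the next base
        bases.append(CODE_TO_BASE[current])
    return "".join(bases)


colors = encode_colors("ATGGCA")
print(colors)
print(decode_colors("A", colors))     # correct anchor recovers the sequence
print(decode_colors("C", colors))     # wrong anchor gives a different sequence
```

The same color string decodes to a different sequence for each choice of anchor base, which is why a scaffold already in base space would make the conversion tractable without a reference.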
Error correction could be improved further by using higher encoding schemes, such as three-, four-, or five-base encoding, he said. Five-base encoding, for example, would reduce the sequencing error rate to 1 in 2 million in theory, enough to detect rare mutations in cancer reliably.
Decoding data from five-base encoding would be complex, but “this becomes a lot easier to contemplate if you have a base-space scaffold,” according to McKernan. “If you have a single-base coded reference, you can work this puzzle out, it’s not nearly as challenging.”