A team from the US and Singapore has come up with a method for nearly doubling Illumina read lengths using overlapping libraries, producing reads that approach the 500 base pair mark.
The strategy relies on the production of "contiguous" reads from two shorter sequences that overlap by a few dozen bases, in some experiments, or by just a handful of bases in the case of the most well-characterized samples, explained Stephan Schuster, a researcher affiliated with Pennsylvania State University and Singapore's Nanyang Technological University.
"With samples that we know work really well, we go for 480-base pair [reads]," he told In Sequence. "And … we can do this within 10 base pairs."
The approach is proving useful not only for sequencing well-defined samples, but also for defining taxa and assembling genomes in metagenomic samples taken from environmental settings.
At the recent Advances in Genome Biology and Technology meeting in Marco Island, Fla., for instance, Schuster presented findings from a metagenomic study of wastewater samples that were sequenced using the Illumina contiguous read method. In the four samples described there, the contiguous Illumina reads that researchers generated averaged between about 400 and 430 base pairs per sample.
He and his colleagues plan to publish additional details on the sequencing strategy and related results in the not-too-distant future.
Though the work presented at AGBT centered on Illumina's HiSeq 2500 platform, the group got the first inklings that such an approach might be possible while working with the company's lower-capacity MiSeq instrument.
Working with the MiSeq, "it soon became clear that the data quality was high after the last base" in each read, Schuster noted, prompting speculation that it might be possible to stitch together reads from overlapping libraries.
By putting together two 250-base-pair reads that overlap by 20 bases, for instance, the researchers reasoned that they should be able to get a contiguous read of 480 base pairs.
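The arithmetic behind that estimate, and the stitching step itself, can be illustrated with a minimal Python sketch. It is not the group's in-house pipeline, which has not yet been described in detail. The sketch assumes paired-end reads, so the second read is reverse-complemented before merging. The function names, the 10-base minimum overlap, and the 10 percent mismatch allowance are all assumptions chosen for illustration, and a production read merger would also weigh per-base quality scores.

```python
from typing import Optional

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(
    fwd: str,
    rev: str,
    min_overlap: int = 10,           # assumed threshold, not from the article
    max_mismatch_frac: float = 0.1,  # assumed tolerance, not from the article
) -> Optional[str]:
    """Merge a forward read with its mate into one contiguous read.

    The mate is reverse-complemented so both reads are on the same strand.
    Candidate overlaps are tried from longest to shortest; the first one
    whose mismatch fraction is within tolerance is accepted. Bases in the
    overlap are taken from the forward read (no quality scores are used).
    Returns None when no acceptable overlap is found.
    """
    rev_rc = reverse_complement(rev)
    longest = min(len(fwd), len(rev_rc))
    for overlap in range(longest, min_overlap - 1, -1):
        tail = fwd[-overlap:]
        head = rev_rc[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * overlap:
            return fwd + rev_rc[overlap:]
    return None
```

Under these assumptions, two 250-base reads drawn from opposite ends of a 480-base insert share 20 bases and merge into a single 480-base sequence (250 + 250 - 20 = 480).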
Generating the contiguous Illumina reads involves making tweaks at almost every stage of the sequencing process — from sample preparation through data analysis, starting with careful library construction.
"The most important part that needs to be different is the way you design and prepare the libraries," Schuster explained. "And the key element to that is the sizing."
Using Sage Science's automated gel electrophoresis platform, known as Pippin Prep (IS 10/26/2010), the researchers nab strategically sized inserts, siphoning these stretches of DNA into a collection chamber that can be tapped for the subsequent library preparation steps.
"For the libraries where we combine the two reads in the end, it's very important that the insert size of the library must be very precise," Daniela Drautz, the genomics center at Nanyang Technological University in Singapore, told IS.
But while precise size selection is a key component of the library construction process, it is also rate limiting at the moment, explained Drautz, who constructed the overlapping libraries used so far. That's because this insert selection is typically done on no more than four samples simultaneously.
"The Pippin Prep, which we use for size selection, only allows you to run a maximum of four samples," Drautz said.
"And the size selection takes roughly two hours," she explained. "So if you want to process many samples at once, then the Pippin Prep is definitely the bottleneck."
In an effort to overcome that, the group has started to pool some samples in equal volumes prior to the size selection step. After size selection and other library prep steps, these pooled samples can then be sequenced as a group on a single sequencer lane.
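To see why size selection limits throughput, the figures Drautz cites (four samples per run, roughly two hours per run) can be turned into a rough estimate. The Python sketch below is illustrative only. It assumes runs happen back to back on a single instrument, and the `pool_size` parameter is a hypothetical stand-in for the number of samples combined before size selection, a figure the group did not specify.

```python
import math


def size_selection_hours(
    n_samples: int,
    samples_per_run: int = 4,    # per-run maximum cited by Drautz
    hours_per_run: float = 2.0,  # "roughly two hours" per run; approximate
    pool_size: int = 1,          # hypothetical: samples pooled per input
) -> float:
    """Estimate total size-selection time for a batch of samples.

    Samples are first grouped into pools of `pool_size`; each pool takes
    one slot in a run. Runs are assumed to be sequential on one instrument.
    """
    if n_samples < 0 or samples_per_run < 1 or pool_size < 1:
        raise ValueError("counts must be non-negative and sizes at least 1")
    pools = math.ceil(n_samples / pool_size)
    runs = math.ceil(pools / samples_per_run)
    return runs * hours_per_run
```

With these assumed numbers, 24 individually processed samples would need six runs, or about 12 hours of size selection, while pooling them four at a time would cut that to two runs, or about four hours.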
So far the approach has proven feasible for sequencing both genetically homogeneous samples and metagenomic samples made up of sequences from many organisms, though somewhat longer overlaps are needed to seal together reads from the most complex samples.
"With the best samples — where we have also done repeated libraries to optimize the overlap — we can get an average read length after sequencing of 460 base pairs with 94 percent of the data making [overlapping] pairs," Schuster said.
"The metagenomic samples are a much bigger challenge," he explained. "This is where we see a somewhat wider [overlap] size range."
Part of the problem with these multi-species mixtures is the wide ranges of guanine and cytosine nucleotide levels that they can contain. An apparent consequence of that GC variability is that the metagenome libraries "do not get the peak performance that we would see, for example, on a human sample," Schuster said.
"This is something that you can counteract with higher coverage," he added, explaining that Illumina's latest HiSeq 2500 chemistry "muscles through" in situations where such added coverage is advantageous.
For the wastewater metagenomics work presented at AGBT, for example, Schuster and his colleagues generated 366 million read pairs — 92 billion bases of sequence data — on libraries from four environmental samples using two HiSeq 2500 lanes.
Those reads had an average read error of 0.6 percent, Schuster reported at AGBT, and often stretched out past the 400 base pair mark.
Error rates appear to be somewhat higher in sequences where two reads are stitched together to form the contiguous Illumina read, though the error rate did not seem to exceed around 1 percent, even in those relatively tricky regions.
In each of the four metagenomic samples, around 80 percent to 88 percent of the sequence data contributed to a contiguous read, with reads in each of the samples averaging between 400 bases and 430 bases apiece.
While he conceded that a couple hundred extra bases per read may seem like a modest improvement for some, Schuster argued that these extended reads can confer a big advantage for some experiments.
For instance, a 250 base pair jump in read lengths "allows us to really take complex metagenomes and sequence them to saturation with long reads," according to Schuster.
"This old mantra that you will never be able to sequence all of the molecules in the mixture of a metagenome — I think we are challenging that view," he said.
In the most complex metagenomic samples that they've evaluated so far, for example, the researchers have been getting up to around 80 percent coverage of the genomes of the most frequent organisms in that sample, which typically contribute some 3 percent of the overall sequence data.
And the two lanes of HiSeq 2500 data appear to have been enough to see about 95 percent of all molecules in the metagenomic mixtures tested at least twice, according to Schuster.
Moreover, he explained that the sets of taxa identified in wastewater samples sequenced with the Illumina contiguous read method were similar to those found with longer but pricier Roche 454 reads — which Schuster called a "huge accomplishment" for Illumina.
"If it is for metagenomic analysis, where you have many individual reads and you want to taxonomically identify where they might be coming from, I think that the HiSeq has made major inroads toward being similarly useful [compared to the 454]," he said.
Such similarities are even more impressive given the difference in price for the data. In the case of the wastewater samples described at AGBT, Schuster and his team estimated that the information gleaned from around $5,000 worth of contiguous HiSeq 2500 reads would require an investment of around $770,000 on the Roche 454 XL+ platform.
The reagent costs for constructing the overlapping libraries come in at an estimated $175 per sample, Drautz noted, though that figure excludes the time and labor the method requires.
Given that price difference, Schuster argued, "you might be inclined to forgive Illumina if they miss one or another taxon" in a metagenomic sample.
Still, he emphasized that his team sees an added benefit in having the long 454 reads when putting together new genome assemblies, since the Illumina and Roche 454 platforms have distinct error models and because 454 reads are apt to cover parts of the genome assembly missed using Illumina reads alone.
"The two methods, despite the very different cost structure, are perfectly complementing one another," Schuster said.
In particular, he noted that longer, contiguous Illumina reads show promise for complementing Roche 454 reads in genomes assembled with the Roche 454 Newbler assembler.
"Our contiguous Illumina reads are reads that are very much what Newbler wants to see," Schuster said. "And now we're working with 454 on new versions of Newbler that are perfect for combining the 454 data and the Illumina data in a single assembly step."
For more general analyses of the contiguous read data, he and his colleagues are currently using a bioinformatics pipeline developed in house, though Schuster said he's keen to see whether Illumina would be interested in tacking on a step for dealing with contiguous read data as part of its standard output pipeline.
He is also optimistic that the availability of a variety of longer read methods will spur interest in the development of new tools for analyzing metagenomic data in general since it's currently "a wide open field for coming up with assembly tools that would particularly assess the needs of metagenomes."
In addition to metagenomic sequencing and genome assembly, Schuster noted that the contiguous Illumina reads may hold promise for future studies of host-adapted genomes, an area in which his group has been interested for some time.
The ability to generate longer Illumina reads in that context is expected to be especially beneficial for aligning and analyzing reads from host-adapted organisms assessed through ultra-deep sequencing on host samples, Schuster noted.
Finally, beyond applications for the overlapping library-based method as it stands today, Schuster sees potential for combining the scheme with something like the long DNA fragment-based method that Moleculo has developed for extending Illumina reads (IS 01/15/2013).
For instance, he argued that the availability of longer contiguous reads might be beneficial for putting pieces of each 10 kilobase fragment together using Moleculo's approach, particularly when fragments contain repeats or other tricky-to-assemble sequences.
Schuster and his colleagues have submitted a paper outlining the use of the Illumina long read method for characterizing metagenomic samples. They also plan to publish more details on the library construction methodology used to create the overlapping libraries needed to get the longer-than-usual Illumina reads.
"We are in the process of writing up our initial findings and we will also describe how we're currently doing those libraries," he said.