By Monica Heger
Researchers at the University of Washington have devised a way to improve the accuracy of short-read sequencing on the Illumina Genome Analyzer that could extend the use of short-read methods into applications that have so far been limited to long-read platforms.
Dubbed subassembly, the technique uses tag sequences to recognize groups of short reads that are derived from the same larger fragment during library construction. The method could be particularly useful for de novo genome assembly and metagenome sequencing, the researchers report in a paper describing the approach that was published this week online in Nature Methods.
The subassembly method "basically allows one to use an Illumina like a 454, in the sense that you can get the equivalent of long reads, despite the fact that Illumina is a short read sequencer," Jay Shendure, assistant professor of genome sciences at the University of Washington and senior author of the paper, told In Sequence.
"Despite progress in algorithms to deal with short reads, there are still a number of applications where long reads, and particularly accurate long reads, continue to be important," he noted.
Shendure and his colleagues tested their subassembly technique on the pathogen Pseudomonas aeruginosa. First, they created a library of 500-base-pair fragments, added tag adapters to the fragments, and diluted and amplified them. They then shattered those fragments, creating a sublibrary, and added different adapters. They sequenced each of the shorter fragments and then, because each one carried a tag, were able to reassemble them into the original 500-base-pair fragments.
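The grouping-and-rejoining step described above can be illustrated with a toy sketch. This is not the authors' pipeline: the tag labels, the helper names (`group_by_tag`, `overlap`, `subassemble`), the minimum overlap of four bases, and the greedy merging strategy are all assumptions chosen for illustration, and the example ignores sequencing errors and reverse-complement reads.

```python
from collections import defaultdict

def group_by_tag(tagged_reads):
    """Collect short reads that share a tag, i.e. that came from the same parent fragment."""
    groups = defaultdict(list)
    for tag, read in tagged_reads:
        groups[tag].append(read)
    return dict(groups)

def overlap(a, b, min_len):
    """Length of the longest suffix of a that is also a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def subassemble(reads, min_len=4):
    """Greedily merge the two sequences with the largest overlap until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join; return the remaining pieces as separate contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)] + [merged]
    return contigs

# Hypothetical input: (tag, short read) pairs from two parent fragments, interleaved.
reads = [
    ("tagA", "ACGTTGCA"), ("tagB", "TTTTCCCC"),
    ("tagA", "TGCAATGC"), ("tagB", "CCCCGGGG"),
    ("tagA", "ATGCCGTA"), ("tagB", "GGGGAAAA"),
]
for tag, group in group_by_tag(reads).items():
    print(tag, subassemble(group))
```

Because assembly is run separately within each tag group, reads from different parent fragments never compete for the same overlaps, which is the sense in which the subassembly is "highly localized." In this toy input, each tag's three reads collapse back into a single 16-base sequence.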
"We're creating a set of nested sublibraries where one can, after the fact, link short sequences as having been derived from the same slightly longer fragments, and then reconstruct these subassembled reads through highly localized subassembly," Shendure said.
Other groups have tried combining the Illumina and 454 platforms to take advantage of 454's longer read lengths, including the consortium that sequenced the turkey genome (see In Sequence 12/1/2009), and a University of North Carolina team that sequenced a rice pathogen. Shendure said his technique yielded results comparable to those of the combined approaches, but has the advantage that it can all be done on one platform. While Shendure tested it only on the Illumina platform, he said that it could potentially be used with other short-read sequencers such as Applied Biosystems' SOLiD system.
Shendure and his colleagues reported that the method covered 98.85 percent of the reference genome at an average coverage of 63-fold. To test the method's utility in de novo assembly, they used a shotgun assembly approach, generating an N50 contig size of 15 kilobases and an N50 scaffold size of 440 kilobases. The substitution error rate was about one in 14,000, and there were seven misassemblies.
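For readers unfamiliar with the N50 statistic cited above, the following is a minimal sketch of how it is commonly computed: sort the contig (or scaffold) lengths from longest to shortest and report the length at which the running total first reaches half of all assembled bases. The function name `n50` and the sample lengths are illustrative assumptions, not data from the paper.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("N50 is undefined for an empty collection")

# Hypothetical contig lengths, in base pairs.
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))
```

A higher N50 means that more of the assembly sits in long pieces, which is why the statistic is used to compare contiguity between assemblies of similar total size.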
The researchers also tested subassembly on a previously characterized metagenomic sample of methylamine-fixing microbes from lake sediment. They sequenced the sample on the Illumina platform using both their technique and a paired-end approach, and compared the two approaches to each other and to the previously reported Sanger sequence data.
Compared with the paired-end approach, subassembly yielded more total sequence data in longer contigs. Compared with Sanger sequencing, subassembly generated slightly more total sequence data, 39.5 megabases as opposed to 37.2 megabases, but in shorter contigs: the N50 was 390 base pairs for subassembly versus 835 base pairs for Sanger. Subassembly is nonetheless easier and less expensive than Sanger sequencing, the authors reported, requiring only three Illumina lanes as opposed to hundreds of Sanger sequencing runs.
"It's clever," said Corbin Jones, assistant professor of evolutionary genetics and genomics at the University of North Carolina, who was part of the team that sequenced the rice pathogen with Illumina and 454. "It's addressing a real problem — the problem of how do you take the short-read technologies and make them adaptable to something that's more complex than a highly compact microbial genome," he said.
Jones added that he thought the accuracy of the method was good, but noted that it may be more of a transition technology — useful until sequencing vendors can affordably offer longer read lengths. However, he said it could be useful for specific applications, like characterizing the microbiome, or for metagenomic analyses. "It does allow you to do a better job of capturing a sample and getting a unique sequence out of it," he said.
Jones would also like to see the technique scaled up, so the original subassembly fragment lengths are between 1 kilobase and 5 kilobases long, as opposed to 500 base pairs. Shendure agreed and said that his next step is to do exactly that. He thinks the technique can easily be scaled up to get subassembly reads of 1 kilobase, and will require a bit more work to reach 5 kilobases.
Shendure said he is excited to start using the method on sequencing projects. In particular, he wants to use it to profile the immune repertoire as well as large regulatory elements such as promoters. It will be especially useful for these projects, he said, because the segment of interest on T cell and B cell receptors, for example, where high-quality contiguous sequence is needed, is longer than Illumina's current read lengths.
In the past, "every time there was something we wanted to do that required a longer read length, we were a little frustrated," Shendure added.