Two next-generation sequencing platforms currently hitting the market, Illumina’s Genome Analyzer and Applied Biosystems’ SOLiD technology, generate so-called “microreads” of around 30 base pairs, a constraint that has so far excluded their use alone in de novo genome assembly and limited their application mostly to resequencing projects in which the short reads are aligned to a reference genome.
However, a number of bioinformatics projects are underway that indicate that de novo genome assembly using such very short reads — once thought impossible — could begin generating new bacterial genomes in the next few months.
While the latest version of 454 Life Sciences’ Genome Sequencer, the first next-gen system to hit the market, generates reads that are only up to 300 base pairs long, as opposed to reads reaching 800 base pairs from capillary electrophoresis sequencers, several projects have demonstrated that de novo assembly, while tricky, is possible with 454 reads using modified versions of conventional assembly algorithms.
Microreads, however, present a more difficult problem for these assemblers, which rely on long stretches of overlapping sequences to build out the contiguous regions, or contigs, that are the foundation of de novo genome assembly.
“With conventional assembly, you take two long fragments, and you’re looking for a long overlap,” said Steve Skiena of Stony Brook University, who is developing an algorithm for the de novo assembly of read lengths of around 20 to 30 base pairs. “But if you have reads that are this short, the statistical strength of any overlap is going to be limited. If you have two reads with a length of 20 bases, the most they can overlap is 19,” he said.
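Skiena's point about the limited statistical strength of short overlaps can be illustrated with a back-of-the-envelope calculation. The sketch below assumes uniformly random, independent bases (an idealization that real genomes do not satisfy) and ignores sequencing errors; the function names are hypothetical and are not taken from Shorty or any other assembler.

```python
def max_overlap(read_len: int) -> int:
    """Longest proper overlap between two reads of the same length."""
    return read_len - 1


def chance_overlap_probability(overlap_len: int) -> float:
    """Probability that a given suffix and a given prefix of length
    overlap_len match by chance, assuming uniform random bases."""
    return 0.25 ** overlap_len


def expected_spurious_overlaps(num_reads: int, overlap_len: int) -> float:
    """Expected number of ordered read pairs whose suffix/prefix of
    length overlap_len match purely by chance (same assumption)."""
    ordered_pairs = num_reads * (num_reads - 1)
    return ordered_pairs * chance_overlap_probability(overlap_len)


if __name__ == "__main__":
    # Compare a short overlap with the longest one two 20-base reads allow.
    for k in (10, max_overlap(20)):
        print(k, expected_spurious_overlaps(1_000_000, k))
```

Under these assumptions, the number of chance matches falls by a factor of four with every extra base of overlap, so capping the overlap at 19 bases leaves far less margin than the long overlaps available with capillary reads.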
Over the last six months, however, Skiena and several bioinformatics developers have begun looking into the problem more closely and have made significant progress in creating a new generation of assemblers that can stitch microreads into full-length genomes.
In addition to Stony Brook, projects to create new assembly algorithms for microreads are currently underway at Illumina, the Broad Institute, the British Columbia Cancer Center’s Genome Sciences Center, and the European Bioinformatics Institute.
These methods are now at a crucial point in their development because they are getting their first taste of real sequence data. The Illumina Genome Analyzer has just entered the market and ABI’s SOLiD is not yet available, so most of these algorithms were developed and tested on simulated data (microreads generated by chopping up known genomes into tiny chunks) and are only now moving into the proof-of-principle stage.
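As a rough illustration of what chopping a known genome into simulated microreads can look like, here is a minimal sketch. It assumes error-free reads sampled uniformly from the forward strand only; the function name and its parameters are hypothetical and do not describe any particular group's simulation pipeline.

```python
import random


def simulate_microreads(genome: str, read_len: int = 25,
                        coverage: float = 10.0, seed: int = 0) -> list[str]:
    """Sample fixed-length, error-free reads at uniformly random start
    positions until the requested average coverage is reached."""
    if read_len > len(genome):
        raise ValueError("read length exceeds genome length")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    num_reads = int(coverage * len(genome) / read_len)
    last_start = len(genome) - read_len
    reads = []
    for _ in range(num_reads):
        start = rng.randint(0, last_start)  # randint is inclusive at both ends
        reads.append(genome[start:start + read_len])
    return reads
```

Reads produced this way contain no sequencing errors, which is one reason simulated data can look easier to assemble than real data.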
Most of these developers “have been working with theoretical in silico data and we’re just starting to see some of them being applied to real data sets,” Francisco De La Vega, senior director of computational genetics at ABI, told In Sequence’s sister publication BioInform last week.
“This is really early work and most of these algorithms are untested — they’re theoretical,” he said. “They still have yet to be fine-tuned to the error profile of these reads, which is different than with [capillary electrophoresis].”
De La Vega is organizing a “birds of a feather” session at the Research in Computational Molecular Biology conference in San Francisco next week that will focus on emerging algorithms for next-generation sequencing applications. One of the goals of that session, he said, is for developers of short-read assembly approaches to discuss how they are testing these algorithms and what they are finding.
ABI is not planning on developing a de novo assembler on its own, De La Vega said. “Our position here in this marketplace is that we will deliver the core technology — the platform — and some basic tools, but I don’t think we’re going to get into the assembly business,” he said. “There are so many applications, and de novo genome assembly is just one of them, that it would be difficult for us to get resources to deal with all of them.”
However, he noted, the company is hoping to “create a community and nurture this community so that independent researchers in the bioinformatics area can start contributing to the field.”
Illumina, on the other hand, is working on a de novo assembler that is currently at the proof-of-principle stage, according to Tony Cox, principal scientist at the company’s UK computational biology group.
“Over the past few years our thinking has evolved from detecting just single base differences from a reference towards encompassing more complicated differences such as indels and copy number variants, and our X chromosome sequencing collaboration with the Sanger Institute and our new paired read protocol have both opened up many exciting possibilities in this regard,” Cox explained via e-mail.
While noting that “people tend to make an artificial distinction between resequencing (aligning to a reference) and pure de novo assembly (where you start from nothing),” Cox said that his team is nevertheless working on a “pure de novo assembly algorithm.”
In Theory …
Several bioinformatics developers began working on this problem while ABI’s and Solexa’s technologies were barely prototypes.
“We have been working for about two, three years on trying to do de novo assembly for short read technologies,” said Stony Brook’s Skiena, who is developing a microread assembly algorithm called Shorty.
“Up until very recently, a lot of it has been sort of hypothetical because people didn’t know how good the technologies were going to be, how long the read lengths were,” he said. “But now the technologies are getting mature enough and there’s a clear vision of what kind of data is going to be produced by these, or a much clearer vision than before.”
Currently, Skiena said, “We’re at a point where we’re starting to work with real and simulated data from companies as opposed to our fantasies about what their data was going to be like, given published specs.”
René Warren of the BC Genome Sciences Center said that his group is at a similar point. Warren co-authored an applications note that was published in Bioinformatics in December describing an algorithm called SSAKE (Short Sequence Assembly by progressive K-mer search and 3’ read Extension), which clusters and assembles 25-mers into longer contigs.
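The general idea of growing a contig by 3' extension can be sketched as a toy greedy routine. This is not SSAKE's actual implementation, which the article does not describe in detail. The sketch assumes error-free reads on a single strand, takes the first read that matches at the longest available overlap, and uses hypothetical names throughout.

```python
def extend_three_prime(seed: str, reads: list[str], min_overlap: int = 4) -> str:
    """Greedily extend `seed` in the 3' direction using reads whose
    prefix matches the current contig suffix, longest overlap first."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    contig = seed
    unused = list(reads)
    if seed in unused:
        unused.remove(seed)  # do not extend the seed with itself
    while unused:
        match = None
        longest = max(len(r) for r in unused) - 1
        # Progressively shorten the suffix until some read's prefix matches.
        for k in range(min(longest, len(contig)), min_overlap - 1, -1):
            suffix = contig[-k:]
            match = next((r for r in unused
                          if len(r) > k and r.startswith(suffix)), None)
            if match is not None:
                contig += match[k:]   # append only the unmatched 3' bases
                unused.remove(match)  # each read is used at most once
                break
        if match is None:
            break  # no read extends the contig any further
    return contig
```

A production assembler would also need to handle reverse complements, sequencing errors, and ambiguous extensions in repeats, none of which this sketch attempts.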
Warren said that he and his colleagues began developing the algorithm last fall before they had a next-gen sequencer, “and at that time, it was pretty much accepted that no one would be interested in doing de novo assemblies with that data because of the read length,” he said.
While the Bioinformatics paper was published based solely on simulated data, Warren said that his lab has since begun generating sequencing data from an Illumina Genome Analyzer and has just started running SSAKE on real data.
“Our expectations were not really high,” he acknowledged, but early results appear to be promising, he said.
In one example, using a human BAC that Illumina recommends as a resequencing control, the GSC group generated 490,000 25-mers for 70X coverage of the BAC. Warren used SSAKE to assemble those reads into 13,000 contigs with an average size of 44 bases.
Using only those contigs that were longer than 75 nucleotides — around 10 percent of the total — Warren found that they covered 98.4 percent of the BAC with 96-percent sequence identity.
“I was pretty happy to see this,” he said.
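The figures Warren cites can be cross-checked with simple arithmetic. The sketch below derives the BAC length implied by the read count and coverage, and the total number of bases in the contigs. Neither derived value is stated in the article, and both are approximate because the reported inputs are rounded. The helper names are hypothetical.

```python
def implied_target_length(num_reads: int, read_len: int, coverage: float) -> float:
    """Target length implied by coverage = num_reads * read_len / length."""
    return num_reads * read_len / coverage


def total_contig_bases(num_contigs: int, mean_contig_len: float) -> float:
    """Total bases across all contigs, from the count and the mean length."""
    return num_contigs * mean_contig_len


if __name__ == "__main__":
    bac_len = implied_target_length(490_000, 25, 70)  # figures from the article
    contig_bases = total_contig_bases(13_000, 44)     # figures from the article
    print(bac_len, contig_bases, contig_bases / bac_len)
```

With the reported numbers, the implied BAC length is about 175,000 bases, while the contigs total about 572,000 bases. That suggests many of the short contigs overlap one another or cover the same regions more than once.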
Other algorithms have also relied on simulated data until very recently. At the Advances in Genome Biology and Technology conference in February, Jonathan Butler of the Broad Institute presented early results from an algorithm called ALLPATHS that was developed to assemble read lengths of around 30 base pairs.
In his presentation, he discussed results based on simulated data from a reference genome that was “assigned an error pattern modeled after real [Illumina] reads,” according to the abstract for his talk. At the time he said that “we expect real data soon,” though he acknowledged that “this will present new challenges.”
ABI’s De La Vega said that the proof of the pudding for these new assembly algorithms will be in real data, which includes errors and biases that simulated data can’t account for.
“Definitely there are going to be more errors right now than what you see in CE, and that needs to be dealt [with] at the assembly level, too, because a short read with one or two errors suddenly becomes a lot more difficult to align and to assemble,” he said.
Even those synthetic data sets that do account for errors do a poor job of replicating real biological data, he said. “They are assuming random error models, and in reality there is always going to be some bias in the system.”
For example, he said, most sequencing platforms exhibit some bias across the length of a read, or bias related to GC content. “Those things I don’t think have been taken into account right now,” he noted, “but if the error profile is well understood, and those biases are understood, then you can compensate in the algorithm for that.”
As real data becomes available to test these algorithms, he said, “I think that then there is going to be the need to do some tweaking to adjust for the error profiles of these new technologies.”
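To make the contrast between a random error model and a biased one concrete, the sketch below injects substitution errors whose rate rises linearly along the read. The linear shape and the default rates are illustrative assumptions only, not measured error profiles of any platform, and the function names are hypothetical.

```python
import random

BASES = "ACGT"


def position_error_rate(i: int, read_len: int,
                        start_rate: float, end_rate: float) -> float:
    """Linearly interpolate the error rate from the first to the last base."""
    if read_len == 1:
        return start_rate
    return start_rate + (end_rate - start_rate) * i / (read_len - 1)


def add_biased_errors(read: str, start_rate: float = 0.001,
                      end_rate: float = 0.05, rng=None) -> str:
    """Substitute bases with a probability that depends on read position."""
    rng = rng or random.Random()
    out = []
    for i, base in enumerate(read):
        if rng.random() < position_error_rate(i, len(read), start_rate, end_rate):
            # Pick a different base so that every error is a real substitution.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)
```

Setting `start_rate` equal to `end_rate` recovers the uniform random model that De La Vega describes as unrealistic.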
One relatively recent development in the field of next-gen sequencing is the availability of paired-read data, which “is really a critical element in being able to make de novo assemblies from these platforms,” De La Vega said.
“If there are CE backbone scaffolds, maybe you can get away with not having the mate pairs, but if you want an assembly, I think all of these algorithms are assuming that mate pairs are going to become available,” he said.
Illumina’s Cox agreed. “Clearly, as with any assembly, the resolution of repeat regions is the tough bit, and most of the short read assembly algorithms I know rely on read pairing to help out with this,” he said, noting that Illumina’s new paired read data “opens up exciting new possibilities here.”
Stony Brook’s Skiena said that his Shorty algorithm relies on the availability of mate pairs. “The fact that you’ve got two reads that are a certain distance apart from each other, where there is some expected distance and some variance there — this turns out to be a powerful thing to help you in assembly, and it turns out to be much more important in assembling short reads than in long reads,” he said.
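One way to picture how an expected distance and its variance can help is as a filter on candidate read placements. The sketch below is a simplified illustration, not Shorty's method: it assumes positions on a single linear contig, a known standard deviation, and a hypothetical cutoff of three standard deviations.

```python
def pair_is_consistent(pos_a: int, pos_b: int, expected_distance: float,
                       std_dev: float, max_sigmas: float = 3.0) -> bool:
    """True if the observed separation is within max_sigmas standard
    deviations of the expected mate-pair distance."""
    observed = abs(pos_b - pos_a)
    return abs(observed - expected_distance) <= max_sigmas * std_dev


def filter_mate_placements(anchor_pos: int, candidate_positions: list[int],
                           expected_distance: float, std_dev: float,
                           max_sigmas: float = 3.0) -> list[int]:
    """Keep only the candidate positions for a read that agree with
    where its already-placed mate (the anchor) sits."""
    return [p for p in candidate_positions
            if pair_is_consistent(anchor_pos, p, expected_distance,
                                  std_dev, max_sigmas)]
```

For a short read that matches several copies of a repeat, a filter like this can rule out placements that the read's own sequence cannot distinguish.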
SSAKE doesn’t use mate pair data, but Warren noted that the algorithm will primarily be used to characterize unknown genomes within metagenomics data sets via Blast searches and gene-prediction tools, rather than de novo assembly of individual genomes.
For pure de novo genome assembly using microreads, he noted, paired-end data will likely be necessary. “Until we have a better feel for sequence or base quality with these short reads, and the read length becomes a bit bigger, and we have some information to put these contigs in the context of the genome — pairing information — it’s not going to be trivial, and people will have a problem in assembling very contiguous sequences,” he said.
Despite advances in the field, “I haven’t seen yet a de novo assembly completely out of short reads,” De La Vega said. “I expect that’s because the data sets are just now being generated, and [the fact] that for short reads, mate pairs are necessary.”
However, he added, the availability of such an assembly may be only a few months away. “I’m confident that, based on the people I know and the data that’s been generated, that by the summer we’re going to start looking at some microbial genomes assembled de novo from short reads,” he said.
Skiena said even though the quality of the data from next-generation sequencers is “still a moving target,” he’s confident that de novo assembly is possible with microreads.
“Do I believe you can do de novo assembly of bacteria to an interesting state using reads of length 20? The answer is yes. Exactly what ‘interesting’ means is a separate question, but I am convinced that is doable,” he said.
“The question about higher organisms is maybe a little bit more open, but I still believe that if you work hard enough at the assembly and have high enough coverage, I think you could produce something interesting.”