A number of bioinformatics projects underway indicate that de novo genome assembly using extremely short reads from next-generation instruments — once thought impossible — could begin generating new bacterial genomes in the next few months.
While the various platforms in the emerging next-generation sequencing market each rely on different technologies to translate DNA molecules into a machine-readable format, they all have one thing in common: read lengths that are far shorter than those used in traditional Sanger sequencing.
These short reads, which range from 25 base pairs to around 250 base pairs, compared with an average of about 750 base pairs for capillary electrophoresis, have presented an obstacle for conventional assembly algorithms, which rely on long stretches of overlapping sequence to build out the contiguous regions, or contigs, that are the foundation of de novo genome assembly.
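The underlying difficulty is that overlap-based assemblers need long, unambiguous suffix-prefix overlaps between reads. As a purely illustrative sketch (a toy greedy overlap-merge, not any production assembler), the following shows the basic operation of building a contig from overlapping reads; with 25-base-pair reads, the overlaps available for this kind of merging become short and highly ambiguous:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is a prefix of `b`
    (at least min_len bases), or 0 if there is none."""
    start = 0
    while True:
        start = a.find(b[:min_len], start)   # candidate match position
        if start == -1:
            return 0
        if b.startswith(a[start:]):          # suffix of a == prefix of b
            return len(a) - start
        start += 1

def assemble(reads, min_len=3):
    """Greedily merge the pair of reads with the largest overlap
    until no pair overlaps by at least min_len bases."""
    reads = reads[:]
    while len(reads) > 1:
        best_o, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    o = overlap(a, b, min_len)
                    if o > best_o:
                        best_o, best_i, best_j = o, i, j
        if best_o == 0:
            break
        merged = reads[best_i] + reads[best_j][best_o:]
        reads = [r for k, r in enumerate(reads)
                 if k not in (best_i, best_j)] + [merged]
    return reads

# Three overlapping 8-mers collapse into one contig:
print(assemble(["AGCTTAGC", "TAGCCATG", "CATGGCAT"]))
```

The toy example works because each 4-base overlap occurs only once; in a real genome, a 25-base read may match many locations, which is why microread assemblers abandon simple pairwise overlap merging.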
This challenge is particularly acute for platforms that generate so-called “microreads” of around 30 base pairs, such as Illumina’s (formerly Solexa’s) Genome Analyzer or Applied Biosystems’ SOLiD technology.
Until very recently, the idea of de novo assembly with microreads was considered impossible, and these platforms were considered better suited for resequencing projects in which the short reads are aligned to a reference genome.
Over the last six months, however, several bioinformatics developers have begun looking into the problem a bit more closely and have made significant progress in creating a new generation of assemblers that can stitch microreads into full-length genomes.
Projects are currently underway at Illumina, the Broad Institute, the British Columbia Cancer Agency’s Genome Sciences Center, Stony Brook University, and elsewhere to create new assembly algorithms for microreads.
These methods are now at a crucial point in their development because they are getting their first taste of real sequence data. The Illumina Genome Analyzer has just entered the market and ABI’s SOLiD is not yet available, so most of these algorithms were developed and tested on simulated data — microreads generated by chopping up known genomes into tiny chunks — and are only now moving into the proof-of-principle stage.
Most of these developers “have been working with theoretical in silico data and we’re just starting to see some of them being applied to real data sets,” Francisco De La Vega, senior director of computational genetics at ABI, told BioInform this week.
“This is really early work and most of these algorithms are untested — they’re theoretical,” he said. “They still have yet to be fine-tuned to the error profile of these reads, which is different than with [capillary electrophoresis].”
De La Vega is organizing a “birds of a feather” session at the upcoming Research in Computational Molecular Biology conference in San Francisco that will focus on emerging algorithms for next-generation sequencing applications. One of the goals of that session, he said, is for developers of short-read assembly approaches to discuss how they are testing these algorithms and what they are finding.
ABI is not planning on developing a de novo assembler on its own, De La Vega said. “Our position here in this marketplace is that we will deliver the core technology — the platform — and some basic tools, but I don’t think we’re going to get into the assembly business,” he said. “There are so many applications, and de novo genome assembly is just one of them, that it would be difficult for us to get resources to deal with all of them.”
However, he noted, the company is hoping to “create a community and nurture this community so that independent researchers in the bioinformatics area can start contributing to the field.”
Illumina, on the other hand, is working on a de novo assembler that is currently at the proof-of-principle stage, according to Tony Cox, principal scientist at the company’s UK computational biology group.
“Over the past few years our thinking has evolved from detecting just single base differences from a reference towards encompassing more complicated differences such as indels and copy number variants, and our X chromosome sequencing collaboration with the Sanger Institute and our new paired read protocol have both opened up many exciting possibilities in this regard,” Cox explained via e-mail.
While noting that “people tend to make an artificial distinction between resequencing (aligning to a reference) and pure de novo assembly (where you start from nothing),” Cox said that his team is nevertheless working on a “pure de novo assembly algorithm.”
In Theory …
Several bioinformatics developers began working on this problem while ABI’s and Solexa’s technologies were barely prototypes.
“We have been working for about two, three years on trying to do de novo assembly for short read technologies,” said Steve Skiena of Stony Brook University, who has developed an algorithm called Shorty for the de novo assembly of read lengths of around 20 to 30 base pairs.
“Up until very recently, a lot of it has been sort of hypothetical because people didn’t know how good the technologies were going to be, how long the read lengths were,” he said. “But now the technologies are getting mature enough and there’s a clear vision of what kind of data is going to be produced by these — or a much clearer [vision] than before.”
Currently, Skiena said, “We’re at a point where we’re starting to work with real and simulated data from companies as opposed to our fantasies about what their data was going to be like, given published specs.”
René Warren of the BC Genome Sciences Center said that his group is at a similar point. Warren co-authored an applications note that was published in Bioinformatics in December describing an algorithm called SSAKE (Short Sequence Assembly by progressive K-mer search and 3’ read Extension), which clusters and assembles 25-mers into longer contigs.
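The core idea behind SSAKE’s published description — seeding a contig with a read and greedily extending its 3’ end using reads whose prefixes match the contig’s terminal k-mer — can be sketched minimally. This is an illustration only, not the actual SSAKE code: the real program handles read counts, mismatches, and ambiguous extensions, while this sketch assumes error-free reads and a single unambiguous extension at each step:

```python
from collections import defaultdict

def extend_3prime(seed, reads, k=11):
    """Greedy 3' extension in the spirit of SSAKE's description:
    repeatedly look up an unused read whose first k bases match the
    contig's last k bases, and append the read's 3' overhang."""
    index = defaultdict(list)        # k-mer prefix -> indices of reads
    for i, r in enumerate(reads):
        index[r[:k]].append(i)
    contig, used = seed, set()
    extended = True
    while extended:
        extended = False
        for i in index[contig[-k:]]:
            if i not in used and len(reads[i]) > k:
                contig += reads[i][k:]   # append the unmatched overhang
                used.add(i)
                extended = True
                break                    # restart from the new contig end
    return contig

# Three tiled 8-mers (k=4) rebuild a 16-base reference:
print(extend_3prime("ACGGTTCA",
                    ["ACGGTTCA", "TTCAGGAC", "GGACCTAT"], k=4))
```

In practice, deep coverage (such as the 70X described below) is what makes this kind of extension viable at 25 base pairs, since many reads must tile each position for the terminal k-mer lookup to keep finding a match.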
Warren said that he and his colleagues began developing the algorithm last fall before they had a next-gen sequencer, “and at that time, it was pretty much accepted that no one would be interested in doing de novo assemblies with that data because of the read length,” he said.
While the Bioinformatics paper was published based solely on simulated data, Warren said that his lab has since begun generating sequencing data from an Illumina Genome Analyzer and has just started running SSAKE on real data.
“Our expectations were not really high,” he acknowledged, but early results appear to be promising, he said.
In one example, using a human BAC that Illumina recommends as a resequencing control, the GSC group generated 490,000 25-mers for 70X coverage of the BAC. Warren used SSAKE to assemble those reads into 13,000 contigs with an average size of 44 bases.
Using only those contigs that were longer than 75 nucleotides — around 10 percent of the total — Warren found that they covered 98.4 percent of the BAC with 96-percent sequence identity.
“I was pretty happy to see this,” he said.
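Those figures are internally consistent: 490,000 25-mers at 70X depth imply a BAC of roughly 175 kilobases. A quick check, using only the numbers reported in the article:

```python
n_reads, read_len, coverage = 490_000, 25, 70

# Implied BAC length: total sequenced bases divided by coverage depth.
bac_size = n_reads * read_len // coverage
print(bac_size)                  # 175000 bases, i.e. a ~175-kb BAC

# Total bases across the 13,000 contigs of mean size 44:
n_contigs, mean_contig = 13_000, 44
contig_bases = n_contigs * mean_contig
print(contig_bases)              # 572000 bases, ~3.3x the BAC length,
                                 # reflecting many redundant, overlapping contigs
```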
Other algorithms have also relied on simulated data until very recently. At the Advances in Genome Biology and Technology conference in February, Jonathan Butler of the Broad Institute presented early results from an algorithm called ALLPATHS that was developed to assemble read lengths of around 30 base pairs.
In his presentation, he discussed results based on simulated data from a reference genome that was “assigned an error pattern modeled after real [Illumina] reads,” according to the abstract for his talk. At the time he said that he “expect[ed] real data soon,” though he acknowledged that “this will present new challenges.”
ABI’s De La Vega said that the proof of the pudding for these new assembly algorithms will be in real data, which includes errors and biases that simulated data can’t account for.
“Definitely there are going to be more errors right now than what you see in CE, and that needs to be dealt [with] at the assembly level, too, because a short read with one or two errors suddenly becomes a lot more difficult to align and to assemble,” he said.
Even those synthetic data sets that do account for errors do a poor job of replicating real biological data, he said. “They are assuming random error models, and in reality there is always going to be some bias in the system.”
For example, he said, most sequencing platforms exhibit some bias across the length of a read, or bias related to GC content. “Those things I don’t think have been taken into account right now,” he noted, “but if the error profile is well understood, and those biases are understood, then you can compensate in the algorithm for that.”
As real data becomes available to test these algorithms, he said, “I think that then there is going to be the need to do some tweaking to adjust for the error profiles of these new technologies.”
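The distinction De La Vega draws — a uniform random error model versus a position-dependent one — can be made concrete with a toy read simulator. The error rates below are illustrative assumptions, not measured figures for any platform; the point is simply that the per-base error probability grows toward the 3’ end of the read rather than staying flat:

```python
import random

def simulate_read(ref, start, length=25, base_err=0.005, slope=0.002):
    """Toy simulator with a position-dependent error model: the chance
    of a miscalled base rises linearly toward the 3' end of the read,
    a bias often seen in sequencing-by-synthesis data.
    Rates here are illustrative, not platform-measured."""
    bases = "ACGT"
    out = []
    for i in range(length):
        b = ref[start + i]
        if random.random() < base_err + slope * i:   # error rate grows with i
            b = random.choice([x for x in bases if x != b])
        out.append(b)
    return "".join(out)

random.seed(0)
ref = "ACGT" * 100
print(simulate_read(ref, 0))   # a 25-mer with occasional 3'-weighted errors
```

An assembler tuned on reads drawn with `slope=0` (uniform errors) would underestimate how often the final bases of a 25-mer disagree with the genome, which is exactly the kind of tweaking De La Vega anticipates.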
One relatively recent development in the field of next-gen sequencing is the availability of paired-read data, which “is really a critical element in being able to make de novo assemblies from these platforms,” De La Vega said.
“If there are CE backbone scaffolds maybe you can get away with not having the mate pairs, but if you want an assembly, I think all of these algorithms are assuming that mate pairs are going to become available,” he said.
Illumina’s Cox agreed. “Clearly, as with any assembly, the resolution of repeat regions is the tough bit, and most of the short read assembly algorithms I know rely on read pairing to help out with this,” he said, noting that the availability of Illumina’s paired read data “opens up exciting new possibilities here.”
Stony Brook’s Skiena said that his Shorty algorithm relies on the availability of mate pairs. “The fact that you’ve got two reads that are a certain distance apart from each other, where there is some expected distance and some variance there — this turns out to be a powerful thing to help you in assembly, and it turns out to be much more important in assembling short reads than in long reads,” he said.
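The constraint Skiena describes can be stated simply: a mate pair supports a candidate layout only if the distance implied by the two read placements is consistent with the library’s insert-size distribution. A hypothetical check — the insert size, standard deviation, and tolerance below are made-up parameters for illustration:

```python
def pair_supports(pos_fwd, pos_rev, insert_mean=2000, insert_sd=100, n_sd=3):
    """True if placing the two mates at these positions implies a
    separation within n_sd standard deviations of the expected
    insert size (all parameters are illustrative)."""
    return abs((pos_rev - pos_fwd) - insert_mean) <= n_sd * insert_sd

print(pair_supports(100, 2150))   # True: implied insert of 2050 is within tolerance
print(pair_supports(100, 5000))   # False: implied insert of 4900 is far outside it
```

Applied across millions of pairs, this kind of distance constraint lets an assembler reject placements that short-read overlaps alone cannot distinguish, which is why mate pairs matter more for 25-mers than for 750-base Sanger reads.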
SSAKE doesn’t use mate pair data, but Warren noted that the algorithm will primarily be used to characterize unknown genomes within metagenomics data sets via Blast searches and gene-prediction tools, rather than de novo assembly of individual genomes.
For pure de novo genome assembly using microreads, he noted, paired-end data will likely be necessary. “Until we have a better feel for sequence or base quality with these short reads, and the read length becomes a bit bigger, and we have some information to put these contigs in the context of the genome — pairing information — it’s not going to be trivial, and people will have a problem in assembling very contiguous sequences,” he said.
Despite advances in the field, “I haven’t seen yet a de novo assembly completely out of short reads,” De La Vega said. “I expect that’s because the data sets are just now being generated, and [the fact] that for short reads, mate pairs are necessary.”
However, he added, the availability of such an assembly may be only a few months away. “I’m confident that, based on the people I know and the data that’s been generated, that by the summer we’re going to start looking at some microbial genomes assembled de novo from short reads,” he said.
Skiena said that even though the quality of the data from next-generation sequencers is “still a moving target,” he’s confident that de novo assembly is possible with microreads.
“Do I believe you can do de novo assembly of bacteria to an interesting state using reads of length 20? The answer is yes. Exactly what ‘interesting’ means is a separate question, but I am convinced that is doable,” he said.
“The question about higher organisms is maybe a little bit more open, but I still believe that if you work hard enough at the assembly and have high enough coverage, I think you could produce something interesting.”