A flurry of recent papers has established the possibility of de novo assembly of bacterial genomes from short-read sequencing data and has highlighted the importance of handling base errors in the reads, possibly foreshadowing commercial opportunities in this area.
Last week, Genome Research
published three studies highlighting new methods for de novo
assembly of short-read sequencing data: the Velvet algorithm under development at the European Bioinformatics Institute [BioInform 11-16-07
], a new algorithm from the Broad Institute called Allpaths for assembling paired-end reads, and an algorithm called Edena developed by researchers at Geneva University Hospitals.
But academic groups aren’t the only ones developing new methods for stitching together 35-base-pair reads into entire genomes — a task that many still consider to be too difficult for these technologies. At least one software company is seeking to position itself in this area, offering a short-read de novo assembly product that will run on a desktop computer.
In the next few weeks, DNAStar will roll out a new version of its SeqMan Genome Assembler software with enhanced next-gen sequencing assembly capabilities, Bob Steinhauser, DNAStar’s marketing director, told BioInform. Currently the software can handle de novo assembly for the Roche/454 Life Sciences platform and the new release will be able to assemble de novo datasets for the Illumina Genome Analyzer, he said.
Steinhauser acknowledged the fertile environment for academic development in this area, but noted that the company doesn’t consider freeware to be competition.
“There is going to be a certain audience that will have the structure that they can bring freeware in … and make it work,” he said. “There’s going to be another audience, and we feel a much larger one, that does not want to work with that.” He added that this sizable group — researchers who prefer a product with technical support — is his firm’s target audience.
For those researchers willing and able to work with freeware, there is a growing menu of available tools: at least nine different groups are developing algorithms to tackle this challenge (see table, below).
published by Geneva University Hospitals in Genome Research
last week compared four of these methods: Edena, short for Exact De Novo
Assembler; Velvet; SSAKE, from the Genome Sciences Center of the British Columbia Cancer Agency; and SHARCGS, from the Max Planck Institute for Molecular Genetics in Berlin.
The authors ran these algorithms on two bacterial genome datasets sequenced by the Illumina Genome Analyzer: close to 4 million 35-base-pair reads from the bacterium Staphylococcus aureus and nearly to 12 million 36-base-pair reads from Helicobacter acinonychis.
The authors found that Edena and Velvet outperformed the others “both in terms of assembly quality and required computer resources,” according to the paper.
“There is going to be a certain audience that will have the structure that they can bring freeware in … and make it work. There’s going to be another audience, and we feel a much larger one, that does not want to work with that.”
Edena and Velvet both generated fewer and larger contigs than SSAKE or SHARCGS, the study found. For example, Edena assembled the dataset for Staphylococcus aureus into 1,122 correct contigs with an average length of 2.5 kilobases, while Velvet obtained 1,093 correct contigs at an average length of 2.5 kilobases.
SSAKE and SHARCGS, meantime, produced 2,334 and 3,632 contigs for Staphylococcus aureus,respectively, at less than half the average length of the other two methods. For Helicobacter acinonychis, SSAKE assembled 1,368 correct contigs and SHARCGS put together 628, compared to Edena’s 336 and Velvet’s 340.
Furthermore, Edena and Velvet performed the assemblies in under 20 minutes on a desktop computer, whereas SSAKE and SHARCGS required several hours, and SHARCGS needed a supercomputer.
David Hernandez, a computer scientist and postdoc at Geneva University Hospital’s division of infectious diseases and lead author on the comparison study, explained via e-mail that Edena and Velvet generated better assemblies because “they have the best approach to handle base errors in the reads.”
Edena is based on the same “overlap-layout-assembly” approach that traditional assembly algorithms use, but it includes two key “cleaning operations” to help improve the assembly of short-read sequences: one that removes nodes in the assembly graph that lead to “dead-end paths” resulting from base-calling errors; and another that removes “bubbles” that are caused by polymorphisms in the sequence.
In addition, the researchers explain in the paper, Edena relies on exact matching instead of approximate matching, because it is “drastically faster” and helps reduce the number of spurious overlaps.
Velvet, by comparison, uses a somewhat different graph approach, as EBI’s Daniel Zerbino and Ewan Birney explain in their paper
, which found that the algorithm performed better than SSAKE and another algorithm, VCAKE from the University of North Carolina at Chapel Hill, using Illumina reads for the Streptococcus suis
Rather than using sequencing reads as the primary data structure in the assembly process, Velvet relies on a directed graph representation called a de Bruijn graph, which organizes the data by word lengths, or k-mers, in which each k-mer appears as a single node in the graph regardless of how many times it is observed.
According to the EBI researchers, this approach is better able to handle the high redundancy of short-read sequencing data than the overlap-based approach, and helps avoid mis-assembly errors.
With very short reads, “the sheer number of reads makes the overlap graph, with one node per read, extremely large and lengthy to compute,” they wrote. In addition, short reads in repeats are likely to have very few base differences in overlaps, which could lead to more “ambiguous connections in the assembly.”
Hernandez said he believes that Edena and Velvet are “equivalent in terms of assembly performance” and cited the methods’ graph-based approaches as the key to their improved performance. “The graph structure represents all possible ways to assemble the reads,” he said. “It offers the ability to identify and fix specific features in the graph such as the one caused by errors in reads.”
Likewise, in an e-mail to BioInform, EBI’s Zerbino noted that “the common strength of Velvet and Edena is to correct errors after graph building, meaning that editing decisions are made with contextual knowledge of overlapping reads, instead of just using coverage statistics.”
Nevertheless, Zerbino said he believes that the de Bruijn graph offers some advantages over the overlap-based approach. “Whether a piece of sequence was covered once, 20 times or 5,000 times, only one data structure represents it,” he said. “That facet saves both memory and time.”
Another advantage, according to Zerbino, is that this graph allows “seamless assembly of long and short reads.” Overlap graph techniques “expect” reads of roughly equivalent length, he said, but “in the de Bruijn graph, on the contrary, you can add any kind of sequence information, whether short reads, long reads, pre-assembled BACs, or even reference genomes indifferently. This opens the possibility of mixed-length sequencing.”
As the EBI researchers explain in their paper, Velvet first hashes reads into k-mers, constructs the graph, corrects errors, and resolves repeats. “Velvet concentrates on two things: not disrupting contigs and not losing reads,” according to Zerbino. “The first point is obviously important to produce long contigs, the second is crucial if we want to use paired-end read information. Every read removed from the graph is a pair which disappears.”
Velvet’s error-correction algorithm, called Tour Bus, “examines the topology of the graph and amends structures as little as it can,” explained Zerbino. It first searches for read “tips,” which do not align to anything, and removes the tips shorter than 2,000 base pairs but not the rest of the reads. Then it looks for bubbles, which are parallel branches in the graph that contain similar sequence and start and end at the same points.
“Velvet then remaps the branch with lower coverage onto the majority or consensus branch,” wrote Zerbino. “Optionally, it removes remaining low coverage k-mers which do not look genuine.”
Given the fact that the studies were limited to Illumina technology, one question is how these algorithms might work on other short-read platforms, such as Applied Biosystems’ SOLiD system.
“We have yet to implement and test Velvet for ABI reads, although it should be feasible,” Zerbino said.
He pointed out that these technologies have very different error models, “so it would be hard to emit a serious forecast on those results.”
Hernandez agreed that each platform has its own specificities in terms of error types and rates. However, he wrote, “Edena and Velvet should be able to directly assemble data produced with the SOLiD platform.”
Simulated Data Promises Paired Reads
While the Edena and Velvet studies used experimental short-read datasets for bacterial genomes, another study
published in Genome Research
last week used partially simulated Illumina data to test the applicability of another algorithm, called Allpaths, for assembling paired-end reads.
In the paper, researchers at the Broad Institute explain that the algorithm shows “it is possible to produce very high-quality assemblies based entirely on paired microreads at high coverage.” The scientists plan as a next step to move from simulated data to real data.
Allpaths also represents assemblies as graphs, an approach that “offers the tantalizing prospect of accurately capturing polymorphisms within assemblies themselves, and more generally the systemic capture of ambiguity, regardless of source,” the authors wrote.
“There are fundamental limitations in how far you can go with unpaired data and assembly,” MacCallum, team leader of the new sequencing assembly group at the Broad, told BioInform’s sister publication In Sequence. “We use the paired reads to give us extra information to overcome the difficulties that you have in assembling the short-read data.”
As Hernandez explained to In Sequence, paired-end data could make it possible to assemble data for genomes larger than bacteria, but for now, “de novo assembly for larger genomes is very difficult because there is more repetition.” Also because of the greater depth needed, more reads are required, which makes the undertaking expensive.
Some users still need to be convinced that de novo assembly will be feasible even for bacteria using only short-read data. Christopher Bauser, director of bioinformatics at genomics services firm GATC Biotech, said that his company currently uses a combination of Sanger, Illumina, and 454 sequencing for bacterial genomes and is not considering a change in strategy at this time.
“The [Illumina and SOLiD] reads are still just too short for it to be really useful,” he said. “The problem is always going to be that as soon as you have a repeat longer than about 20 bases, today's state-of-the-art SOLiD and Illumina machines won't be able to sequence over that … The assembly won't be able to handle these regions.”
Bauser acknowledged that “it won’t be long before the SOLiD and the Illumina system will have longer reads,” and noted that “the size of the repeat where they stumble is getting longer and longer.” However, he added, “they’re still stumbling at repeats for de novo sequencing.”
However, these doubts have not dissuaded companies like DNAStar from pursuing a commercial market for short-read de novo sequencing software.
DNAStar’s Steinhauser said that bacterial genomes offer an interesting market segment for the company. “There is a large, large number of people working with bacteria — a big target audience,” he said.
— Julia Karow, editor of In Sequence, contributed to this article.