Moving From Simulations to Real Data, Short-Read Assemblers Start Facing Off

Research teams in Switzerland and the US have tested the ability of a number of new assembly algorithms to build bacterial genomes from scratch, according to a pair of papers.
Last week, scientists from Geneva University Hospitals published an article that, for the first time, compared and contrasted four short-read assembly algorithms, including a new one, on experimental short-read datasets from two bacterial genomes sequenced on Illumina’s Genome Analyzer.
Separately last week, researchers at the Broad Institute published a new algorithm and software system for paired-read assembly after testing it on partially simulated paired-read Illumina data.
Both studies, which were published online in Genome Research, are examples of efforts to show that data from short-read sequencing technologies can be used to assemble genomes, or major parts of them, de novo.
“We can do a lot more with short reads than in the past people thought was possible,” said Iain MacCallum, a computational biologist at the Broad Institute and author of one of the studies.
The Swiss researchers, led by David Hernandez, a computer scientist in the genomic research laboratory at the Geneva University Hospitals, used experimental sequencing data to compare for the first time four short-read assembly software applications side by side: Exact De Novo Assembler, or Edena, developed by the Geneva group; Velvet, developed by researchers at the European Bioinformatics Institute; SSAKE, published last year by scientists at the Genome Sciences Center of the British Columbia Cancer Agency; and SHARCGS, published last year by researchers at the Max Planck Institute for Molecular Genetics in Berlin (see table below).
In the comparison, each of the four tools was used to assemble two single-read datasets: almost 4 million unambiguous 35-base-pair reads from Staphylococcus aureus, generated by the Swiss service provider Fasteris, which collaborated on the study; and almost 12 million unambiguous 36-base-pair reads from Helicobacter acinonychis, originally published with the SHARCGS assembler.
Edena and Velvet “clearly performed better than the others” and resulted in fewer and larger contigs, Hernandez told In Sequence last week. In addition, they performed the assemblies in minutes, and on a desktop computer, whereas SSAKE and SHARCGS required more time, and SHARCGS also needed a supercomputer.
The main reason why Edena and Velvet outperformed the others is that “they have the best approach to handle base errors in the reads,” Hernandez explained.
However, even the assembly with the fewest fragments, obtained with Edena in a “non-strict” mode on the H. acinonychis data, still has 302 contigs with an N50 value of 14.2 kilobases and covers only 98 percent of the genome. Nevertheless, the researchers are “very happy with these results,” Hernandez said. “The main surprise was to be able to produce such long contigs from very short read data.”
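For readers unfamiliar with the N50 statistic quoted above, it is commonly defined as the contig length L such that contigs of length L or longer together account for at least half of all assembled bases. The following is a minimal sketch of that common definition, not code from any of the assemblers discussed; the function name `n50` and the example contig lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Uses the common definition: the length of the shortest contig in the
    smallest set of longest contigs that covers at least half of the
    total assembled bases.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (not data from the studies in this article).
example_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_lengths))  # 70: 80 + 70 = 150, half of the 300 total
```

Because N50 is weighted by length, a few long contigs raise it more than many short ones, which is why it is often reported alongside the raw contig count.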
Depth of coverage is important for the assembly performance, he said. For example, since the paper was submitted, he and his colleagues have increased the coverage of the S. aureus genome from the original 48-fold to 81-fold, cutting the number of contigs of the Edena assembly almost in half, to about 560, while doubling the average contig size to 5 kilobases.
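The coverage figures above follow from simple arithmetic: fold coverage is roughly the number of reads times the read length, divided by the genome size. The sketch below illustrates this under stated assumptions. The function names are hypothetical, and the genome size of about 2.8 megabases for S. aureus is an approximate figure assumed here, not one given in the article.

```python
import math


def fold_coverage(num_reads, read_length, genome_size):
    """Approximate average depth of coverage for equal-length reads."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target fold coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# ASSUMPTION: ~2.8 Mb genome; "almost 4 million" reads rounded up to 4 million.
GENOME_SIZE = 2_800_000
READ_LENGTH = 35

print(fold_coverage(4_000_000, READ_LENGTH, GENOME_SIZE))  # 50.0
print(reads_needed(81, READ_LENGTH, GENOME_SIZE))          # 6480000
```

With these rounded inputs the estimate comes out near the 48-fold coverage reported for the original dataset, and reaching 81-fold would take on the order of 6.5 million reads of the same length.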
Most of the gaps in the genomes are due to repeat sequences, he said, as well as to small regions in the genome that are not covered well by the Illumina technology, probably because they form secondary structures.
The size of the target genome is also important for a good short-read assembly: it must not be larger than a few megabases, Hernandez said.
Although he has so far tested Edena only on Illumina data, “there is no reason that it would work differently” on ABI SOLiD data, which he plans to use in the future, he said. Edena’s only requirement is that the reads must all have the same length, he added.
His group now has access to a dataset with paired-end reads as well, but he has not yet implemented a feature that will allow Edena to make use of the paired-end data.
Being able to use paired-end reads will constitute “the main improvement” to Edena in the future, he said.
‘Fundamental Limitations’
Meantime, at the Broad Institute, researchers published details of a new algorithm and software system that already uses paired-read data, setting it apart from the other four tools.
“There are fundamental limitations in how far you can go with unpaired data and assembly,” MacCallum, team leader of the new sequencing assembly group at the Broad, told In Sequence last week. “We use the paired reads to give us extra information to overcome the difficulties that you have in assembling the short-read data.”
The Broad’s algorithm, called Allpaths, was designed as a paired-read assembler from the get-go at a time when paired-end data for Illumina’s platform was not yet available. “We started developing this before we even had a glimmer of getting a working machine,” MacCallum said. 
The paired-read information allows the researchers to assemble small, localized sections of the genome instead of the entire genome at once, MacCallum explained. “Basically, it’s like in silico clone-by-clone sequencing,” he said.


Lacking experimental paired-end data, the scientists took experimental 36-base single-read Illumina data from E. coli, which they artificially turned into paired-read data by assigning reads to each other. Using Allpaths, they assembled these pairs and obtained 58 components — which are similar to contigs — with an N50 size of 145 kilobases that covered 99.1 percent of the genome. Those were connected into a single scaffold, with 12 discrepancies from the reference genome.
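The article does not describe how the Broad team assigned reads to each other, so the following is purely an illustrative toy and not their procedure. It assumes each single read already has a known start position on a reference genome and pairs reads whose starts lie roughly one insert size apart. The function name, the parameters `insert_size` and `tolerance`, and the example positions are all hypothetical, and read orientation is ignored.

```python
from bisect import bisect_left


def pair_reads_by_position(positions, insert_size, tolerance):
    """Pair reads whose start positions differ by insert_size +/- tolerance.

    positions must be sorted in ascending order, and insert_size must be
    larger than tolerance so a read's mate always lies downstream of it.
    Returns a list of (i, j) index pairs; each read is used at most once.
    """
    if insert_size <= tolerance:
        raise ValueError("insert_size must exceed tolerance")
    used = set()
    pairs = []
    for i, start in enumerate(positions):
        if i in used:
            continue
        # First candidate mate at or beyond the lower bound of the window.
        j = bisect_left(positions, start + insert_size - tolerance)
        while j < len(positions) and positions[j] <= start + insert_size + tolerance:
            if j not in used:
                pairs.append((i, j))
                used.update((i, j))
                break
            j += 1
    return pairs


# Hypothetical read start positions on a reference.
starts = [0, 10, 200, 205, 400]
print(pair_reads_by_position(starts, insert_size=200, tolerance=10))
# [(0, 2), (1, 3)]
```

Reads with no partner inside the window, such as the one at position 400 here, simply stay unpaired.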
Since submitting its article, the Broad team has moved on to experimental paired-end Illumina data from bacterial genomes, showing its first results — assemblies of S. aureus, E. coli, and M. tuberculosis — at the Advances in Genome Biology and Technology meeting on Marco Island last month. According to the meeting abstract, they achieved “excellent long-range continuity and very high base accuracy.”

According to MacCallum, Allpaths is “completely agnostic as to the type of short reads,” and could also be used to assemble paired-end data from 454’s sequencer or the ABI SOLiD system, for example.

One challenge that all assembly teams grapple with, he said, is the error profiles of the new sequencing technologies, which are not yet well understood and change constantly as the technologies improve.

“It is very important to know what [the errors] are, but it’s difficult to actually pin it down at this moment,” said MacCallum. “It’s a moving target.”

The team’s next task is to assemble a fungal-sized genome from experimental data, “and we are interested in taking it further if we can,” he said.

Other research groups are also working on new short-read assemblers. Inanc Birol and colleagues at the Genome Sciences Center of the British Columbia Cancer Agency in Vancouver, for example, have developed Assembly By Short Sequences, or ABySS, and used the recent AGBT meeting to present assemblies of single- and paired-read Illumina data from human BAC clones.

ABySS mainly differs from the other algorithms in that it uses a spatial description of sequence data, Birol told In Sequence by e-mail this week. “Most significantly, such a description gives us flexibility to parallelize our assembly,” he said, meaning that it can be scaled up. He and his colleagues are already using ABySS in a production pipeline for bacterial genomes, he said.

Another researcher, Steve Skiena of the State University of New York at Stony Brook, has been developing an algorithm that, like Allpaths, uses paired-end data and is aimed at data from both ABI’s SOLiD and Illumina’s Genome Analyzer.

So far the algorithm, called Shorty, has been tested on simulated SOLiD data of bacterial genomes at 55-fold coverage and has generated assemblies with N50 sizes of about 25 kilobases, Skiena told In Sequence by e-mail. He said he plans to complete assemblies on real datasets “in the near future.”

In addition, researchers in Australia have been working on an algorithm that can use short paired-end reads from any system, as long as they are of equal length, to assemble BACs from complex eukaryotic genomes.

“The assembly of reads obtained from complex eukaryotic DNA poses additional problems” compared to bacterial genomes, Mike Imelfort, who is developing the program at the University of Queensland in Brisbane, told In Sequence in an e-mail message. His algorithm’s graph-building method is most similar to Edena’s, he said, and, like Allpaths, it tries to extend “seed” reads.

So far, he has tested the algorithm on simulated datasets of eukaryotic BACs. “However, real BAC sequence data from the SOLiD system exhibits unusual error patterns” that “may just be a feature of our data and not representative of the SOLiD system,” he said.

At the moment, he and his colleagues are “resolving the error model” of the SOLiD reads and plan a “similar analysis using Solexa data very soon.”

“The most important aspect of producing an inexpensive and reliable assembly algorithm is understanding the error patterns of the data and trying to minimize the effect that errors have on the output,” Imelfort said. “We believe that our method is the only one which can assemble complex eukaryote genome sequence, although this is a very fast moving area.”

It is unclear which algorithm will be most widely used in the future, and it may be that a combination of several will yield the best results, according to Skiena. For example, Hernandez and his colleagues obtained the fewest and largest contigs with a combination of Velvet and Edena, he pointed out, showing that “there is more work to be done before anyone can claim to have the ultimate assembler.”

Meantime, Imelfort recommends that scientists review available methods when they generate their sequence data and consider reassembling the data once better methods evolve.

MacCallum agreed that the field has not settled. “De novo assembly from short reads is perhaps something people had thought would not be possible,” he said. “But the slew of papers that have appeared recently suggest otherwise, and hopefully, out of all this, something will appear which will do the job.”

Algorithms and Programs for De Novo Assembly of Short Reads

Name | Developers | Publication
ABySS (Assembly By Short Sequences) | Inanc Birol, Steven Jones, et al., Genome Sciences Center, British Columbia Cancer Agency | (none listed)
Allpaths | Jonathan Butler, Iain MacCallum, David Jaffe, et al., Broad Institute of MIT and Harvard | Genome Res. 2008 Mar 13 [Epub ahead of print]
Edena (Exact DE Novo Assembler) | David Hernandez et al., Genomic Research Laboratory, Geneva University Hospitals | Genome Res. 2008 Mar 10 [Epub ahead of print]
N/A | Michael Imelfort, David Edwards, et al., University of Queensland, Brisbane | Unpublished
SHARCGS (SHort read Assembler based on Robust Contig extension for Genome Sequencing) | Juliane Dohm, Heinz Himmelbauer, et al., Max-Planck-Institute for Molecular Genetics, Berlin | Genome Res. 2007 Nov;17(11):1697-706. Epub 2007 Oct 1.
Shorty | Steven Skiena, State University of New York at Stony Brook | Under review
SSAKE (Short Sequence Assembly by K-mer search and 3' read Extension) | Rene Warren, Robert Holt, et al., Genome Sciences Center, British Columbia Cancer Agency | Bioinformatics. 2007 Feb 15;23(4):500-1. Epub 2006 Dec 8.
VCAKE (Verified Consensus Assembly by K-mer Extension) | William Jeck, Corbin Jones, et al., University of North Carolina at Chapel Hill | Bioinformatics. 2007 Nov 1;23(21):2942-4. Epub 2007 Sep 24.
Velvet | Daniel Zerbino and Ewan Birney, European Bioinformatics Institute | Genome Res. 2008 Mar 18 [Epub ahead of print]
