Researchers in the computational research and development arm of the Broad Institute have developed a short-read genome assembly algorithm capable of assembling large genomes de novo that they say performs better than BGI's SOAPdenovo, which is currently the standard for such assemblies.
Furthermore, the assemblies generated by the algorithm, dubbed ALLPATHS-LG, are closer to the results achieved using reads from capillary-based sequencing technologies, the Broad team said.
A paper describing the tool as well as details of its use in mouse and human genome assemblies was published online in the Proceedings of the National Academy of Sciences last month.
When the program was used to assemble human and mouse genomes sequenced on the Illumina platform, the team reported that "the resulting draft genome assemblies have good accuracy, short-range contiguity, long-range connectivity, and coverage of the genome."
Specifically, the authors wrote that when their algorithm was used to assemble a human genome, the assembly had a contig length of 24 kb, which is about 4-fold longer than SOAPdenovo's at 5.5 kb, and a scaffold length of 11.5 Mb, which is about 25-fold longer than SOAPdenovo's at 0.4 Mb.
Moreover, ALLPATHS-LG covers about 91 percent of the reference genome, which approaches the 96 percent coverage achieved with Sanger sequencing and the Celera Assembler. By comparison, the SOAPdenovo assembly of short-read data covered about 74 percent of the reference genome.
The differences were less pronounced in the mouse genome assemblies. Both algorithms' assemblies had the same contig length of 16 kb, but ALLPATHS-LG had a scaffold length of 7.2 Mb, compared with SOAPdenovo's 0.3 Mb, and covered nearly 89 percent of the genome, while SOAPdenovo's assembly covered 86 percent.
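For readers who want to check the comparison themselves, the short Python sketch below tabulates the figures quoted above and computes the fold differences and coverage gaps. It is only an illustration: the dictionary layout and the function name `compare` are assumptions made for this example, not part of either assembler, and the inputs are the rounded values reported in this article (with "nearly 89 percent" entered as 89). Ratios computed from rounded values will not match the authors' stated fold changes exactly.

```python
# Figures as quoted in the article (rounded). All names here are
# hypothetical and chosen for illustration only.
REPORTED = {
    "human": {
        "ALLPATHS-LG": {"contig_kb": 24.0, "scaffold_mb": 11.5, "coverage_pct": 91.0},
        "SOAPdenovo": {"contig_kb": 5.5, "scaffold_mb": 0.4, "coverage_pct": 74.0},
    },
    "mouse": {
        # "nearly 89 percent" is entered as 89.0 (an approximation).
        "ALLPATHS-LG": {"contig_kb": 16.0, "scaffold_mb": 7.2, "coverage_pct": 89.0},
        "SOAPdenovo": {"contig_kb": 16.0, "scaffold_mb": 0.3, "coverage_pct": 86.0},
    },
}


def compare(genome):
    """Return ALLPATHS-LG vs. SOAPdenovo ratios for one genome."""
    allpaths = REPORTED[genome]["ALLPATHS-LG"]
    soap = REPORTED[genome]["SOAPdenovo"]
    return {
        # How many times longer the ALLPATHS-LG contigs and scaffolds are.
        "contig_fold": allpaths["contig_kb"] / soap["contig_kb"],
        "scaffold_fold": allpaths["scaffold_mb"] / soap["scaffold_mb"],
        # Difference in reference coverage, in percentage points.
        "coverage_gain_pts": allpaths["coverage_pct"] - soap["coverage_pct"],
    }


if __name__ == "__main__":
    for genome in REPORTED:
        result = compare(genome)
        print(
            f"{genome}: contigs {result['contig_fold']:.1f}x, "
            f"scaffolds {result['scaffold_fold']:.1f}x, "
            f"coverage +{result['coverage_gain_pts']:.0f} points"
        )
```

Keeping the reported values in one table makes it easy to add the Sanger/Celera figures or to swap in the unrounded numbers from the paper.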
One area for improvement is in identifying segmental duplications. ALLPATHS-LG captured only 40 percent of these in both the mouse and human genomes, which is an improvement on SOAPdenovo's 12 percent coverage but still falls short of the 62 percent coverage achieved by the Sanger/Celera assembly.
Another shortcoming for ALLPATHS-LG is the time required to perform the analysis. While SOAPdenovo performed the assemblies on a Dell computer with 48 processors and 512 GB of memory in three days, ALLPATHS-LG took about three weeks on the same hardware. "We anticipate that with algorithmic improvements, [ALLPATHS-LG] can be speeded up, although there may be a trade-off between speed and accuracy," the authors wrote.
ALLPATHS-LG is an extension of ALLPATHS, a program the Broad originally developed in 2008 for assembling smaller genomes (BI 3/21/2008). The "LG" tag on the new version stands for "long genomes."
The researchers wrote that for the new incarnation of ALLPATHS, they included improvements that made it "more resilient to repeats" and they engineered the algorithm to "economize the data structures" and use "shared memory parallelization" among other changes.
ALLPATHS-LG's release is timely, as some researchers have expressed doubts about the possibility of generating good-quality assemblies using current short-read data.
Although next-generation sequencing technologies have made it easier to sequence entire genomes at a per-base cost that’s much lower than Sanger sequencing, the shorter read length has become a thorn in the side of researchers attempting to reassemble the genomes.
In fact, scientists in both the US and Europe recently launched parallel challenges that are in part aimed at evaluating computational methods for assembling genomes de novo (BI 12/10/2010).
A recent paper published in Nature Methods by researchers at the University of Washington and Howard Hughes Medical Institute gave voice to frustrations with current assembly methods. The researchers wrote that when they compared assemblies of human and mouse genomes generated using SOAPdenovo to an experimentally validated reference genome, they found that the assemblies were about 16 percent shorter than the reference and that about 420 megabase pairs of common repeats and about 99 percent of validated duplicated sequences were missing.
The Broad team states in its paper, however, that ALLPATHS-LG offers hope that "considerably better assemblies can be achieved, through improvements in both algorithms and data."
David Jaffe, the director of computational research and development in Broad's genome sequencing and analysis program and one of the authors of the paper, expressed similar sentiments to BioInform.
"We have shown that while it's true at least by this one particular metric [segmental duplication] that we are not as good as the old very expensive assemblies, we are within striking distance," he said.
Jaffe said that his team next plans to focus on improving the quality of the data used in the assemblies. He pointed out that while assembly errors are often a mix of different problems, they are commonly caused by poor sequence quality.
He said that his team works closely with the Broad's molecular biology group with the goal of "getting the data to be the best possible for genome assembly" and that the "end deliverable would be protocols that other people could use."
As an example, he said that his team uses a "recipe" for sequencing genomes, described in the paper, that "isn't necessarily optimized, it's something we found that works."
Jaffe said the team also plans to focus on making the algorithm as "usable as possible" so that researchers who aren’t "assembly experts" can use it.