Researchers at the Wellcome Trust Sanger Institute have published a new method for assembling large genomes de novo that requires less compute memory than current assemblers like Velvet and SOAPdenovo.
According to a Genome Research paper describing the method, the secret to the String Graph Assembler's reduced memory footprint is that it uses compressed data structures to "exploit the redundancy" in sequence reads and to "substantially lower the amount of memory required to perform de novo assembly."
SGA relies on an algorithm its developers published in 2010 that constructs an assembly string graph from a so-called “full-text minute-space” index, or FM-index, which enables searching over a compressed representation of a text. Unlike other short-read assemblers that rely on the de Bruijn graph model, which breaks reads up into k-mers, the string graph model “keeps all reads intact and creates a graph from overlaps between reads,” the Sanger team wrote.
SGA “is one of the first assemblers to implement a string graph approach for assembling short reads and the first assembler to exploit a compressed index of a set of sequence reads,” they added.
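The key property of the FM-index mentioned above is that exact-match queries can be answered directly on the Burrows-Wheeler transform of the text, without decompressing it. As an illustration only — a toy sketch, not SGA's implementation, and with naively recomputed occurrence counts that a real index would precompute and sample — a backward search might look like this:

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (fine for a toy example)."""
    text += "$"  # unique end-of-string sentinel, lexicographically smallest
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def fm_count(bwt_str, pattern):
    """Count occurrences of `pattern` via FM-index backward search over the BWT."""
    # C[c]: number of characters in the text lexicographically smaller than c
    sorted_chars = sorted(bwt_str)
    C = {c: sorted_chars.index(c) for c in set(bwt_str)}

    def occ(c, i):
        # occurrences of c in bwt_str[:i]; a real FM-index samples this table
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):       # extend the match one character at a time
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0                  # pattern does not occur
    return hi - lo

print(fm_count(bwt("ACGTACGT"), "ACG"))  # -> 2
```

In a production index the BWT of the read set is stored compressed and the `occ` counts are sampled, so each query step takes near-constant time while memory scales with the compressed, redundancy-exploiting representation rather than the raw reads.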
Richard Durbin, joint head of human genetics at the institute and co-developer of SGA, explained to BioInform that the approach is a bit of a throwback to Sanger sequencing, since overlap-based string graphs were employed by the software used to assemble the longer reads produced by capillary sequencers.
"We thought it would be good to look back at the overlap-based assembly methods" used to assemble genomes from capillary reads, and "try to make them as efficient as the de Bruijn methods," Durbin explained. He added that this approach makes even more sense now that sequencing instruments like the Pacific Biosciences RS are generating longer reads.
In the paper, Durbin and co-developer Jared Simpson report that SGA successfully assembled 1.2 billion human genome sequence reads using 54 GB of memory, compared with the 118 GB that SOAPdenovo required for the same task.
SGA's approach also addresses some of the shortcomings faced by commonly used assemblers such as SOAPdenovo, Velvet, and ABySS that rely on de Bruijn graphs for their assemblies.
The researchers explain in the paper that while de Bruijn assemblers require “complicated read-threading algorithms” to recover genomic data that's lost when reads are broken into k-mers, SGA's string graph model approach avoids these issues altogether by keeping the reads.
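To illustrate the distinction — again a toy sketch, not SGA's code — the following contrasts de Bruijn-style k-mer decomposition, where a repeated k-mer leaves the traversal order ambiguous, with a string-graph-style overlap test that keeps reads whole:

```python
def kmers(read, k):
    """De Bruijn-style decomposition: the read dissolves into overlapping k-mers."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def overlap(a, b, min_len):
    """String-graph-style edge test: longest suffix of `a` that is a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

# "ATG" occurs twice in this read, so its de Bruijn graph branches there and
# the k-mer set alone no longer records the original traversal order ...
read = "ATGCATG"
assert kmers(read, 3).count("ATG") == 2

# ... whereas intact reads resolve it directly: a suffix-prefix overlap of
# length >= min_len becomes an edge in the string graph.
assert overlap("ATGCATG", "CATGTT", min_len=3) == 4
```

Recovering that lost ordering is what the "read-threading" step in de Bruijn assemblers has to do after the fact; the string graph never discards it in the first place.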
Furthermore, it produces contigs that are "highly accurate and contiguous, while covering 95 percent of the reference genome," they wrote.
Results from comparisons of SGA with Velvet, ABySS, and SOAPdenovo using a C. elegans dataset showed that SGA's assembled contigs covered 95.9 percent of the reference genome, while Velvet, ABySS, and SOAPdenovo covered 94.5 percent, 95.6 percent, and 94.8 percent, respectively.
SGA also required only 4.5 GB of memory to assemble the C. elegans dataset, compared with the 14.1 GB, 23 GB, and 38.8 GB required by ABySS, Velvet, and SOAPdenovo, respectively.
SGA is slower than its counterparts, however: it took 1,427 CPU hours to complete a human genome assembly, while SOAPdenovo required 479 CPU hours.
"We explicitly trade off a bit longer CPU time for lower memory usage," Simpson, a doctoral student in Durbin's lab, told BioInform. "We feel that fits better into most of the clusters that are available right now."
On the other hand, SGA is parallelizable, so its most compute-intensive activities — error-correcting reads and building the FM-index of corrected reads — can be distributed across a compute cluster to reduce run time, the researchers explain in the paper.
According to SGA's developers, compression-based sequence analysis algorithms will become "increasingly important" as the number of full genome datasets continues to rise.
The investigators entered SGA into the first Assemblathon challenge hosted by researchers at the University of California, Santa Cruz, and UC Davis last year (BI 12/10/2010). The results of the challenge were published in a separate Genome Research article last September.
Ian Korf, a professor at UC Davis and one of the Assemblathon organizers, told BioInform in an e-mail that SGA was "one of the best" assemblers of the contest.
In the contest, SGA placed third out of seventeen groups, behind the Broad Institute's ALLPATHS-LG and SOAPdenovo, and had the "largest scaffold path N50, the lowest number of substitution errors, and the second lowest number of structural errors," Durbin and Simpson wrote in their paper.
These results bode well for the prospects of overlap-based approaches in the sequence assembly arena.
"I think de Bruijn is conceptually simple and that’s always good," Durbin said. However, an overlap-based approach "has got a conceptual advantage in that it uses the full read length more completely."
He explained that while de Bruijn assemblers like Velvet require a separate processing step after completing the genome assembly, SGA does not, potentially avoiding "some of the errors or incomplete analysis" that can occur in that extra step.
This reduced error risk plus its lower memory requirement ensures that tools like SGA have a "future," Durbin said.
For their next steps, Durbin and Simpson are adapting SGA to work with longer read data from the Roche 454 sequencer and the Life Technologies Ion Torrent Personal Genome Machine. They are also exploring ways of discovering variants using the program.
The approach could also be used to analyze metagenomic data, the researchers said in the paper.