A new assembly program from the Whitehead Institute may give researchers outside of Celera Genomics their first chance at assembling whole-genome shotgun sequence.
The program, called Arachne, was described in the January 2002 issue of Genome Research. The authors noted that while other assembly programs, such as Phrap, TIGR assembler, Amass, Euler, GigAssembler, and the Celera assembler, have been reported in the literature, only Celera’s assembler has so far been able to handle large and complex eukaryotic genomes, making Arachne the first publicly available tool for the job.
Whitehead researcher David Jaffe, an author on the paper, told BioInform that the Sanger Centre is also developing a whole-genome shotgun assembler called Phusion, but it is not yet publicly available. The two sequencing centers have been comparing their programs against the same data sets in a friendly competition, Jaffe said, “and we’re learning from each other’s assemblies as we go along.” Both programs are currently being used to assemble whole-genome shotgun sequence at 5-6X coverage of the mouse genome, an assembly that Jaffe said is expected to be publicly available in March.
The key difference between Arachne and other assembly methods is its use of pairing information — paired forward and reverse reads from both ends of plasmid clones — to order and orient unique contigs into longer segments called supercontigs (or scaffolds). Programs such as Phrap do not use this pairing information, Jaffe said, and are too slow to scale to larger data sets. Arachne is similar to Phrap, however, in its use of quality scores to ascertain the accuracy of read alignment.
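The idea of using read pairs to group and orient contigs can be illustrated with a toy sketch in Python. This is not Arachne's algorithm; it is a minimal illustration that assumes each read has already been placed on a contig with a known strand, and every name in it (link_contigs, group_supercontigs, min_links, the read and contig IDs) is hypothetical. It relies only on the fact that the two reads from one clone face each other on opposite strands, and it ignores insert sizes, contig ordering, and repeats, all of which a real assembler has to handle.

```python
from collections import Counter, defaultdict


def link_contigs(placements, mate_pairs, min_links=2):
    """Count mate-pair links between contigs and infer relative orientation.

    placements: read_id -> (contig_id, strand), strand is "+" or "-"
    mate_pairs: iterable of (read_id, mate_id) from the two ends of one clone
    Returns {(contig_x, contig_y): (relation, n_links)} for contig pairs whose
    most common relation is supported by at least min_links pairs.
    """
    votes = defaultdict(Counter)
    for read_a, read_b in mate_pairs:
        if read_a not in placements or read_b not in placements:
            continue  # one end was never placed on a contig
        (contig_a, strand_a) = placements[read_a]
        (contig_b, strand_b) = placements[read_b]
        if contig_a == contig_b:
            continue  # both ends in one contig: no linking information
        # Mates face each other, so opposite strands mean the two contigs
        # are already in the same orientation; equal strands mean one of
        # the contigs must be reverse-complemented.
        relation = "same" if strand_a != strand_b else "flipped"
        key = tuple(sorted((contig_a, contig_b)))
        votes[key][relation] += 1

    links = {}
    for key, counter in votes.items():
        relation, n_links = counter.most_common(1)[0]
        if n_links >= min_links:
            links[key] = (relation, n_links)
    return links


def group_supercontigs(contigs, links):
    """Group contigs connected by accepted links (simple union-find)."""
    parent = {c: c for c in contigs}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for contig_x, contig_y in links:
        parent[find(contig_x)] = find(contig_y)

    groups = defaultdict(list)
    for c in contigs:
        groups[find(c)].append(c)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical example data: two pairs link contigs A and B, one pair links B and C.
placements = {
    "r1": ("A", "+"), "r1m": ("B", "-"),
    "r2": ("A", "+"), "r2m": ("B", "-"),
    "r3": ("B", "+"), "r3m": ("C", "+"),
}
mate_pairs = [("r1", "r1m"), ("r2", "r2m"), ("r3", "r3m")]

links = link_contigs(placements, mate_pairs, min_links=2)
supercontigs = group_supercontigs(["A", "B", "C"], links)
print(links)
print(supercontigs)
```

Requiring at least two supporting pairs before joining contigs is a design choice made here for illustration: a single link may come from a chimeric clone or a misplaced read, so the lone pair between B and C is not enough to merge them in this sketch.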
Whitehead has used Arachne to assemble the 40-megabase genome of the fungus Neurospora crassa and is applying it to its other sequencing projects, including the 400-megabase Tetraodon nigroviridis genome and the 180-megabase Ciona savignyi genome. For Neurospora, the Whitehead researchers compared their assembly to four megabases of independently generated finished sequence and found only two discrepancies (99.996 percent accuracy).
Whitehead demonstrated the feasibility of Arachne for mammalian-sized genomes by producing an initial WGS assembly of 4X coverage of the mouse genome in eight days on a single Alpha processor running at 833 MHz and using less than 24 GB of RAM. The authors noted, however, that while the program should be useful for producing initial WGS assemblies of large genomes, “producing high-quality finished sequences of such genomes will require at least some clone-based sequencing.”
The Arachne software package is freely available from the Whitehead website (www.genome.wi.mit.edu/wga) for Compaq Alpha hardware running Tru64 Unix. Source code is also available (ftp://wolfram.wi.mit.edu/pub/wga/Arachne/Arachne_src.tar.gz).