As sequencing technologies change, a whole host of software, genome assemblers among them, has to change with them. To assemble a genome correctly, researchers need the right software, and the choice of program often depends on the genome itself as well as on the technology used to sequence it. "Sometimes the assembler that's the best for one genome isn't the best for another genome," says the University of Maryland's Steven Salzberg.
Salzberg's team is constantly evaluating genome assembly software and assembling different genomes. "We do it for various collaborators around the country and around the world, and we have contributed to the development of some assemblers," he says. "We try to use whichever one is best, so we don't really stick with just one favorite. We like to be agnostic about it and we like to be as expert as we can in how to run all of them."
Salzberg and his team have organized what he calls a genome assembly "bake-off," the Genome Assembly Gold-Standard Evaluations, or GAGE, to compare the efficiency, accuracy, and viability of various genome assembly software packages for a variety of genomes. Unlike some other genome assembly evaluations, GAGE is run by experts in the field who routinely assemble genomes. "First of all, it takes a lot of work to assemble a genome. You have to have enough computing resources, but it's more than that. You have to have some awareness of various assemblers' strengths and weaknesses. It's certainly not a trivial task," Salzberg says. For that reason, GAGE is not open to participants other than Salzberg and his team, although the data will be made public for anyone who wishes to replicate the results.
In addition, he says, running an assembler on simulated data, as other evaluation efforts do, doesn't really produce useful results. "You really have to look at real data," Salzberg adds. "We think the right way to compare assemblers is to use them on real genomes, and to use genomes of different sizes and different parts of the phylogenetic spectrum. There's no way to make a simulated data set that would capture all the variables." The GAGE team will use several different genomes: Staphylococcus aureus, chromosome 14 of the human genome, and a species of bee. The group is also assembling the genome of the Argentine ant, Linepithema humile, a genome that was published in PNAS in January.
Salzberg's team will compare four leading assembly packages: a Celera assembler, Allpaths-LG, SOAPdenovo, and Velvet. Each genome will be assembled more than once, so that the researchers can get a feel for the software and produce the most accurate and complete assembly possible. Evaluations like this should be repeated regularly, Salzberg says, to keep up with changes in sequencing technologies.
Evaluations like GAGE also serve to remind the biology community of the work that goes into assembling a genome, Salzberg says. "I want the biology community to be aware of what kind of assemblies they can get and what kind of work is involved, so they don't get the impression that you can spend $10,000 and my genome will be basically assembled," he adds. "You can spend $10,000 and you can get a lot of short reads, but producing an assembly will cost you a lot more than that. It's a good reality check."