A genome sequence-assembly "bake-off" organized by a group of US researchers aims to evaluate algorithms that are considered the cream of the crop for large-scale genome assembly.
Organizers of Genome Assembly Gold-Standard Evaluations, or GAGE, describe the effort as "an attempt to produce a realistic assessment of genome assembly software" for short-read sequencing data.
The planning committee comprises researchers from the University of Maryland, Cold Spring Harbor Laboratory, and the National Biodefense Analysis and Countermeasures Center.
Specifically, GAGE will test the merits of the Broad Institute's ALLPATHS-LG, the J. Craig Venter Institute's Celera Assembler, BGI's SOAPdenovo, the European Bioinformatics Institute's Velvet, and the University of Maryland's Contrail.
Earlier this year, Broad published a study that claimed ALLPATHS-LG performed better than BGI's SOAPdenovo, providing results closer to those achieved using reads from capillary-based sequencing technologies (BI 01/07/2011).
GAGE co-organizer Steven Salzberg, director of the Center for Bioinformatics and Computational Biology at UMD, told BioInform that while many in the genomics research field are turning to next-generation sequencing to sequence and assemble novel species, "what most of these scientists aren’t aware of is what they will get if they just pay the minimal amount for some amount of DNA-sequence data and then run an assembler."
The final assembly depends on several factors, he noted, including how much data is generated, which assembly software is used, the read length, and the species.
According to Salzberg, GAGE is a way of "essentially educating the community about what the current state of the art can do in assembling genomes entirely from very short reads."
The competition aims to answer researchers' sequence-assembly questions, such as the amount of sequencing coverage needed and what software and parameters to use.
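As a rough illustration of the coverage question (this calculation is standard practice, not part of GAGE's protocol), average sequencing depth can be estimated as the total number of sequenced bases divided by the genome size:

```python
def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size

# For example, 900 million 100 bp reads over a 3 Gb genome:
print(expected_coverage(900_000_000, 100, 3_000_000_000))  # -> 30.0 (30x coverage)
```

Actual assembly quality at a given depth also depends on read length, error rate, and repeat content, which is part of what GAGE sets out to characterize.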
Salzberg added that the organizers plan to publish their results, including protocols and details about the parameters used, so that the findings can be replicated.
The GAGE team will also provide genomics researchers with "recipes" for both sequencing and running the software, as well as details about methods and software, such as UMD's Quake, that can clean up sequencing data to improve the results of the assemblies.
GAGE is the latest in a series of challenges comparing assembly software. Last year, researchers at the University of California's Santa Cruz and Davis campuses organized the Assemblathon, which aimed to evaluate genome-assembly methods by comparing their effectiveness in assembling a synthetic genome and a real genome.
Across the ocean, researchers at the Centro Nacional de Análisis Genómico in Barcelona, Spain, organized the De Novo Genome Assembly Assessment Project, or dnGASP, to perform the same task but only on a synthetic genome (BI 12/10/2011).
Unlike these challenges, however, GAGE is not a community effort; all the assemblies and software comparisons will be completed by members of the organizing committee and their research groups.
"It's not trivial to run an assembler," Salzberg said. "We decided we would be willing to put the work in and we got a lot of outside collaborators who said they would join us on this. It wasn’t a matter of deciding to limit this but we have enough hands now to do the work [and] we don’t really need more people to help with it."
On the GAGE website, the organizers note that the assemblies and comparisons will be conducted by experts who have assembled hundreds of genomes and evaluated assemblies for more than a decade.
Furthermore, the software will be used to assemble real whole-genome shotgun datasets culled from recent sequencing projects. The genomes, all generated on Illumina's sequencing platform, are human, Staphylococcus aureus, the bee species Bombus impatiens, and the Argentine ant Linepithema humile.
Also unlike other challenges, there isn't a set deadline for GAGE, though Salzberg said the competition is expected to end in a few months.
No Perfect Assembly
Even as GAGE aims to evaluate the performance of the handful of gold-standard programs, the results of the Assemblathon indicate that it can be difficult to identify a winner for such assessments.
Last month, 17 teams from seven countries presented their genome assemblies at the Genome Assembly Workshop in Santa Cruz, Calif. The workshop, which was by invitation only, was sponsored by the Genome 10K project.
In a conversation with BioInform this week, Assemblathon organizer Ian Korf, a professor at UC Davis, said that although some genome assemblers did better than others, there wasn't "a perfect assembly," and to "crown any of them the winner at this point is a bit premature."
The dataset for the challenge was a synthetic genome — originally human chromosome 13 — that was artificially evolved over 200 million years.
Korf said that the teams adopted different strategies and in some cases submitted multiple assemblies that varied depending on the parameters used.
"There are actually many different ways to run an assembler and it's not always clear what the best way is," he explained.
While tools such as SOAPdenovo, ALLPATHS-LG, and ABySS did well, Korf said that differences in performance depended in part on the particular metrics, such as N50, being used for the evaluation.
"If you chose any one particular metric you could find an easy winner," he explained. "But if you start to combine them and say, 'What's the winner overall?' it's a difficult question."
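N50, the metric Korf cites, is the contig length at which contigs of that size or longer account for at least half of the total assembled bases. A minimal sketch of the standard calculation:

```python
def n50(contig_lengths: list[int]) -> int:
    """N50: the length L such that contigs of length >= L
    contain at least half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

print(n50([100, 80, 60, 40, 20]))  # -> 80
```

The metric rewards long contigs but says nothing about correctness: an assembler that joins sequences erroneously can post a high N50 with a wrong assembly, which is one reason combining metrics makes picking an overall winner difficult.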
In addition to assessing the state of the art for sequence-assembly software, the organizers of the Assemblathon hoped to come up with new metrics to evaluate the effectiveness of computational tools.
Korf said the evaluators did develop some new metrics. For instance, one metric involved mapping random pieces of the true genome to the assemblies to see if they contained the sequences. Because these metrics rely on knowing the true genome, however, they cannot be applied directly to real genomes. But Korf said it would be possible to do "facsimiles" that could be used to help evaluate real datasets.
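The fragment-mapping idea Korf describes can be sketched as follows. This is an illustrative simplification, not the Assemblathon's actual implementation: it samples random fragments of the known true genome and checks for exact matches in the assembled contigs, whereas a real evaluation would use an aligner that tolerates mismatches.

```python
import random

def containment_score(true_genome: str, contigs: list[str],
                      n_samples: int = 1000, frag_len: int = 100) -> float:
    """Fraction of randomly sampled true-genome fragments found
    exactly in at least one assembled contig (exact-match proxy
    for how completely the assembly covers the truth)."""
    hits = 0
    for _ in range(n_samples):
        start = random.randrange(len(true_genome) - frag_len + 1)
        frag = true_genome[start:start + frag_len]
        if any(frag in contig for contig in contigs):
            hits += 1
    return hits / n_samples
```

A perfect assembly scores 1.0; missing or misassembled regions lower the score. The "facsimile" for real data that Korf mentions would have to substitute some trusted proxy for the unknown true genome.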
At this point, "it's a wide open field," Korf said, "I think there is a lot to explore in this area and there is a chance for a lot of different winners depending on what the genome is."
A second Assemblathon is in the works, and its organizers plan to release the datasets by May 1. This time, Korf said, they plan to use real genomic data from two species: a parrot and a cichlid, a kind of tropical fish.
Looking ahead to future challenges, Korf noted that there are other areas that the bioinformatics community could address, including metagenomics, transcriptomics, cancer genomes, and so on.
"The first Assemblathon was really focused on the needs of the G10K project," he said. "In the future, there are other kinds of genome assembly assessments that need to be done — not just vertebrate genomes. There are insect genomes, plant genomes, [and] a lot of other things that we could branch into."
Real vs. Synthetic Data
Salzberg said GAGE's organizers opted not to use simulated genomes because assemblies based on synthetic data aren't "particularly informative" and can be "misleading."
He conceded that in the early days of doing assemblies, simulated data had its uses because there weren't many finished genomes. However, he added, "in terms of giving you a good picture of what you'll really get when you try to assemble a genome, it's only marginally better than useless."
That’s because, according to Salzberg, in spite of a concerted effort to simulate all types of errors that are likely to crop up in a real dataset, "when you use simulated data, you still get better results than you get with real data."
Furthermore, he said, "assemblers can be tuned to do well on any particular type of error if you know in advance what type of error might be there or if you can guess."
He also said that the reasons often provided in support of synthetic data — that no real data is available and that the true genome is unknown — don’t hold true for the GAGE challenge.
"We have real data from the human genome and from a couple of bacteria in our competition, which have been finished to a high standard. The truth is known, so why not use that?" he said.
While Korf acknowledged that synthetic genomes can't incorporate some of the challenges that exist in real datasets, he noted that these genomes are not so simple to assemble, as evidenced by the results of the Assemblathon.
Part of the difficulty — and a distinguishing factor for the Assemblathon, according to Korf — is that the data used for the challenge came from a diploid organism.
Challenges that use real data often use sequence from just one chromosome, he said. "The real problem out there in sequencing organisms is that they are diploids ... their two chromosomes aren’t the same and so if you assemble them and try to make one reference that contains both these things in it, it's actually very difficult to do."
Furthermore, he pointed out that although the field has come far, the true genomes of most organisms still remain unknown.
Korf said that although the organizers had planned to use both real and synthetic data for the Assemblathon, the real data wasn't available in time for participants to do the assemblies and submit their results before the Feb. 1 deadline.
However, Korf said, this wasn’t much of an issue. "It would have been very hard to evaluate how well people assembled the real genome because we didn’t have any way to evaluate that, unless we just used the standard measures, which don’t work.
"Knowing the real answer allowed us to really assess what state of the art was and it was very instructive," he added.