By Julia Karow
Organizers of the Assemblathon genome assembly competition last week launched the second round of the effort, posting sequence data for three vertebrate genomes generated on two next-gen sequencing platforms. Results from this round of the competition, called Assemblathon 2, are expected in early November.
The aim of the bioinformatics contest, organized by Ian Korf and colleagues at the Genome Center at the University of California, Davis, in collaboration with David Haussler's lab at UC Santa Cruz, is to compare genome assembly methods for new types of sequencing data, with an initial focus on the Illumina platform. Participating groups use their own software to assemble a set of sequence data in a limited amount of time.
While Assemblathon 1 involved simulated data for a single genome (an artificially evolved version of human chromosome 13) using synthetic reads modeled after Illumina's data, Assemblathon 2 challenges participants to assemble real data from three unpublished vertebrate genomes that were sequenced either on Illumina alone or on a combination of Illumina and 454 technology.
Future competitions, which might take place annually if not more frequently, will seek to incorporate even more sequencing data types, according to Korf. The results will likely inform how large genome sequencing projects, such as the Genome 10K Project that aims to sequence the genomes of 10,000 vertebrate species, will proceed.
The Assemblathon is one of several projects comparing genome assembly methods, which also include dnGASP, a competition organized by the National Center for Genome Analysis in Barcelona, Spain, and GAGE, a "bake-off" of assemblers run by researchers at the University of Maryland.
Results from Assemblathon 1, in which 17 groups participated, were initially presented in March at a genome assembly workshop at UC Santa Cruz. Assemblathon organizers also presented the results last month at the Biology of Genomes meeting at Cold Spring Harbor Laboratory and have submitted their findings for publication in a scientific journal.
According to Korf, an associate professor of molecular and cellular biology at UC Davis, there was no real winner, though several groups – including the Broad Institute, BGI, and the Wellcome Trust Sanger Institute – did very well. "We had a lot of different ways of measuring how complete a genome is," Korf told In Sequence last week. "Depending on which metrics you use, you could have picked probably 11 different winners. But if you wanted to pick some aggregate, it becomes difficult to figure out who the winner is."
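Korf's point that the choice of metric decides the winner can be shown with a small sketch. The article does not name the metrics the organizers used; N50 and total assembled length are assumed here purely for illustration, and the team names and contig lengths are hypothetical, not Assemblathon results.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig down, first reaches half of the assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Hypothetical contig lengths for two made-up entries (not real data).
assemblies = {
    "team_a": [50, 30, 20],
    "team_b": [40, 40, 35, 5],
}

# Two assumed metrics; the real contest used many more.
metrics = {"n50": n50, "total_length": sum}

# For each metric, pick the entry that scores highest.
winners = {
    name: max(assemblies, key=lambda team: metric(assemblies[team]))
    for name, metric in metrics.items()
}
print(winners)
```

In this toy case team_a has the higher N50 while team_b assembles more total sequence, so each metric names a different winner.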
Also, he said, while some assemblers might perform well with an artificial vertebrate genome, others might be better suited for a plant genome, for example. A common theme, he said, was that those groups considered experts in the field performed better than groups with less expertise in genome assembly.
And while some had predicted that an assembly using artificial data would be too easy, that was not really the case. "People did pretty well, but it's not like people did so well that it was a trivially easy problem," Korf said. "It was still a difficult problem."
The first round focused on artificial reads mimicking the Illumina technology because, per base, that was the cheapest sequencing technology available. But that does not necessarily mean it provides the best bang for the buck.
"The people who are paying the bill for sequencing would love to know: If I have $100,000 to spend, which sequencing technology will give me the best genome?" Korf said. "That is a question they can't answer because even the assembly people don't know the answer. All they can know is, 'How many base pairs can I get per dollar?' And if they use that, they are going to say, 'Illumina.' But it might not be the best thing, it might be that that's not a good use of the dollars, and that's the kind of thing we are trying to figure out."
Round Two
Assemblathon 2 requires participants to assemble, within three months, the genomes of three vertebrate species — a fish, a snake, and a bird — from datasets provided to the project by four research groups. While two genomes were sequenced by Illumina technology alone, the third one has datasets generated on both the Illumina and the Roche 454 platforms. Almost 25 groups have so far expressed an interest in participating.
The Broad Institute provided Illumina mate pair and paired-end reads for the Lake Malawi cichlid, a fish; Illumina, as part of a collaboration with Joe DeRisi at UC San Francisco, generated mate pair and paired-end reads for the red-tailed boa constrictor; and for the common pet parakeet, China's BGI contributed Illumina data and Erich Jarvis at Duke University contributed 454 data.
Among other things, the competition will allow the researchers to investigate the influence of slight differences in the same sequencing technology. For example, the Broad Institute and BGI construct their Illumina libraries slightly differently, and Illumina's snake data includes read lengths and libraries that are not yet available to Illumina customers.
The parakeet data will allow them to compare assemblies from short Illumina reads and from long 454 reads, as well as from a mix of the data. "It will be very, very interesting to see how the longer [454] reads help," Korf said. "It's not something we really know right now, and some people are not prepared to deal with it. But hopefully, everybody will try."
Korf explained that hybrid assemblies from different data types are tricky because different sequencing platforms have distinct error models. For example, 454 data is quite accurate, except for long homopolymeric runs. Illumina's read quality, on the other hand, decreases with the length of the read.
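To make the homopolymer issue concrete, the sketch below flags runs of a single repeated base in a read, the regions where 454 data is said to be less reliable. The function name, the example read, and the minimum run length of four are assumptions chosen for illustration; the article does not say at what length errors become likely.

```python
from itertools import groupby


def homopolymer_runs(read, min_len=4):
    """Return (start, base, length) for every run of one repeated base
    that is at least min_len long. min_len=4 is an arbitrary choice."""
    runs = []
    pos = 0
    for base, group in groupby(read):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


# Made-up read containing a run of six T's and a run of four G's.
example_read = "ACGTTTTTTACGGGGAC"
print(homopolymer_runs(example_read))
```

An assembler combining data types might treat base calls inside such runs with less confidence for 454 reads, while for Illumina reads it might instead down-weight bases toward the end of the read.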
"These different kinds of error models are going to affect how people assemble [the data], and putting that all into the same program is going to be difficult," Korf said. As a result, some scientists prefer to use a single type of data, "but from a logical point of view, people would like to use all the information they can." This also includes information other than sequence data, for example genetic maps, or similarity with other genomes, he added.
Participating groups have until early September to submit their assemblies. In order to evaluate them, Korf and his colleagues plan to use data generated on other platforms that is currently not available to the participants, including data generated on Pacific Biosciences' single-molecule sequencing platform, optical maps, and fosmid sequences.
The Assemblathon team plans to report the results of the second round at the Genome Informatics conference at Cold Spring Harbor Laboratory in early November.
Late in the planning stage, Korf said, he and his colleagues were also offered data generated on Life Technologies' SOLiD platform but decided not to include it in this round because they did not want to overwhelm participants with more than three genome assemblies. But he said that in future Assemblathons, funding for which is currently pending, they would like to include SOLiD data as well.
The current plan is to have a competition at least every year, which Korf said is necessary because the field is changing so rapidly. "The assembly field changes every six months because there is a new technology out," he said. "As soon as we assess where we are, we have changed where we are."