US, European Teams Launch Parallel Challenges to Improve Computational Methods for Genome Assembly


By Uduak Grace Thomas

Scientists in the US and Europe are organizing two separate community challenges aimed at evaluating computational methods for assembling genomes sequenced on next-generation sequencing platforms, either de novo or by mapping them to a reference.

The Assemblathon, hosted by researchers at the University of California, Santa Cruz, and UC Davis, aims to evaluate genome assembly methods by comparing their effectiveness in assembling a synthetic genome and a real genome.

Similarly, researchers at the Centro Nacional de Análisis Genómico in Barcelona, Spain, are organizing the De Novo Genome Assembly Assessment Project, or dnGASP, to perform the same task using a synthetic genome.

Comparable challenges have kicked off recently with the aim of evaluating other computational tools for sequence analysis. For example, scientists at UC Berkeley and the University of Maryland launched the Critical Assessment of Genome Interpretation to evaluate the effectiveness of computational methods used to predict the impact of genomic variants on phenotypes (BI 11/12/2010).

David Haussler, a UCSC professor of biomolecular engineering and an Assemblathon organizer, explained to BioInform that the challenge is designed to address current issues in the sequence assembly arena as well as to identify tools and algorithms that best meet researchers' demands.

"We are looking for more contiguity," he said. "Researchers want … [to] have scaffolds that cover the large majority of the genes … we don’t want genes that are broken up into different contigs that are not related in the assembly."

Ultimately, he said, "we would love to get to the point where we have not more than several dozen scaffolds per chromosome," but he noted that working with short reads produced on the Illumina and Life Technologies SOLiD platforms makes this challenging to achieve.

Haussler added that understanding the "gene order along the chromosome" is a second key area for the community to address via improved assembly tools.

"We would like scaffolds that span large parts of the chromosome … [including] multiple genes within the chromosome," he said. "Ideally, for vertebrate genomes … we want the size of the scaffolds to be on the megabase range, typically."

Haussler said that the challenge will provide a better understanding of the community's pulse in terms of "how sophisticated are we at using the technology that’s currently available to generate reads [and] how accurately can the most accurate programs assemble these genomes from a novel vertebrate species."

Assemblathon: Real and Synthetic Genomes

Assemblathon participants will have access to two datasets comprising a total of three genome sequences.

The first dataset will be a set of Illumina reads from an unspecified organism. In this challenge, participants will be expected to use their methods to assemble the real genome from scratch.

The second dataset contains a pair of related "virtual organisms" whose genomes were artificially evolved using Evolver, a whole-genome sequence evolution simulator developed by researchers at Stanford University.

One of the synthetic genomes is fully assembled and participants can choose to assemble the second genome from scratch or use the assembled sequence data as a reference.

The data for all three genomes were generated by research groups at UCSC, UC San Francisco, UC San Diego, and the National Cancer Institute's Center for Cancer Research.

Data for both synthetic genomes is currently available on the Assemblathon website, but as of the time this report was filed, the real genome dataset had not been posted.

Assemblathon participants are expected to submit their assemblies for all datasets by Feb. 1, 2011.

The submissions will be evaluated by a group of assessors from the UC Davis Genome Center including Ian Korf, a professor at UC Davis.

Korf, who is also one of the challenge organizers, told BioInform that the predictions will be assessed using a variety of current metrics but that part of the evaluative process will involve identifying new methods for evaluating the effectiveness of the computational tools.

Assemblathon results will be presented at the Genome Assembly Workshop that will be held in March 2011 in Santa Cruz, Calif. The workshop, which is by invitation only, is sponsored by the Genome 10K project.

dnGASP: A Simulated Genome

The dnGASP project, meantime, is one component of the Sequence Mapping and Assembly Assessment Project, which is a collaborative effort among researchers to compare and evaluate methods and strategies for de novo genome assembly and for RNA-seq read alignment.

The RNA sequence alignment challenge, now in its third round, is organized by the Wellcome Trust Sanger Institute and the Center for Genomic Regulation in Barcelona, Spain.

Participants in dnGASP are provided with a simulated genome and a set of reads to assemble. Submissions will be due sometime before the end of this year.

Submissions will be evaluated using the standard measures of assembly quality such as N50, N90, and largest contig size. Additionally, the assessors will measure the ability to bridge different types of repeats varying in size, total length, copy number within the genome, and amount of variation. They also plan to analyze other factors that might impact assembly quality such as the SNP rate.
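
As a rough illustration of the standard measures named above, the following Python sketch computes N50, N90, and largest contig size from a list of contig lengths. It is not the assessors' evaluation code; the function names and the example lengths are hypothetical, and it uses the common definition of Nx as the length of the shortest contig in the smallest set of longest contigs that together cover at least x percent of the assembly.

```python
def nx(lengths, percent):
    """Return the Nx statistic for a list of contig lengths.

    Nx is the length of the contig at which the running total, taken over
    contigs sorted from longest to shortest, first reaches `percent`
    percent of the total assembly length.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


def summarize(lengths):
    """Collect the basic contiguity measures for one assembly."""
    return {
        "n50": nx(lengths, 50),
        "n90": nx(lengths, 90),
        "largest": max(lengths),
        "contigs": len(lengths),
    }


# Hypothetical contig lengths, in bases.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(summarize(contigs))
```

For these example lengths, which total 300 bases, the two longest contigs reach half of the assembly, so N50 is 70; five contigs are needed to reach 90 percent, so N90 is 30.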

Organizers of dnGASP could not be reached to provide additional details.

The results of dnGASP will be presented at a workshop in Barcelona in April next year, organized and hosted by the International Center for Scientific Debate.

'The Heart of the Matter'

Haussler said that the Assemblathon organizers selected two datasets "that get at the heart of the matter" because they anticipate that only a few teams "can seriously accept" the challenge.

"If we provided a dozen different challenge problems, there is a good chance that each team would pick a different challenge problem and then there is no chance for comparison," he said. "Simple is better often … if the instructions are to download this sequence, run your assembler, give us the answer, then that encourages people, too."

In addition to figuring out which genome assembly programs work best for research needs, as well as ensuring that new and old methods can keep up with new sequencing technologies, Korf said that the challenge aims to identify new metrics that can be used to "assess genome completeness."

Korf and his colleagues note on Assemblathon's website that it isn't always easy to define which programs produce the best assemblies because an assembler that might work well for assembling a high-repeat-content genome, for example, might not do as well in other situations.

As a result, one of the goals for the Assemblathon is to come up with new metrics to assess the quality of genome assemblies that complement existing ones such as N50 contig size, average contig length, number of contigs, and so on.

Furthermore, the synthetic genomes project is expected to provide a clearer picture of how well the computational tools perform, since the complete sequence assembly is known and can be compared to the predicted assemblies.

The reads for the in silico assembly were derived from a mixture of paired reads and mate pairs and were designed to simulate real Illumina reads.
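
The article does not describe how the simulated reads were produced, so the Python sketch below is only a simplified illustration of what sampling paired reads and mate pairs from a known genome can look like. The function names, insert sizes, and read length are assumptions, no sequencing-error model is applied, and the read orientations (inward-facing for paired reads, outward-facing for mate pairs) follow the usual Illumina convention rather than anything stated by the organizers.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def sample_pair(genome, insert_size, read_len, rng, mate_pair=False):
    """Sample one read pair from a random fragment of `genome`.

    Paired reads face inward: the first read is the start of the fragment
    and the second is the reverse complement of its end. Mate pairs are
    modeled here by flipping both reads so that they face outward.
    """
    if not read_len <= insert_size <= len(genome):
        raise ValueError("need read_len <= insert_size <= len(genome)")
    start = rng.randrange(len(genome) - insert_size + 1)
    fragment = genome[start:start + insert_size]
    first = fragment[:read_len]
    second = reverse_complement(fragment[-read_len:])
    if mate_pair:
        first, second = reverse_complement(first), reverse_complement(second)
    return first, second


rng = random.Random(0)
# Hypothetical stand-in for a synthetic genome.
genome = "".join(rng.choice("ACGT") for _ in range(500))
paired = sample_pair(genome, insert_size=200, read_len=50, rng=rng)
mates = sample_pair(genome, insert_size=300, read_len=50, rng=rng, mate_pair=True)
```

Mixing short-insert pairs with longer-insert mate pairs gives an assembler both local coverage and longer-range links for scaffolding.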

The synthetic genomes were generated using Evolver, a suite of programs developed by Robert Edgar and colleagues at Stanford University to simulate the evolution of genomic sequences over time.

Dent Earl, a PhD student in Haussler's lab at UCSC who worked on creating the synthetic genomes, explained to BioInform that the team used a portion of a well-known genome to generate the initial genome sequence as well as other features such as a mobile element library and annotations like untranslated regions, tandem repeats, exons, and CpG islands.

He explained that Evolver generates a constraint model for the genome based on the "annotations of the infile set, and this constraint model evolves through the course of the simulation along with the rest of the genome."

The simulation is broken up into a sequence of steps, with each step containing an inter-chromosomal and an intra-chromosomal module.

In the inter-chromosomal module, Evolver proposes large-scale events, such as reciprocal translocations between chromosomes, that could take place at different locations in the chromosome, while the intra-chromosomal module handles things like base-level substitutions and insertions and deletions.
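
To make the two-module structure concrete, here is a toy Python sketch of a single simulation step. It is not Evolver's code or algorithm: the function names, rates, and uniform mutation model are all assumptions, and it ignores the constraint model and annotations that Evolver tracks.

```python
import random

BASES = "ACGT"


def reciprocal_translocation(chromosomes, rng):
    """Inter-chromosomal event: two chromosomes exchange their tails."""
    a, b = rng.sample(range(len(chromosomes)), 2)
    cut_a = rng.randrange(1, len(chromosomes[a]))
    cut_b = rng.randrange(1, len(chromosomes[b]))
    head_a, tail_a = chromosomes[a][:cut_a], chromosomes[a][cut_a:]
    head_b, tail_b = chromosomes[b][:cut_b], chromosomes[b][cut_b:]
    chromosomes[a] = head_a + tail_b
    chromosomes[b] = head_b + tail_a


def mutate_bases(seq, rng, sub_rate, indel_rate):
    """Intra-chromosomal events: substitutions, deletions, insertions."""
    out = []
    for base in seq:
        r = rng.random()
        if r < sub_rate:
            # Substitution: replace with a different base.
            out.append(rng.choice([x for x in BASES if x != base]))
        elif r < sub_rate + indel_rate / 2:
            # Deletion: drop this base.
            continue
        elif r < sub_rate + indel_rate:
            # Insertion: keep the base and add a random one after it.
            out.append(base)
            out.append(rng.choice(BASES))
        else:
            out.append(base)
    return "".join(out)


def evolve_step(chromosomes, rng, translocation_prob=0.1,
                sub_rate=0.01, indel_rate=0.002):
    """Run one step: the inter-chromosomal module, then the intra-chromosomal one."""
    chromosomes = list(chromosomes)  # leave the caller's list untouched
    if len(chromosomes) >= 2 and rng.random() < translocation_prob:
        reciprocal_translocation(chromosomes, rng)
    return [mutate_bases(c, rng, sub_rate, indel_rate) for c in chromosomes]


rng = random.Random(1)
# Hypothetical two-chromosome genome.
genome = ["ACGTACGTAC", "TTTTGGGGCC"]
next_genome = evolve_step(genome, rng)
```

Calling `evolve_step` repeatedly along each branch of a tree would yield a family of related genomes, which is the general idea behind the pair of "virtual organisms" described earlier.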

Earl said that once the genome development team entered the genome sequence for the challenge, as well as a phylogenetic tree in the Newick format, a set of Python scripts that he wrote "interact with our cluster batch-queuing system in order to run all of the Evolver programs with the proper input and output, in the proper order, and with as much parallelism as is possible."

Once the simulation was complete, Evolver provided genomes for the species at the end of the tree as well as for species at each step of the tree all the way back to the original input genome sequence.
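
For readers unfamiliar with the Newick format mentioned above, the following Python sketch parses a simple Newick string and lists every node in the tree, which corresponds to the set of points at which a simulation of this kind could report a genome. The parser handles only unquoted names and branch lengths with no whitespace or comments, and the example tree and its labels are hypothetical, not the tree used for the Assemblathon.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Supports unquoted node names and optional ':length' suffixes only.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # consume ")"
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1  # consume ":"
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError("unexpected text at position %d" % pos)
    return root


def walk(node):
    """Yield every node, children before parents."""
    for child in node["children"]:
        yield from walk(child)
    yield node


# Hypothetical tree: two leaf species, their ancestor, an outgroup, and the root.
tree = parse_newick("((A:0.1,B:0.2)AB:0.3,C:0.4)root;")
all_nodes = [n["name"] for n in walk(tree)]
leaves = [n["name"] for n in walk(tree) if not n["children"]]
```

In this example, the leaves `A`, `B`, and `C` play the role of the species at the end of the tree, while `AB` and `root` stand in for the intermediate and original genomes.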

The real genomic data will come from an unspecified, previously unsequenced organism. Joe DeRisi, a professor at UC San Francisco, is leading the effort to generate this dataset.

For future Assemblathons, Korf hopes to incorporate data from sequencing technologies other than Illumina, such as Pacific Biosciences' platform, as well as to provide more than just two sets of data.

He also plans to apply for funding for future challenges.

Haussler anticipates that a major discussion point at the workshop in March will be whether the community needs better algorithms or better data.

"If the data is just not enough to assemble consistent scaffolds over a long section of chromosome then we need to have technology that gives us longer reads or some kind of other scaffolding information," he said, adding that there "will be a big debate about what that other technology might be."

Furthermore, he said, the challenge will reveal the community's strengths and weaknesses: "What in general are our algorithms able to do well and where are they falling down?"
