The results of the second Assemblathon challenge, hosted by researchers at the University of California, Santa Cruz, and UC Davis, indicate that although current de novo genome assembly software can produce useful assemblies, there is still considerable room for improvement.
After evaluating 43 assemblies submitted by 21 participating teams, the challenge organizers and evaluators reported in a paper published this week in GigaScience that although many current software platforms produce assemblies "containing a significant representation of their genes and overall genome structure" there is still a "high degree of variability," with some assemblers performing well when evaluated in the context of a single metric but not when measured with multiple ones. They also found that while some assemblers work well with data from one species, they may not do as well with data from others.
This performance inconsistency was one of the major findings of the challenge, Keith Bradnam, a project scientist at the University of California, Davis, and one of Assemblathon's organizers, told BioInform. It may come as a surprise to many in the genomics community.
"I think a lot of people were hoping and expecting that this paper would come out and say that this is the software that everyone should be using and this is how you use it" but "in a way we say neither of these things … we can't recommend any one piece of software," he said. Furthermore, "it's not always obvious how you use the software because a lot of these entries were submitted by the people who wrote the software so they've got a natural advantage."
The paper does offer some remedies for these consistency problems. The researchers recommend that efforts to assemble eukaryote genomes de novo should involve several assemblies generated with different assemblers and/or parameters, and should consider multiple metrics. Researchers should also select assemblers that are known to work well in areas of interest, such as coverage, continuity, or number of error-free bases, and they should assess heterozygosity levels in the target genome before beginning the assembly. The results have also been discussed at length on Twitter and in blog posts and commentaries written by evaluators and reviewers; a preprint version of the paper was made available earlier this year, and an open conversation ensued on these forums about the findings and what they mean for the current state of genome assembly.
In an interview with BioInform, Mick Watson, director of ARK-Genomics at the University of Edinburgh and one of the manuscript's reviewers, praised the open peer review process. "I think it’s a very good idea to have that open interaction" and "I thoroughly recommend this as a model going forward," he said.
Bradnam expressed similar sentiments about the open review, stating that he found it "very gratifying" that science could be conducted so openly. "I don't think the final paper has suffered because of that. [In fact] I think it's only been enhanced that we had this open discussion," he said. "I hope that will be a model for more cases in the future to try this approach."
The submitted assemblies for this round are available from the Assemblathon website and from GigaDB, as are the assembled fosmid sequences for bird and snake that were used to validate assemblies. The source code for scripts used in the challenge is available on GitHub.
The first Assemblathon began in late 2010, with participants expected to submit their assemblies by February 2011. It is one of several challenges launched within the last few years to evaluate the performance of existing genome assembly pipelines. Other competitions include the Genome Assembly Gold-Standard Evaluations, or GAGE, which aims to evaluate algorithms that are considered to be the state of the art for large-scale genome assembly (BI 4/01/2011). Meanwhile, some other groups are working on new evaluation metrics. One example is FRCbam, a tool that uses feature response curves to compare sequence assembly quality (BI 1/11/2013).
The first Assemblathon involved assembling Illumina reads from an unspecified organism and a second dataset composed of a pair of related virtual organisms whose genomes were created using Evolver, a whole-genome sequence evolution simulator developed by researchers at Stanford University (BI 12/10/2010). The results of that challenge were published last September in Genome Research.
For this second round of challenges, participants were asked to assemble the genomes of three vertebrate species — bird, fish, and snake — sequenced on Illumina, Roche 454, and Pacific Biosciences instruments. None of the particular species — a budgerigar, Melopsittacus undulatus; a Lake Malawi cichlid, Maylandia zebra; and a boa constrictor, Boa constrictor constrictor — had been sequenced before. Each team was allowed to submit one competitive entry for each of the three species, as well as a number of so-called evaluation assemblies for each species that would be analyzed like competitive entries "but would not be eligible to be declared as ‘winning’ entries."
After about four months of work, 21 teams submitted a total of 43 assemblies generated using software such as ALLPATHS, Newbler, SGA, and SOAPdenovo. Evaluators used ten metrics – selected from a pool of about 100 – aimed at capturing different facets of genome assembly quality and accuracy. The list includes NG50 scaffold and contig lengths, fosmid coverage and validity, optical map data, and REAPR summary scores. In addition, the evaluators calculated an average rank as well as z-scores for each metric, the paper states.
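The idea behind combining z-scores across metrics – standardizing each metric across assemblies so that no single metric's scale dominates, then summing – can be sketched as follows. This is an illustrative reconstruction, not the evaluators' actual code; the team names and metric values are invented.

```python
from statistics import mean, stdev

# Hypothetical per-assembly metric values; names and numbers are illustrative only.
metrics = {
    "scaffold_NG50": {"teamA": 4.1e6, "teamB": 1.2e6, "teamC": 2.8e6},
    "fosmid_coverage": {"teamA": 0.97, "teamB": 0.99, "teamC": 0.91},
}

def z_scores(values):
    """Standardize one metric across assemblies: (x - mean) / stdev."""
    mu, sd = mean(values.values()), stdev(values.values())
    return {team: (v - mu) / sd for team, v in values.items()}

# Sum each assembly's z-scores over all metrics to get a combined score.
teams = next(iter(metrics.values())).keys()
combined = {t: sum(z_scores(vals)[t] for vals in metrics.values()) for t in teams}
ranking = sorted(combined, key=combined.get, reverse=True)
```

Because each metric is standardized before summing, a team that excels on one raw metric with huge numbers (such as NG50, in base pairs) cannot swamp a team that does consistently well on proportions such as coverage.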
"Overall, we find that while many assemblers perform well when looking at a single metric, very few assemblers perform consistently when measured by a set of metrics that assess different aspects of an assembly’s quality," they wrote.
Bird assemblies, for instance, "tended to have much longer contigs [and] scaffolds … and had more assemblies that comprised 100 percent or more of the estimated genome size" than the other two species. Bird assemblies also performed better than fish and snake when assessed using optical map data. On the other hand, some metrics indicated that the snake genome on average had the highest scoring assemblies of any of the species, but other metrics indicated that it had the lowest quality of the three.
When the entry genomes were analyzed using the NG50 metrics, the researchers found that within each species, "assemblies displayed a great deal of variation in their total assembly size, and in their contig and scaffold lengths." For example, in one scenario, snake assemblies from two teams had similar scaffold NG50 lengths but different contig NG50 lengths. In another instance, bird assemblies from two teams had similar contig NG50 lengths and different scaffold NG50 lengths.
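For reference, NG50 – unlike N50, which is computed against the assembly's own total length – is the length of the sequence at which the running total of lengths, sorted longest first, reaches half of the estimated genome size; that is why two assemblies can agree on scaffold NG50 yet differ on contig NG50, or vice versa. A minimal sketch:

```python
def ng50(lengths, genome_size_estimate):
    """NG50: length of the sequence at which the cumulative length of
    sequences (sorted longest first) first reaches half of the *estimated*
    genome size. N50 is the same calculation against sum(lengths) instead."""
    half = genome_size_estimate / 2
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return None  # assembly covers less than half the estimated genome

# Toy example: scaffolds totalling 90 against an estimated genome size of 100.
ng50([50, 20, 10, 5, 5], 100)  # → 50
```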
Other assessments that looked at the size of the assemblies in relation to the estimated genome sizes also showed significant variation. For example, several bird assemblies were larger than the expected genome size for the species, with the largest bird assembly containing 167 percent of the previously estimated 1.2 gigabase pairs of sequence. Moreover, a fish assembly from another team contained almost 2.5 times as much DNA as expected — about 246 percent of the estimated 1.0 Gbp, according to the paper. The larger-than-expected size of these assemblies, the researchers wrote, may be the result of errors in the assembly process, or conversely, they may "represent situations where an assembler has successfully resolved regions of the genome with high heterozygosity into multiple contigs/scaffolds."
The evaluators also measured the presence of highly conserved genes in the assemblies. The paper explains that they looked for the presence of 458 core genes that are present in nearly all eukaryotic genomes and as such could be expected to have orthologs in the newly assembled genomes. Specifically, they tested for "70 [percent] or greater presence of each gene within a single scaffold, as compared to a hidden Markov model for the gene" using a tool called CEGMA.
The analysis revealed that "nearly all of the 458 [genes] were found in at least one assembly" from each species. Overall, however, there were variations in the percentage of genes found in all the assemblies — between 85 and 95 percent. Possible reasons for the variability, the paper states, included "fracturing of a given genic region across multiple scaffolds within an assembly, exons lying in gaps within a single scaffold" as well as the likelihood that in some cases CEGMA detected a paralog and not a true ortholog.
Other metrics used to assess the accuracy of the assemblies included looking at fosmid sequences in bird and snake data — this metric wasn’t used for the fish assemblies since there is no fosmid sequence data available for the species. Specifically, the evaluators ran their assessments using sequences extracted from "validated fosmid regions" and a tool called COMPASS to calculate four metrics – coverage, validity, multiplicity, and parsimony. Based on these metrics, the Newbler bird assembly performed the best with the "highest levels of coverage and validity, and lowest values for multiplicity and parsimony among all competitive bird assemblies." On the snake side, four assemblies did well with high coverage and validity values, the paper states.
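The four COMPASS metrics can be understood roughly as properties of the alignments between assembly sequences and the trusted fosmid regions. The sketch below is a simplified illustration of those ideas – merged alignment intervals, redundancy, and assembly "cost" – not the tool's exact definitions, and the interval values are invented.

```python
def compass_like_metrics(alignments, reference_length, assembly_length):
    """alignments: (start, end) half-open intervals on the reference that
    pieces of the assembly align to. Simplified, illustrative formulas."""
    # Merge overlapping intervals to find reference bases covered at least once.
    merged = []
    for start, end in sorted(alignments):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    covered = sum(e - s for s, e in merged)
    aligned = sum(e - s for s, e in alignments)  # counts redundant alignments twice

    return {
        "coverage": covered / reference_length,   # how much of the reference is hit
        "validity": aligned / assembly_length,    # how much of the assembly aligns
        "multiplicity": aligned / covered,        # redundancy among alignments
        "parsimony": assembly_length / covered,   # assembly "cost" per covered base
    }
```

Under these simplified definitions, a strong assembly scores close to 1 on all four values, which matches the paper's characterization of the best fosmid-based results as combining the highest coverage and validity with the lowest multiplicity and parsimony.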
Assemblies were also evaluated based on optical maps – restriction maps from individual genomic DNA molecules that are assembled, de novo, into physical maps spanning entire genomes – constructed for all three species to validate the long- and short-range accuracy of the scaffolds. Assemblies with scaffolds at least 300 Kbp in length were assessed in terms of two global alignment categories — ‘restrictive,’ indicating concordance between the optical map and the scaffolds, and ‘permissive,’ indicating "minor problems in the scaffolds" — and one local alignment category, which "represents regions of the scaffolds that may reflect bad joins or chimeric sequences."
As in other cases, results based on this metric varied. SGA's bird assembly, for instance, had high amounts of restrictive coverage but was ranked eighth overall because it did not perform as well in two other alignment categories. Meanwhile, another assembly, submitted by a team referred to as MLK, "ranked last in terms of the total length of usable scaffold sequence" but placed second "based on the percentage of input sequence that can be aligned to the optical map."
Assessors also evaluated assembly quality with a tool called REAPR, which uses "remapped paired-end reads to produce a range of metrics at each base of the assembly." Specifically, it uses "Illumina reads from a short fragment library to measure local errors such as SNPs or small insertions or deletions," as well as reads from a large fragment library to "locate structural errors such as translocations" and to "detect incorrect scaffolds," the paper explains. At the end of its run, it generates summary scores for each assembly.
The scores generated for the Assemblathon entries again showed significant differences in the quality of different assemblies, with snake genomes scoring higher than both bird and fish. The paper does note that REAPR did not use all of the challenge libraries made available for each species, which means that any assemblies optimized to work with sequences from a library not chosen for the evaluation may have been penalized by the software; that could account for some of the variation in the results.
Besides reporting the assessment results, the paper also comments on the metrics used for the evaluations. The researchers report that although there are "strong, assembly-specific correlations between various metrics, many of these are not shared between different assemblies," suggesting "that it is difficult to generalize from one assembly problem to another." Furthermore, heat-map data and a parallel coordinate mosaic plot used to compare the assemblies revealed "clearly weaker outliers for each species, and that there are few assemblies which are particularly strong across all key metrics," they said.
Also, some metrics may not be the best ones for particular use cases, the researchers wrote. For example, NG50 and the related N50 metric may be "poor predictors of the suitability of an assembly for gene-finding purposes." In one case, a bird assembly was "the third lowest ranked assembly" by NG50 scaffold length even though it comprised "99.2 percent of the estimated genome size in scaffolds that are at least 25 Kbp" – the approximate length of an average vertebrate gene. Similarly, one of the snake assemblies had the second lowest NG50 scaffold length although it contained 80.3 percent of the estimated genome size in "gene-sized length scaffolds."
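The alternative yardstick used in that comparison – the fraction of the estimated genome size held in scaffolds long enough to contain a typical gene – is straightforward to compute. The function name and toy numbers below are ours, for illustration only.

```python
def fraction_in_gene_sized_scaffolds(scaffold_lengths, est_genome_size, min_len=25_000):
    """Fraction of the estimated genome size contained in scaffolds of at
    least min_len bp (25 Kbp ~ an average vertebrate gene, per the paper)."""
    return sum(l for l in scaffold_lengths if l >= min_len) / est_genome_size

# Toy example: 80 kb of an estimated 100 kb sits in scaffolds of 25 kb or more.
fraction_in_gene_sized_scaffolds([50_000, 30_000, 20_000], 100_000)  # → 0.8
```

Unlike NG50, this measure is insensitive to how the long scaffolds are distributed – an assembly of many moderately long scaffolds can score well for gene finding while ranking poorly on NG50.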
For the moment, there are no firm plans for a third Assemblathon, UC-Davis' Bradnam said, and it may be worth waiting a while before launching another to allow developers to modify and improve their tools, he said. However, the organizers are mulling some changes for the next time around. For instance, teams were required to submit their data in the traditional FASTA file format, but a future challenge might involve trying another file format such as FASTG, which is being developed by the Broad Institute, he said. Also, a new challenge could include assembling plant species whose sheer size and complexity pose a significant challenge for genome assembly algorithms, he said.
Other possible future changes include an assessment of the time and computation required to run the assembly algorithms. "Someone may create the best genome assembly program in the world but if it can only be run on an incredibly large computer cluster, which the average scientist doesn’t have access to, [then] it's not very useful," Bradnam said. The organizers could also restrict the amount of sequence data that participants can access and use. This could involve giving groups "virtual budgets" that they would use to shop for the sequence data they need for their projects, he said. Such a feature would be more in keeping with the reality that limited funding controls the amount of sequence and compute power most labs have at their disposal, he added.
It might also be wise to reframe the goals of the competition. One shortcoming of the paper, according to Watson, is that while it's "a good technical description of a process that happened, which has value," it isn't clear about the exact focus of the challenge.
"It’s very simple to think, well we have a lot of sequencing data, we'll throw a lot of assemblers at it and we'll publish the results — but what were they trying to achieve? What was the question? … Were they saying ‘use this assembler for this type of data and use that assembler for that type of data’? What was the research question?” he posited. "With things like Assemblathon, there are so many people involved, it's quite hard to keep a defined focus."
Furthermore, a growing number of researchers now think that there may not be one single assembler that works for every genome, he added. For example, researchers recently published their strategy for assembling the 20-Gb spruce genome. "When you take all of the stuff that was done for the spruce genome, it might not necessarily be relevant to the wheat or some other genomes without making some tweaks," he said. In that sense, "what Assemblathon is trying to reach is perhaps not quite achievable."