Researchers from the St. Petersburg Academic University of the Russian Academy of Sciences and the University of California, San Diego, have developed a method to assess the quality of genome assemblies that they claim improves on existing approaches such as those used for the Assemblathon and the Genome Assembly Gold-Standard Evaluation challenges.
The developers last month published a paper in Bioinformatics explaining how their Quality Assessment Tool, or QUAST, improves on its competition because it can evaluate assemblies that were put together with or without a reference genome and can work with an unlimited number of assemblies at the same time. It also uses a full set of easily understood and interpreted quality metrics, such as the number of contigs, N50, and GC content, and it can evaluate both large and small genomes.
In comparison, methods such as Plantagora, a web-based plant genome assembly simulation platform, and GAGE can only be used to evaluate assemblies that have known reference genomes, the researchers wrote.
Also, unlike QUAST, GAGE's approach can only be used on one dataset at a time. As such, in order to compare the performance of multiple assemblers on the same data, users have to "manually combine output from separate GAGE reports into a table," according to Alexey Gurevich, a doctoral student and one of the authors of the Bioinformatics paper.
Furthermore, Gurevich et al. explain that QUAST is more flexible than the methods used for the Assemblathon challenge, which are "highly focused on the genomes used in the competition" and not easily applied to other genomes. QUAST also uses fewer metrics than Assemblathon does, about 30 compared to more than 100.
Gurevich told BioInform that he began developing QUAST in 2011, while he was interning at the algorithmic biology laboratory at St. Petersburg Academic University, because his colleagues needed a method to assess the performance of a single-cell genome assembler they'd developed, called the St. Petersburg genome assembler, or SPAdes, and to compare its performance to that of similar programs.
Specifically, they wanted a tool that would let them compare multiple assemblies and didn’t need a reference genome, Gurevich said. They also wanted to be able to "evaluate a full range of metrics needed by various users" and to generate detailed statistics "on each contig of an assembly," he said.
Since existing methods didn’t meet all of the lab's criteria, the researchers decided to develop their own using a combination of metrics from tools such as Plantagora and GAGE, the Bioinformatics paper explains.
Gurevich said that he chose the metrics that went into QUAST by studying several existing genome assembly evaluation tools and selecting the "most popular metrics" used in these methods. He also consulted genome assembler developers to get their input on what metrics were useful for them and incorporated feedback from users of an earlier version of QUAST.
The team also developed and incorporated some new metrics into QUAST, according to the paper. One of these is a variation of the N50 metric that combines the Nx metric with Plantagora's approach for computing the number of misassemblies.
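The underlying N50 statistic is the contig length at which contigs that long or longer account for at least half of the total assembly length; the aligned variant described in the paper first breaks contigs into their aligned blocks at misassembly breakpoints and then applies the same calculation. The sketch below illustrates that idea only; it is not QUAST's implementation, and the input format for the aligned-block version is an assumption made for illustration.

```python
def n50(lengths):
    """Return N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

def aligned_n50(contig_alignment_blocks):
    """Aligned-N50-style metric: contigs are first split into their
    aligned blocks (i.e., broken at misassembly breakpoints), then N50
    is computed over the resulting block lengths.

    `contig_alignment_blocks` maps contig name -> list of aligned block
    lengths; this input format is an assumption for illustration."""
    broken = [block for blocks in contig_alignment_blocks.values() for block in blocks]
    return n50(broken)

# Example: five contigs, one of which (c2, 250 kb) is misassembled and
# breaks into 150 kb + 100 kb aligned blocks.
print(n50([300_000, 250_000, 200_000, 150_000, 100_000]))  # 250000
print(aligned_n50({
    "c1": [300_000],
    "c2": [150_000, 100_000],
    "c3": [200_000],
    "c4": [150_000],
    "c5": [100_000],
}))                                                         # 200000
```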
QUAST has about five modules in its pipeline. Gurevich explained to BioInform that the software begins by computing some basic statistics, such as the number of contigs, total assembly length, the largest contig, and the number of contigs longer than a specified threshold.
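A rough sketch of the kind of summary this first step produces, computed directly from contig sequences; the function name and the 500 bp cutoff are illustrative choices rather than QUAST's own.

```python
def basic_stats(contigs, min_length=500):
    """Summarize an assembly given a dict of contig name -> sequence.

    Reports the counts and lengths described above; the 500 bp
    threshold is just an example cutoff.
    """
    lengths = [len(seq) for seq in contigs.values()]
    return {
        "num_contigs": len(lengths),
        "total_length": sum(lengths),
        "largest_contig": max(lengths, default=0),
        f"contigs_>={min_length}bp": sum(1 for l in lengths if l >= min_length),
    }

assembly = {"contig_1": "ACGT" * 300, "contig_2": "GGCC" * 50}
print(basic_stats(assembly))
# {'num_contigs': 2, 'total_length': 1400, 'largest_contig': 1200, 'contigs_>=500bp': 1}
```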
Next, if there is a reference genome available for the analysis, QUAST's contig analyzer module aligns the assemblies to the reference and generates a report that provides detailed information about each contig; for example, whether it's unaligned, misassembled, or ambiguous, he said.
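A much-simplified illustration of that kind of per-contig call, working from precomputed alignment records: the record format and the 1 kb inconsistency threshold are assumptions for this sketch rather than QUAST's contig analyzer logic, and the ambiguous case, where a contig has several equally good placements, is left out.

```python
def classify_contig(alignments, max_gap=1000):
    """Classify one contig from its alignments to the reference.

    `alignments` is a list of dicts with 'ref', 'ref_start', 'ref_end',
    and 'strand' keys, assumed to be sorted by position in the contig.
    This is a toy approximation of the unaligned/misassembled calls
    described above, not QUAST's actual rules.
    """
    if not alignments:
        return "unaligned"
    for prev, curr in zip(alignments, alignments[1:]):
        inconsistent = (curr["ref"] != prev["ref"]
                        or curr["strand"] != prev["strand"]
                        or abs(curr["ref_start"] - prev["ref_end"]) > max_gap)
        if inconsistent:
            return "misassembled"
    return "correct"

print(classify_contig([]))  # 'unaligned'
print(classify_contig([
    {"ref": "chr1", "ref_start": 0,       "ref_end": 5_000,   "strand": "+"},
    {"ref": "chr1", "ref_start": 250_000, "ref_end": 260_000, "strand": "+"},
]))                         # 'misassembled'
```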
QUAST then moves on to its genome analyzer module where it computes metrics such as genome fraction, number of gaps in alignments, and number of genes/operons covered by each assembly, Gurevich said.
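Genome fraction, for example, is the share of reference bases covered by at least one aligned contig. Below is a minimal sketch under the assumption that alignments are supplied as 0-based, end-exclusive reference intervals.

```python
def genome_fraction(aligned_intervals, reference_length):
    """Fraction of the reference covered by at least one alignment.

    `aligned_intervals` is a list of (start, end) reference coordinates
    from all contigs of one assembly; overlapping intervals are merged
    so shared bases are counted only once.
    """
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(aligned_intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / reference_length

# Two overlapping alignments covering 0-150 and one covering 300-400
# of a 1,000 bp reference: 250 bp covered, genome fraction 25%.
print(genome_fraction([(0, 100), (50, 150), (300, 400)], 1_000))  # 0.25
```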
In cases where no reference is available, QUAST skips the contig and genome analyzer steps in its pipeline, he said. Instead, it evaluates the assemblies using metrics such as the number of contigs, total length, largest contig, N50, GC content, and the number of predicted genes.
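Of those reference-free metrics, GC content is simply the proportion of G and C bases across all contigs, as in this minimal sketch:

```python
def gc_content(contigs):
    """Percent of G/C bases over all contig sequences (dict of name -> sequence)."""
    gc = total = 0
    for seq in contigs.values():
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        total += len(seq)
    return 100.0 * gc / total if total else 0.0

print(gc_content({"contig_1": "ACGT", "contig_2": "GGGG"}))  # 75.0
```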
There is also an optional gene prediction module that uses GlimmerHMM and GeneMark.hmm to locate genes in eukaryotic and prokaryotic genome assemblies, respectively, he said.
Once it completes the analysis, QUAST returns reports that include colorful plots for metrics such as GC content and contig alignment, the paper states. It also generates "comparative histograms of several metrics" including the number of complete genes, the number of complete operons, and the genome fraction, the researchers wrote.
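To give a rough idea of what such a comparative plot involves, the sketch below draws a grouped bar chart with matplotlib using made-up values for two hypothetical assemblies; QUAST generates its own report plots, so this is purely illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical metric values for illustration only.
metrics = ["Complete genes", "Complete operons", "Genome fraction (%)"]
assembly_a = [4100, 580, 96.2]
assembly_b = [3950, 555, 93.8]

x = np.arange(len(metrics))
width = 0.35

fig, ax = plt.subplots()
ax.bar(x - width / 2, assembly_a, width, label="Assembly A")
ax.bar(x + width / 2, assembly_b, width, label="Assembly B")
ax.set_xticks(x)
ax.set_xticklabels(metrics)
ax.set_title("Comparative assembly metrics")
ax.legend()
plt.savefig("comparative_metrics.png")
```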
Gurevich told BioInform that the team is currently enhancing QUAST's ability to evaluate metagenomics assemblies. "This means support of multiple references [and] new summary table[s] and plots," he said. The researchers are also improving QUAST's web server, which is currently available in beta, he said.
Commenting on QUAST, Pavel Pevzner, a professor of computer science at the University of California, San Diego, described the tool as an extension of methods like GAGE and Assemblathon because of its ability to evaluate assembly quality in the absence of a reference.
Pevzner, who helped develop SPAdes but not QUAST, added that this capability is "particularly important" for projects that involve single-cell sequencing where finished genomes aren't an option.