Researchers from Sweden’s KTH Royal Institute of Technology, New York University, and Cold Spring Harbor Laboratory have developed a tool for evaluating de novo sequence assemblers and assemblies that doesn’t require a reference genome.
The researchers published a paper describing the method last month in PLOS One.
Francesco Vezzi, a postdoctoral researcher in the School of Computer Science and Communication at KTH Royal Institute of Technology, told BioInform that the tool, dubbed FRCbam, complements recent assembler-evaluation efforts such as GAGE (Genome Assembly Gold-standard Evaluations) and Assemblathon by drawing on a much broader set of features to determine which assemblers perform better.
More generally, it could help researchers make a decision about which assembler from the more than 20 available to use for their projects, he said.
In the paper, the developers explain that their method does not require the use of a reference dataset — usually required for evaluating de novo assemblies — nor does it use traditional metrics such as NG50, which is a poor predictor of quality, according to the authors.
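For context, NG50 is the length of the shortest contig in the set of largest contigs that together cover at least half of the estimated genome size. A minimal sketch of the calculation is shown below; the contig lengths and genome size are purely illustrative.

```python
# Sketch of NG50: the shortest contig among the largest contigs that together
# cover at least half of the *estimated genome size* (N50 uses the total
# assembly size instead). Toy values below are illustrative only.

def ng50(contig_lengths, genome_size):
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= genome_size / 2:
            return length
    return 0  # the assembly covers less than half of the genome

print(ng50([500_000, 300_000, 120_000, 80_000], 1_200_000))  # -> 300000
```

Because the statistic rewards contiguity alone, an assembly riddled with mis-joins can still post a high NG50, which is the shortcoming the authors point to.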
Instead, FRCbam evaluates the quality of the assemblies and, by extension, the performance of the assemblers by using features such as the position and orientation of reads to identify possible mis-assemblies, the researchers wrote.
These features, which were identified in an earlier study from another research group published in Genome Biology in 2008, are used to compute a feature response curve, which is a metric developed previously by two members of the FRCbam team to compare sequence assembly quality.
This method uses read-layout information — a file describing the positions and orientations of each read — instead of reference data, the researchers said.
A full description of FRCurves is available in a separate PLOS One paper published in 2011. That paper explains that the FRCurve “emphasizes how well an assembler exploits the relation between incorrectly-assembled contigs against gaps in assembly, when all other parameters [such as] read-length, sequencing error, [and] depth are held constant.”
This helps the tool better “capture the trade-off between contigs’ size and quality” compared to more commonly used metrics, which “emphasize contig size while poorly capturing assembly quality,” the researchers wrote.
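The general construction of such a curve can be sketched briefly: contigs are taken largest first and accumulated until their combined feature count exceeds a given threshold, and the fraction of the genome they cover is plotted against that threshold. The toy contig list, genome size, and thresholds below are illustrative assumptions, and the accumulation rule follows the papers' general description rather than FRCbam's exact implementation.

```python
# Minimal sketch of a feature-response curve, assuming contigs are given as
# (length, feature_count) pairs and an estimated genome size is known.
# All input values below are illustrative only.

def feature_response_curve(contigs, genome_size, thresholds):
    """For each feature threshold, report the approximate genome coverage
    reached by the largest contigs whose cumulative feature count stays
    within that threshold."""
    ordered = sorted(contigs, key=lambda c: c[0], reverse=True)  # largest first
    curve = []
    for phi in thresholds:
        covered, features = 0, 0
        for length, n_features in ordered:
            if features + n_features > phi:
                break  # absorbing this contig would exceed the feature budget
            covered += length
            features += n_features
        curve.append((phi, covered / genome_size))
    return curve

# Toy input: (contig length, number of suspicious features flagged in it).
toy_contigs = [(500_000, 3), (300_000, 10), (120_000, 0), (80_000, 25)]
for phi, coverage in feature_response_curve(toy_contigs, 1_200_000, [0, 5, 15, 40]):
    print(f"threshold {phi:>3}: ~{coverage:.0%} of genome covered")
```

An assembly whose curve rises quickly reaches high genome coverage while tolerating few suspicious features, which is the trade-off between size and quality the metric is meant to expose.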
Vezzi explained that while FRCurves could assess the quality of an assembly, the approach was based on the availability of layout files, which limited its application to “overlap-layout-consensus-based assemblers” used for Sanger sequence data — not for de Bruijn-graph-based assemblers, which are more commonly used for next-gen sequencing assembly.
This, according to Vezzi, led to the current PLOS One paper, which explains how FRCbam “extend[s] the FRCurve approach to cases where layout information may have been obscured” and expands its “applicability to a much wider class of assemblers.”
“We have extracted the layout information by mapping reads back to the assemblies,” he explained to BioInform. “In this way we are able to approximate layout and therefore compute all our features and the statistics.”
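In practice, that approximation can be recovered from a standard alignment file. The short sketch below assumes a BAM of reads mapped back to the assembly (for example with BWA) and uses the pysam library to record each read's contig, position, orientation, and insert size as a stand-in for the missing layout file; the file name and library choice are assumptions for illustration, not part of FRCbam itself.

```python
# Sketch: recover an approximate read layout (contig, position, orientation,
# insert size per read) from a BAM of reads mapped back to the assembly.
# "reads_vs_assembly.bam" and the use of pysam are illustrative assumptions.
import pysam

layout = []  # one record per aligned read
with pysam.AlignmentFile("reads_vs_assembly.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped or read.is_secondary or read.is_supplementary:
            continue
        layout.append((
            read.reference_name,
            read.reference_start,
            "-" if read.is_reverse else "+",
            abs(read.template_length),  # TLEN; typically 0 when the mate maps elsewhere
        ))

print(f"approximate layout reconstructed for {len(layout)} reads")
```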
Furthermore, “we are also able to compute features that are highly connected and highly dependent on the data that we are using,” Vezzi said. So for Illumina data “we can develop features that are consistent and predictive for Illumina [sequencers],” he said.
In the future, “we are planning to develop features for optical maps for OpGen [data] and also for [Pacific Biosciences] reads,” he said.
In the current PLOS One paper, the authors evaluate FRCbam using assemblers and data from two competitions — Assemblathon (BI 12/10/2010) and GAGE (BI 4/1/2011). They use the method to rank assemblies provided by programs such as Allpaths-LG, SOAPdenovo, Velvet, and Bambus2.
A comparison of their analysis of the software and data used in the challenges with the reported results showed that FRCbam could easily “separate the best assemblies from the worst ones,” the researchers wrote.
According to the paper, the team tested FRCbam on five real and simulated datasets — Staphylococcus aureus, Rhodobacter sphaeroides, and human chromosome 14 from GAGE; and data from simulated genomes and Boa constrictor from Assemblathon 1 and 2 — all comprised of Illumina paired-end and mate-pair read libraries.
They looked at features such as areas of low and high read coverage for all aligned pairs, areas of low and high paired-read coverage for properly aligned pairs, and a high number of paired-end library reads with their pair mapped to a different contig or scaffold.
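Two of those signals, coverage windows that fall outside an expected range and reads whose mate aligns to a different contig, can be tallied from such an alignment file roughly as follows. The window size, coverage cut-offs, and file name are illustrative assumptions, and FRCbam's own feature definitions are more elaborate.

```python
# Sketch: flag windows with unusually low or high read counts (a coarse
# coverage proxy) and count reads whose mate aligns to a different contig.
# File name, window size, and cut-offs are assumptions, not FRCbam defaults.
from collections import defaultdict
import pysam

WINDOW = 1000          # assumed window size in bases
LOW, HIGH = 5, 120     # assumed per-window read-count cut-offs

win_reads = defaultdict(int)   # (contig, window index) -> reads starting in window
mate_elsewhere = 0             # reads whose pair maps to another contig

with pysam.AlignmentFile("reads_vs_assembly.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped or read.is_secondary or read.is_supplementary:
            continue
        win_reads[(read.reference_name, read.reference_start // WINDOW)] += 1
        if read.is_paired and not read.mate_is_unmapped \
                and read.next_reference_name != read.reference_name:
            mate_elsewhere += 1

low = sum(1 for c in win_reads.values() if c < LOW)
high = sum(1 for c in win_reads.values() if c > HIGH)
print(f"{low} low-coverage and {high} high-coverage windows; "
      f"{mate_elsewhere} reads with their mate on a different contig")
```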
Among other findings, the researchers reported that for the most part, their ranking of assemblers using FRCurves “was close” to the reported results for Assemblathon 1. Their list of the five best assemblers included tools such as Allpaths-LG and SOAPdenovo, which is consistent with the results published for Assemblathon 1.
They also observed problems in the ability of Allpaths-LG and another assembler used by Portugal’s Center for Research in Advanced Computing Systems to compute copy number statistics — points that were also noted by Assemblathon’s evaluation of both tools for that metric.
In the case of the GAGE data, FRCurve analysis showed that Allpaths-LG and the University of Maryland's MSR-CA assemblers performed the best, which largely jibes with findings reported by GAGE's organizers. The researchers also reported that Bambus2 did well in the challenge, consistent with GAGE's results.
Commenting on the method, Adam Phillippy, a researcher at the University of Maryland’s Center for Bioinformatics and Computational Biology, highlighted the use of FRCurves as one of the strengths of the approach.
It “encapsulates different measures of quality in a single curve,” thus providing “a simple quality summary that is useful for comparing multiple assemblies or tuning assembly software to produce the best output for a particular genome,” he said in an email.
Phillippy, who was one of the researchers involved in the Genome Biology study on mis-assembly features, also drew comparisons between FRCbam and a method he co-developed called AMOSvalidate, which also looks for evidence of mis-assembly.
“AMOSvalidate required the position of each read in the assembly to be known [but] many modern assemblers do not report this read placement information, making our tool somewhat obsolete,” he said. “In contrast, the FRCbam approach reconstructs this information by aligning the original reads back to the assembly to estimate their placement.”
As a result, “it reports many of the same features as AMOSvalidate, but in a much more flexible way,” he said.
Overall, based on the results reported by Vezzi et al., Phillippy believes that FRCbam could gain some traction in the genomics community.
“Vezzi and colleagues report that their FRCbam method generates results that are largely in agreement with reference-based evaluations, without the use of a reference,” he said. “Because of this flexibility, I expect it to be a very useful tool for both assembly developers and genomics researchers.”
According to the current PLOS One paper, the method is currently being used to evaluate the spruce genome assembly.