JGI Researcher's Nucleotid.es Aims to be Comprehensive Registry of Genome Assemblers, Benchmarks

NEW YORK (GenomeWeb) – Nucleotid.es, a publicly available repository developed by a researcher at the Joint Genome Institute, aims to provide a comprehensive list of genome assemblers and associated benchmarks that will help researchers in the genomics community select and use the most appropriate assembly tools for their sequencing projects.

Nucleotid.es is developed and maintained by Michael Barton, a bioinformatics systems analyst at the JGI, who created it to make selecting an assembler less time-consuming and more automated. In the registry, each assembler is shown alongside benchmarks such as NG50, LG50, number of contigs, and number of incorrect bases. These benchmarks are based on tests run on an internal JGI dataset of 17 bacterial genomes, all of which have known reference genomes. Barton runs each assembler on the data to produce contigs and then compares each assembly to its respective reference using QUAST, an assembly quality assessment tool.
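For readers unfamiliar with the NG50 and LG50 benchmarks named above, the following is a minimal Python sketch of how the two numbers are commonly defined: NG50 is the length of the contig at which the running total of contig lengths, taken longest first, reaches half the reference genome length, and LG50 is the number of contigs needed to get there. The function name `ng50_lg50` and the toy contig lengths are illustrative assumptions; this is not the code behind Nucleotid.es or QUAST.

```python
from typing import Iterable, Optional, Tuple


def ng50_lg50(contig_lengths: Iterable[int],
              genome_length: int) -> Optional[Tuple[int, int]]:
    """Return (NG50, LG50), or None if the contigs cover less than
    half of the reference genome length.

    Hypothetical helper for illustration only.
    """
    target = genome_length / 2
    running_total = 0
    # Walk the contigs from longest to shortest, accumulating length.
    ordered = sorted(contig_lengths, reverse=True)
    for count, length in enumerate(ordered, start=1):
        running_total += length
        if running_total >= target:
            # `length` is NG50; `count` contigs were needed, which is LG50.
            return length, count
    return None


# Toy example (made-up lengths, not JGI data).
toy_contigs = [80, 70, 50, 40, 30, 20]
print(ng50_lg50(toy_contigs, genome_length=300))
```

Because the threshold is tied to the reference genome length rather than to the total assembly length, assemblies of the same genome can be compared on the same footing, which suits a benchmark built on genomes with known references.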

The assemblers themselves are made available as images for Docker, an open platform for building, sharing, and running applications. Each image includes all the necessary dependencies so that others can run the programs using the same parameters and configurations that the initial authors used to get their results.

Barton believes his resource complements community-wide efforts to evaluate assembly software such as Assemblathon and the Genome Assembly Gold-Standard Evaluations (GAGE). Assemblathon, for example, compares the performance of various assemblers on specified datasets and provides the results but does not provide details on how those results were obtained, he pointed out. It does not, for instance, offer details about what runtime options the original researchers used or whether the reads were processed prior to or after the assembly, he said. GAGE, by contrast, does include the software recipes used to generate the assemblies, so it's possible to reproduce the results, but it doesn’t address issues related to installing and running the software, Barton said. Docker images are simpler to install and run, he added, noting that a user can download an image and start running the assembler immediately.

Furthermore, both of these challenges provide something of a static view of the software, showing it as it existed when the challenge was initially run, Barton added. Nucleotid.es, by comparison, is more dynamic. Software developers can make the most recent iterations of their software available as Docker images, then get them benchmarked on the JGI data and made broadly available in the registry. And as new methods are developed and published, they can easily be benchmarked and added to the registry, he said.

Currently, the website has images and benchmarks for the SPAdes assembler, the Iterative De Bruijn Graph De Novo Assembler (IDBA), VelvetOptimiser, the Velvet assembler, Assembly By Short Sequences (ABySS), and the Short Oligonucleotide Analysis Package (SOAP). It also includes a ranking of the assemblers based on the metrics used for the benchmarks; for example, it lists SPAdes as the best assembler based on the NG50 metric.
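The idea of ranking assemblers by a chosen benchmark can be sketched in a few lines of Python. Everything here is assumed for illustration: the function `rank_by_metric`, the assembler names, and the numbers are placeholders, not values from Nucleotid.es, and the site's actual ranking method is not described in this article.

```python
from typing import Dict, List


def rank_by_metric(results: Dict[str, Dict[str, float]],
                   metric: str,
                   higher_is_better: bool = True) -> List[str]:
    """Order assembler names from best to worst on one metric.

    Hypothetical helper for illustration only.
    """
    return sorted(results,
                  key=lambda name: results[name][metric],
                  reverse=higher_is_better)


# Placeholder benchmark values (invented, not real results).
placeholder_results = {
    "assembler_a": {"ng50": 120_000, "incorrect_bases": 35},
    "assembler_b": {"ng50": 95_000, "incorrect_bases": 12},
    "assembler_c": {"ng50": 150_000, "incorrect_bases": 48},
}

# A larger NG50 is better; fewer incorrect bases is better.
print(rank_by_metric(placeholder_results, "ng50"))
print(rank_by_metric(placeholder_results, "incorrect_bases",
                     higher_is_better=False))
```

The two calls can produce different orderings, which is why a registry that reports several metrics side by side is more informative than a single overall ranking.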

Barton is actively seeking assemblers to test and include in the registry, as well as people interested in building Docker images of their assemblers and sending them to him for testing. He also welcomes external datasets that can be used to benchmark the assemblers and researchers willing to help with testing. So far, the response from members of the community to whom Barton has presented the method has been positive, he told BioInform. "I think people understand the need for solving this problem of not being able to reproduce assemblies or research or, if you are not a bioinformatician, [getting] started with loading … and running an assembler if you don’t have a lot of experience in that area."

Barton is currently working on providing more data about the assemblers, specifically details about memory usage and run times. He's also adding more assemblers to the registry and looking at running further tests with longer reads. Eventually, he plans to expand the repository to include other categories of software such as aligners, he said.
