CHICAGO – Computational biologists are trying to improve the evaluation of bioinformatics technology by promoting continuous benchmarking of software to supersede static benchmarking tools.
An early effort, still in development and testing, is Renku, a free, open-source software platform that includes version control, continuous integration and continuous delivery (often abbreviated CI/CD), and containerization. The tool features a knowledge graph that tracks the inputs, code, and outputs of workflows and automatically updates as new data is introduced.
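In rough terms, the "continuous" part works like a CI pipeline: when a registered input changes, the affected steps are re-executed and the results refreshed. A minimal Python sketch of that idea (not Renku's actual implementation; the file names and the run_benchmark step are hypothetical):

```python
import hashlib
import json
from pathlib import Path

# Hypothetical input files and manifest location; a real setup would track
# every dataset, script, and result registered with the benchmark.
MANIFEST = Path("manifest.json")
INPUTS = [Path("counts.csv"), Path("metadata.csv")]


def file_hash(path: Path) -> str:
    """Content hash used to detect whether an input has changed."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_inputs() -> list:
    """Return the inputs whose content differs from the last recorded run."""
    previous = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    return [p for p in INPUTS
            if p.exists() and previous.get(str(p)) != file_hash(p)]


def run_benchmark(inputs) -> None:
    """Placeholder for re-executing the affected benchmark steps."""
    print("Re-running benchmark for:", [str(p) for p in inputs])


if __name__ == "__main__":
    stale = changed_inputs()
    if stale:
        run_benchmark(stale)
        # Record the new state so the next run reacts only to fresh changes.
        MANIFEST.write_text(json.dumps(
            {str(p): file_hash(p) for p in INPUTS if p.exists()}))
```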
Renku is a project of the Swiss Data Science Center. It is being touted by Mark Robinson and colleagues at the Swiss Institute of Bioinformatics and the University of Zurich, though they were not directly involved in its creation. However, Robinson noted, they are "somewhat of a power user" of Renku. As such, he added, "we're digging right into some of the technical details of the functionality that we need for our project, and it's been helping us a long time."
That project is to apply Renku to open, continuous benchmarking of bioinformatics software. Without continuous updating, benchmarks can quickly become outdated.
Anthony Sonrel, a doctoral researcher in statistical bioinformatics at the University of Zurich, is working with Almut Lütge, a Ph.D. student in Robinson's group, to test and improve the Renku-based benchmarking platform. Sonrel said that the framework includes data components, the methods to be evaluated, metrics for evaluation, and a dashboard to display results.
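In outline, such a benchmark amounts to a declarative collection of those parts. A schematic Python sketch of how they could be wired together (the dataset, method, and metric names are invented for illustration and are not the group's actual code):

```python
from statistics import mean

# Hypothetical benchmark declaration: datasets, methods to evaluate, and
# metrics to score them with. In the real system each of these would be a
# separate, containerized component; plain functions stand in for them here.
datasets = {"sim_small": [1.0, 2.0, 3.0, 4.0]}

methods = {
    "method_a": lambda xs: [x * 2 for x in xs],
    "method_b": lambda xs: [x + 1 for x in xs],
}

metrics = {
    "mean_output": mean,
}

# The "dashboard" reduces to a table of (dataset, method, metric) -> score.
results = {
    (d_name, m_name, s_name): metric(method(data))
    for d_name, data in datasets.items()
    for m_name, method in methods.items()
    for s_name, metric in metrics.items()
}

for key, score in sorted(results.items()):
    print(key, round(score, 2))
```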
Other benchmarking tools do exist. "The problem is that when benchmarks are done, it's always a snapshot, it's always at a certain time point, and then it's very difficult to make it continuous," Robinson said. Additionally, earlier benchmarking tools such as Omnibenchmark, PipeComp, CellBench, and Open Problems were difficult to extend to include new methods to evaluate, he said.
Robinson said that authors of other benchmarking applications also "make all the decisions" on how to rank methods. "This app allows the user to make those choices," he said.
However, the senior author of a recent Association of Biomolecular Resource Facilities study on benchmarking of next-generation sequencing platforms told GenomeWeb last month that data from that study could be applied to bioinformatics tools.
Robinson was among the developers of PipeComp, which evaluates computational pipelines.
Robinson said that he also sees potential in Renku to manage collaborative data analysis projects in the future. Renku is also serving as a teaching tool at SIB because the technology helps students learn software like RStudio and Python without having to worry about managing all the different installations of those packages at the institute.
Robinson and colleagues presented an abstract at the virtual Intelligent Systems for Molecular Biology and European Conference on Computational Biology conference in July and offered more details in a subsequent interview with GenomeWeb.
"Currently, evaluations are made in fits and starts, and any method developed in the future will make current benchmarks incomplete or even deprecated," the abstract said. "Furthermore, dependency of benchmarking conclusions on method parameter settings constitutes an additional burden for the scientific community that struggles to find a consensus approach."
Robinson and colleagues called Renku a "lightweight entry point for the scientific community to engage and extend a benchmark while delivering the most up-to-date recommendation."
The Swiss researchers applied RNA sequencing as a proof of concept for the technology. "[T]he fast pace of new method development [in RNA-seq] illustrates the outlined limitations of the current benchmarking approaches," they wrote in the poster.
Robinson counted more than 1,000 informatics applications currently available to analyze RNA-seq data.
Sonrel said that the Zurich researchers identified about 64 different benchmarks in single-cell RNA-seq, with significant overlap among those benchmarks. "For instance, we have seven different benchmarks on clustering, seven different benchmarks on differential expressions, and seven different benchmarks on dimension reduction, each one having a different approach because we don't have any ... standards," he said.
Sonrel called Renku a "platform for reproducible science and reproducible analysis," featuring Jupyter notebooks, Git version control, Docker for packaging apps into containers, and Kubernetes for managing and deploying Docker containers. Each metric, model, and block of data is a different repository that can be compartmentalized, Sonrel explained.
"Everything will be connected to the rest of the methods and the metrics from [the Renku] knowledge graph so you don't have to have one single Docker for all of your benchmarking framework and have tons of dependencies," Sonrel said. Adding a component such as a data repository is simple, and the new information is then evaluated along with existing parts.
Each user can have personalized interactive sessions and versioning of the underlying code. "But I guess the biggest advantage of Renku is this knowledge graph … which tracks every input, every output, every code which is used," Sonrel said. "Each time you have an output, you know which script uses it with which parameters, when it was done, and so forth."
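A knowledge graph of that kind boils down to provenance records that link each output back to the code, parameters, and inputs that produced it. A rough Python sketch of one such record (the field names are invented for illustration, not Renku's schema):

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    """One node in a provenance graph: which script produced which output."""
    script: str        # the code that was executed
    parameters: dict   # parameters the script was run with
    inputs: list       # paths or IDs of input artifacts
    outputs: list      # paths or IDs of produced artifacts
    executed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


# Example: record that a hypothetical clustering script produced a result file.
record = ProvenanceRecord(
    script="cluster_cells.py",
    parameters={"k": 8, "seed": 42},
    inputs=["data/sim_counts.csv"],
    outputs=["results/clusters.csv"],
)

print(asdict(record))
```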
Renku offers templates for scripts to integrate new datasets as well as for reports, and results update automatically. "If the developers or the benchmarkers want to add any new components to the framework, they can simply bring code," Sonrel said.
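The idea that contributors can "simply bring code" implies that a new method only needs to conform to a small, agreed-upon interface, with the templates handling the rest. A hypothetical example of what such a contributed module might look like (the run signature and file formats are assumptions, not the project's actual template):

```python
import csv
from pathlib import Path


def run(input_path: str, output_path: str, **params) -> None:
    """Hypothetical entry point a contributed method would expose.

    Reads a table of values, applies a trivial transformation standing in
    for a real analysis method, and writes the result where the framework
    expects to find it.
    """
    rows = list(csv.reader(Path(input_path).open()))
    scale = float(params.get("scale", 1.0))
    transformed = [[float(v) * scale for v in row] for row in rows]
    with Path(output_path).open("w", newline="") as fh:
        csv.writer(fh).writerows(transformed)


if __name__ == "__main__":
    # Tiny self-contained demo; in practice the framework supplies real data.
    Path("data").mkdir(exist_ok=True)
    Path("results").mkdir(exist_ok=True)
    Path("data/example_input.csv").write_text("1,2\n3,4\n")
    run("data/example_input.csv", "results/example_output.csv", scale=2.0)
```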
"This is a constantly updating system," Robinson said.
Serghei Mangul, a computational biologist at the Institute for Quantitative and Computational Biosciences at the University of California, Los Angeles, said that benchmarking can fall into a "self-assessment trap," meaning that those who develop a tool are often the ones who evaluate it. This, he said, biases the evaluation toward the scenarios where the developer's tool is strongest and often leads creators to neglect its weaknesses.
Mangul said that a strong benchmarking tool would establish the "ground truth," the term used in statistics and machine learning for the known correct answer against which accuracy is measured. Sometimes, he said, benchmarking apps that rely on simulated data cannot establish a real ground truth.
"[Even] if you do a great job benchmarking, if you're lacking the ground truth, like you only have simulated ground truth, your results are relevant, but let's say they're not ideal," Mangul said.
He likes Renku's method of using a knowledge graph to benchmark, as well as the continuous benchmarking. "That's probably the best setup you can make for your benchmark," said Mangul, who has done some preliminary work with Robinson on whether continuous benchmarking was feasible, though they have not yet collaborated on Renku.
Mangul was the corresponding author of a review article on the benchmarking of omics computational tools that appeared in Nature Communications in 2019. That article laid out some parameters of what should be in a good benchmarking tool.
He saluted the Swiss researchers for even making the effort to raise the issue of benchmarking, but said that proper benchmarking of bioinformatics technology could take years. "We need to do a good job," Mangul said.
Robinson said the Renku-based benchmarking platform is still in the prototype phase and not yet ready for a general release to the bioinformatics community, though he expects it to become a "big, all-encompassing project." He has applied for a grant to conduct a five-year study.
"The main point right now is to get a bit of feedback, get an idea of how the community would take this up," Robinson said.
"We're kind of in a startup phase, so there's still more to come here," Robinson said.
The main goals of benchmarking are to make sure that bioinformatics tools process data in a consistent manner and to find weaknesses in current methods.
"The ultimate goal is, I guess, not the benchmark itself," Robinson explained. "It's more to get improved methods out to people analyzing their data. It's kind of a steppingstone to method improvement."