NEW YORK – Researchers looking for circular RNA (circRNA) biomarkers in RNA-seq datasets should consider using an ensemble of bioinformatics tools, according to a recent benchmarking study.
Led by researchers from Belgium's Ghent University, an international coalition applied 16 different software-based circRNA identification tools to the same RNA-seq dataset generated from three cancer cell lines. Overall, the software programs were precise, but there was a wide range of sensitivity. "The range of detected circRNAs varied between 1,372 and 58,032 circRNAs per tool in a given cell type," the authors said. Half of all circRNAs were reported by just one tool, circseq_cup, developed by researchers at China's Zhejiang University.
And a combination of two high-precision tools increased the number of detected circRNAs substantially while keeping the number of false discoveries low, the authors said. They published their results earlier this month in Nature Methods.
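As a rough illustration of what combining two tools means in practice, the sketch below takes the union of two call sets. It assumes each circRNA is keyed by its back-splice junction coordinates; the tool names, coordinates, and the `combine_calls` helper are hypothetical and are not taken from the study.

```python
# Hypothetical back-splice junction calls: (chromosome, start, end, strand).
# These values are made up for illustration only.
tool_a = {("chr1", 100, 500, "+"), ("chr2", 2000, 2600, "-")}
tool_b = {("chr2", 2000, 2600, "-"), ("chr3", 50, 900, "+")}


def combine_calls(calls_a, calls_b):
    """Return the union of two tools' calls and the subset both tools report."""
    union = calls_a | calls_b   # every circRNA reported by at least one tool
    shared = calls_a & calls_b  # circRNAs reported by both tools
    return union, shared


union, shared = combine_calls(tool_a, tool_b)
print(len(union), "circRNAs in the union;", len(shared), "reported by both tools")
```

Taking the union raises the number of detected circRNAs, which only pays off if both tools are precise, since each tool's false positives are carried into the combined set.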
The results have already convinced Paul Boutros, a circRNA researcher at the University of California, Los Angeles who was not involved with the study, to expand the number of tools he uses at a time. Previously, he used CIRCexplorer, one of the tools benchmarked in the study, as his main workhorse. "We'll move to using an ensemble [of tools] almost immediately," he said. "Unambiguously, I think the study suggests everybody should."
"This work might represent a gold standard dataset for publishing novel circRNA identification methods, and it will also be a valuable guide for experimental validation of circRNAs for research groups working on complex disorders such as neurodegenerative diseases or cancer," Eduardo Andrés-Léon, a bioinformatician at Spain's Institute of Parasitology and Biomedicine López-Neyra, said in a statement.
Boutros agreed that the study filled a "critical gap" in the field. "Circular RNA quantification and detection methods haven't been well benchmarked at all," he said.
Circular RNAs are an abundant class of molecules found throughout all domains of life and are the predominant transcript isoform for hundreds of human genes. They serve various biological functions, such as acting as molecular sponges for RNA-binding proteins and microRNAs that might otherwise bind to linear RNAs. Previous studies based on ever-growing RNA-seq databases have found tens or even hundreds of thousands of circRNAs, and their ubiquity in different tumor types suggests that they could be used in diagnostics.
For example, in a 2019 study published in Cell, researchers at the University of Michigan profiled circRNAs in more than 800 tumor and other samples, yielding nearly 129,000 circRNAs in 40 cancer types. Their study also used CIRCexplorer, a program developed in 2014 by researchers at the Chinese Academy of Sciences.
Several startups have been launched to commercialize circRNA research, including Circular Genomics, which raised $4.5 million in seed funding in 2021. Earlier this week, GenomeWeb reported that the firm is planning to launch a diagnostic test for response to the depression treatment Zoloft (sertraline) based on a circRNA biomarker.
The benchmarking project was born in 2019 out of a search for a tool to use in work being conducted at Ghent University, which ballooned into an international collaboration. "After two years of workflow optimizations (interrupted by the COVID-19 pandemic), the most exciting time came in July 2021 when we were able to execute more than 6,000 RT–qPCR reactions and get a real view of the empirical validation rate of a large set of predicted circRNAs," first author Marieke Vromman said in a statement.
Of the 315,000 circRNA candidates identified by the tools, 1,560 were validated using three different orthogonal methods: RT-qPCR, resistance to RNase R treatment to confirm circularity, and amplicon sequencing. Based on the validation results, precision and sensitivity metrics were determined for each tool.
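The two metrics can be sketched as follows. This is a minimal example using one common set of definitions, which may differ from the authors' exact calculation: precision is the share of a tool's tested candidates that passed validation, and sensitivity is the share of all validated circRNAs that the tool reported. All identifiers and numbers here are hypothetical.

```python
def precision(n_validated, n_tested):
    """Fraction of a tool's tested candidates that passed validation."""
    if n_tested == 0:
        raise ValueError("no candidates were tested for this tool")
    return n_validated / n_tested


def sensitivity(tool_calls, validated_circrnas):
    """Fraction of all validated circRNAs that this tool reported."""
    if not validated_circrnas:
        raise ValueError("the validated set is empty")
    return len(tool_calls & validated_circrnas) / len(validated_circrnas)


# Hypothetical example: 3 of 4 tested candidates from one tool were validated,
# and the tool reported 2 of the 4 circRNAs validated across all tools.
tool_calls = {"circ_A", "circ_B", "circ_C"}
validated = {"circ_A", "circ_B", "circ_D", "circ_E"}

print(precision(3, 4))
print(sensitivity(tool_calls, validated))
```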
Two-thirds of all predicted circRNAs were novel compared to a set of previously reported circRNAs drawn from 13 databases.
Boutros said that he wished the authors had been clearer on how they blinded themselves to certain data to eliminate bias. He also pointed out that the tools were run by the developers, who would know how to tweak parameters to maximize results. "In a way, what they've benchmarked is the state-of-the-art but not the state of the field," he said. "It's a pragmatic approach, but I'd like to see follow-up, blinded validation from non-expert users."
He also noted that the study was limited to RNAs from cultured cells, so he'd like to see benchmarking studies looking at normal and cancerous tissues, and even biofluids, which present different challenges.
Based on their results, the authors generated guidelines for circRNA research. In addition to suggesting multiple discovery tools, "for circRNA validation, we advise using at least two orthogonal validation methods," they said. "Our study might also serve as an example framework for empirical validation of benchmarking results from other bioinformatics tools."
Their recommendations follow another recent publication in Nature Methods on best practice guidelines for purification, validation, detection, and inhibition of circRNAs from researchers at Denmark's Aarhus University.
The authors noted that they used only tools based on short-read RNA-seq data. Long-read tools, specifically those based on data from PacBio's IsoSeq protocol and Oxford Nanopore Technologies' direct RNA base detection capabilities, have been developed and could help evaluate full-length circRNA sequences.
"Ideally, new tools — for example, based on combinations of existing (short-read) circRNA detection tool strategies and new long-read tools — should be developed and properly validated to further boost precision and sensitivity," the authors said.
Boutros said the benchmarking results suggest the field is still ripe for new algorithms. Precision and sensitivity are usually at odds with each other, so the observed differences in sensitivity, but not in precision, were surprising, he said. "I'm not sure why it ended up that way, but it suggests there's space for new methods that make different precision and recall tradeoffs," he said.