NEW YORK (GenomeWeb) – ENCODE data sets contain more than 90,000 sequences that could form circular RNAs (circRNAs), according to a recently published study. Scientists at the Computational Genomics Lab at the Beijing Institutes of Life Science found those sequences with a new algorithm they wrote to detect circRNAs, which they say improves upon previous methods of detecting the genetic transcripts.
RNA molecules with covalently linked ends, forming a circle, have been found in all domains of life in myriad sizes, coming from distinct sources in the genome. How they're made isn't perfectly clear, but exon scrambling during RNA splicing could be one way to create them. High-throughput RNA sequencing studies have found an abundance of stable circRNAs, and their evolutionary conservation across species suggests they serve an important function. A subset of circRNAs has been shown to act as microRNA sponges, but until many more of them are detected and validated, how they form and what functions they fulfill may remain a mystery.
The algorithm, described in a paper published last week in Genome Biology, is able to find circRNAs both in data sets created specifically to look for them and in RNA sequencing data sets generated by large projects, such as ENCODE. The CircRNA Identifier (CIRI) program scans SAM files for junctions between scrambled exons that might indicate candidate sequences.
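To make the idea of a "junction between scrambled exons" concrete, here is a minimal Python sketch of how such a read might be spotted in SAM records. It is not CIRI's code. It assumes an aligner that reports a split read as two records whose CIGAR strings carry soft (S) or hard (H) clips, and the helper names (`parse_cigar`, `parse_sam_line`, `is_backsplice_candidate`) are hypothetical. The test is simply whether the left part of the read maps downstream of the right part on the same chromosome and strand, the reverse of a normal splice.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Return (leading_clip, trailing_clip) lengths from a SAM CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if not ops:
        return 0, 0
    lead = ops[0][0] if ops[0][1] in "SH" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] in "SH" else 0
    return lead, trail

def parse_sam_line(line):
    """Extract the few mandatory SAM fields this sketch needs."""
    f = line.rstrip("\n").split("\t")
    return {"name": f[0], "flag": int(f[1]), "chrom": f[2],
            "pos": int(f[3]), "cigar": f[5]}

def is_backsplice_candidate(a, b):
    """True if two alignments of one read look like a back-splice junction.

    Illustrative rule (an assumption, not the published method): both
    segments lie on the same chromosome and strand, their clips are
    complementary, and the left part of the read maps downstream of
    the right part.
    """
    if a["chrom"] != b["chrom"]:
        return False
    if (a["flag"] & 16) != (b["flag"] & 16):  # bit 0x10: reverse strand
        return False
    lead_a, trail_a = parse_cigar(a["cigar"])
    lead_b, trail_b = parse_cigar(b["cigar"])
    # The segment with the smaller leading clip covers the left of the read.
    if lead_a < lead_b:
        head, head_trail, tail, tail_lead = a, trail_a, b, lead_b
    else:
        head, head_trail, tail, tail_lead = b, trail_b, a, lead_a
    if head_trail == 0 or tail_lead == 0:
        return False  # clips are not complementary
    return head["pos"] > tail["pos"]

# Hypothetical 100-base read: first 60 bases map downstream of the last 40.
head = parse_sam_line("r1\t0\tchr1\t5000\t60\t60M40S\t*\t0\t0\t*\t*")
tail = parse_sam_line("r1\t2048\tchr1\t1000\t60\t60H40M\t*\t0\t0\t*\t*")
print(is_backsplice_candidate(head, tail))
```

A read that passes this check is only a candidate. The example records above are invented for illustration and do not come from the study.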
The algorithm's advantages are twofold. "A significant proportion of junction reads could be missed" by the two existing circRNA detection algorithms, which depend on either annotated data sets or data sets biased towards circRNAs, the authors said. And because other natural mechanisms can create "junction-like reads," false positives are likely. CIRI uses a second scan to filter out false positives, checking candidates for genomic signals, such as whether GT-AG splicing signals are missing, and using genome mapping statistics.
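As a rough illustration of the GT-AG check, the Python sketch below tests whether a candidate circle is flanked by the canonical intron boundaries: an AG acceptor just upstream of its start and a GT donor just downstream of its end, or their reverse complements on the minus strand. The function name, the 0-based half-open coordinates, and the plain-string genome are assumptions made for the example. The article does not describe how CIRI implements this filter.

```python
def has_gt_ag_signal(genome, start, end, strand="+"):
    """Check for canonical splice signals flanking a candidate circRNA.

    genome : reference sequence as a plain string (assumption for the sketch)
    start, end : 0-based, half-open interval [start, end) of the candidate
    strand : "+" or "-"
    """
    if start < 2 or end + 2 > len(genome):
        return False  # not enough flanking sequence to inspect
    upstream = genome[start - 2:start].upper()
    downstream = genome[end:end + 2].upper()
    if strand == "+":
        # Intron before the circle ends in AG; intron after it begins with GT.
        return upstream == "AG" and downstream == "GT"
    # Minus strand: the same signals appear reverse-complemented.
    return upstream == "AC" and downstream == "CT"

# Toy sequence: flanking "AG" and "GT" surround an 8-base candidate.
toy = "CCCAG" + "ATGCATGC" + "GTCCC"
print(has_gt_ag_signal(toy, 5, 13))
```

In a filter of this kind, a candidate whose flanks lack these signals would be treated as a likely false positive, subject to whatever other evidence the mapping statistics provide.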
The three researchers – Yuan Gao, Jinfeng Wang, and Fangqing Zhao – ran the algorithm through a series of tests to validate it. For data sets smaller than 5 gigabytes of RNA-seq data, the program takes less than half an hour; for the largest data sets in the study, it takes less than 24 hours.
First, they ran it on simulated data sets, establishing the basic performance parameters. "CIRI showed good performances for all simulated data with different read lengths and sequencing depths," the authors said. CIRI could detect 70 percent of circRNAs at gene expression coverage levels as low as threefold. They also determined that CIRI is most efficient for read lengths ranging from 60 to 100 base pairs.
After validating the algorithm on samples treated with RNase R (which digests linear RNAs but not circular RNAs) to show it was finding real circRNAs and not false positives, the scientists compared it with the existing detection algorithms.
Using the same RNase R data set, CIRI outperformed the two other algorithms, the authors said, primarily because it has a lower false detection rate.
Finally, the scientists used CIRI on real data sets. In a data set from neutrophils and CD19+, CD34+, and HEK293 cells previously explored by another study, CIRI found more than 1,000 more circRNAs than the other algorithms did. In addition, 22 circRNAs in that data set had been validated, and CIRI found all 22 of them. In ENCODE transcriptome data from 15 different cell lines, CIRI found 98,526 circRNAs, 75 percent of which had been suggested by a previous study of circRNAs in ENCODE data sets.
While most circRNAs were thought to come from exons, CIRI found evidence of intronic or intergenic circRNAs, including those from anti-sense regions of DNA. In the ENCODE data sets, 19.2 percent of candidate circRNAs were in intronic regions and 5.0 percent in intergenic regions.
Because it is able to detect new kinds of circRNAs, CIRI "will be a powerful tool for detection and annotation of circRNAs and helpful for further exploration of the RNA molecules," the authors said.