Q&A: Noam Shomron on Software for Analyzing MicroRNA Deep Sequencing Data


Researchers at Tel Aviv University have released miRNAkey, a software package that lets users process and analyze microRNA deep sequence data and generate detailed reports of differentially expressed miRNAs in paired samples.

In a paper published recently in Bioinformatics, the authors write that miRNAkey "is an intuitive tool for the implementation of the first steps of analysis of deep sequencing data obtained in miRNA sequencing experiments."

Users input their data files in FASTQ or FASTA format and for each file receive an Excel spreadsheet containing the results of the analysis, a detailed description of the analysis, plots of post-clipping read lengths and multiple alignment rates, and mapped files in SAM format, among other data files.

Analysis steps include locating and removing adaptor sequences, mapping reads to sequences stored in miRNA databases, counting the reads mapped to each miRNA, and converting those counts into RPKM (reads per kilobase per million mapped reads) values, which let users make comparisons across different samples. Other steps include quantifying differential expression of miRNAs between paired samples and generating summary information about the data, such as multiple-mapping levels.
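As a rough illustration of the RPKM conversion described above, the following Python sketch applies the standard RPKM formula to per-miRNA read counts. It is not taken from miRNAkey; the function names, the toy counts, and the choice to use the sum of the listed counts as the total number of mapped reads are all assumptions made for the example.

```python
def rpkm(read_count, length_nt, total_mapped_reads):
    """Reads per kilobase per million mapped reads for one miRNA."""
    if length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return read_count * 1e9 / (length_nt * total_mapped_reads)


def rpkm_table(counts, lengths):
    """Convert {miRNA: read count} into {miRNA: RPKM} for one sample.

    Assumption: the total number of mapped reads is the sum of the counts.
    """
    total = sum(counts.values())
    return {name: rpkm(n, lengths[name], total) for name, n in counts.items()}


# Hypothetical counts and lengths, for illustration only.
sample_counts = {"miR-A": 600, "miR-B": 400}
mirna_lengths = {"miR-A": 22, "miR-B": 20}
table = rpkm_table(sample_counts, mirna_lengths)
for name, value in table.items():
    print(name, round(value, 1))
```

Because the value is scaled by both sequence length and library size, the same miRNA can be compared between two samples that were sequenced to different depths.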

The developers also incorporated the Seq-EM algorithm, which was developed by researchers at Tel Aviv University and the University of California, Berkeley. Seq-EM is a maximum-likelihood, expectation-maximization algorithm that handles reads that map to more than one location in the reference sequences. In miRNAkey, Seq-EM is used "to optimize the distribution of multiply-aligned-reads among the observed miRNAs, rather than discarding them, as is commonly done in this type of analysis," the authors wrote.
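The general idea of distributing ambiguous reads by expectation-maximization can be sketched in a few lines of Python. This is a generic, textbook-style EM loop written for illustration, not the Seq-EM implementation itself; the function name, the iteration count, and the toy read set are assumptions.

```python
def em_allocate(read_candidates, n_iter=50):
    """Distribute reads among candidate miRNAs with a simple EM loop.

    read_candidates: list of lists; each inner list holds the names of the
    miRNAs that one read aligns to (one name for a uniquely mapped read).
    Returns {miRNA: expected read count}.
    """
    reads = [c for c in read_candidates if c]  # ignore reads with no hit
    names = sorted({m for cands in reads for m in cands})
    if not names:
        return {}
    # Start from equal relative abundances.
    theta = {m: 1.0 / len(names) for m in names}
    expected = {m: 0.0 for m in names}
    for _ in range(n_iter):
        # E-step: split each read in proportion to current abundances.
        expected = {m: 0.0 for m in names}
        for cands in reads:
            norm = sum(theta[m] for m in cands)
            for m in cands:
                expected[m] += theta[m] / norm
        # M-step: re-estimate abundances from the expected counts.
        total = sum(expected.values())
        theta = {m: expected[m] / total for m in names}
    return expected


# Hypothetical data: 3 reads unique to miR-A, 1 unique to miR-B, 2 ambiguous.
toy_reads = [["miR-A"]] * 3 + [["miR-B"]] + [["miR-A", "miR-B"]] * 2
allocation = em_allocate(toy_reads)
print(allocation)
```

In this toy case the two ambiguous reads end up assigned mostly to the miRNA that already has more uniquely mapped reads, instead of being thrown away or split evenly.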

Discarding these reads, which make up about 30 percent of mapped reads in human samples, can lead to a "significantly different and biased expression profile," the authors said.

According to miRNAkey's developers, other miRNA analysis tools, such as miRDeep, developed by researchers at the Max Delbrück Center, don’t provide differential expression analysis, while others, such as University of East Anglia's sRNA toolkit and the Center for Cooperative Research in Biosciences' miRanalyzer, involve many processing steps and are web-based, which imposes some restrictions on file sizes and requires a long time to upload data.

This week, BioInform spoke with Noam Shomron, one of the developers of miRNAkey and a co-author on the paper. Below is an edited version of the interview.

MicroRNA analysis is still relatively new and as far as I can see, there isn’t a lot of software that’s available for researchers. Tell me about the niche that miRNAkey fills.

We combined two emerging worlds in scientific research and technology. One of them is miRNA, which really gained momentum in the past few years as master regulators of many cellular processes involving or leading to human diseases. The other field is deep sequencing, which allows researchers to map tens of millions of reads and look ‘deep’ into their experiments. We started using deep sequencing to look at miRNA expression and saw an amazing amount of new and novel data. We wanted to make sense out of all of it and so we generated miRNAkey.

In our first experiment we took diseased and normal tissues, squeezed out the small RNAs from them, processed them on an Illumina [Genome Analyzer], and mapped all the small RNAs in these samples. Our output was tens of millions of reads of small RNAs per experiment, and the first thing we did was look at all the known small RNAs and compare them to the miRBase database, in order to receive a list of all miRNAs in both samples.

To understand which miRNAs have changed significantly we had to invoke a statistical measure and normalize the changes in order to get a p-value out of it. So the input into [miRNAkey] is millions of short RNA reads from experimental samples, and the output is actually a very user-friendly Excel file or table that lists how many miRNA reads one has in each sample, a statistical correction, a p-value, the name of the miRNA, and links to miRBase, PubMed, and other resources.

To answer your question, we needed a very easy and simple pipeline to process our data that could run on a single computer, not on a cluster. We have user-friendly software that you can easily and quickly install and run on your own Macintosh or Unix machine. You can run large amounts of data generated in experiments and get a very simple output understandable by any researcher.

Give me some background on why you developed miRNAkey.

It was purely a need. We had just initiated the genome laboratory at Tel Aviv University and I was interested in small RNA, so the first experiments we ran on our deep sequencers were derived from these types of experiments. When we finished sequencing, we had an enormous amount of data and no easy way to analyze it. We looked at common tools out there and we found that none of them really fit our needs. All of them were either too sophisticated or complicated to run.

We decided that if we want something done, we should do it ourselves. We required software that could be a part of a processing pipeline, where we would know what's inside but once we finished it could be used by others automatically with little intervention on the analysis part. The input would always be the deep sequence data, while the output had to be really easy to interpret and readable as we often process samples from various experiments from other researchers in other labs in Israel and abroad.

Interestingly enough, we call it miRNAkey because it's the key that opens the door to visualizing all the miRNAs, but also 'nakey' in Hebrew means 'clean' so it also cleans your sequences. So there is a twist to the name: it not only cleans your sequences but also opens the door to further experiments.

It seems to me that there are several individual applications for miRNA analysis but there aren’t as many software packages that bring all the applications together. Was that one of the reasons why you developed miRNAkey?

Yes, that’s the reason why we decided to develop [miRNAkey]. When we started there was only miRDeep out there, which didn’t meet our requirements. We were using [miRNAkey] and it took us a while to gain confidence in how well it performed. At the time we sent it for publication, we noticed that there were maybe five or six additional [miRNA analysis tools], which we mention in our paper.

What's the difference between your software and something like miRDeep, for example?

MiRDeep is different because it requires you to use some skills that not every biologist has. Our software has a graphical user interface, which means you can download it and you don’t need any programming skills for it. The screen is very simple and user friendly. You can double click on the files you upload, you have windows where you can select the data you want to compare, the different parameters for your analysis, and then you just press 'submit'. It runs and gives you a report.

Did you develop the different components of miRNAkey from scratch or did you use any existing open source applications?

We developed most of the applications from scratch. We are supported by the Burrows-Wheeler Aligner and other software for sequence alignment, but the rest of the components were designed from scratch.

You mentioned in the paper that one unique feature of miRNAkey is that you used the Seq-EM algorithm. Could you elaborate on why you chose to use that particular algorithm?

MiRNAs are very short sequences and when you align them to a reference sequence you might get ambiguous hits because a short sequence can hit different regions in the genome. What most software [tools] do is discard these sequences and use the remaining sequences. We try to rescue these reads, and use an algorithm that can tell us where they most likely come from. Once we figured that out, we did not have to discard them.

Using the Seq-EM algorithm, we were able to rescue more than 10 percent of our data, making it much more accurate because we were discarding fewer sequences by statistically deciding where they map even though we get several hits per one miRNA.

Can you describe some ongoing research projects that are using miRNAkey?

We run many projects with small RNAs such as infecting human cells with different viruses and trying to understand how small RNAs from the virus fight the host and how the host fights the virus with small RNAs. We also have patient tissues from different stages of cancer development, and samples from patients with brain disorders.

Give me a feel for what's going on in the miRNA analysis field.

Researchers moved from [microarray] chips and qPCR platforms to deep sequencing because they wanted to look ‘deep’ into their sequences and observe the genuine small RNA sequences rather than a relative comparison of them.

These experiments led to exciting findings. It now seems that a miRNA is not exactly what is catalogued in the miRBase database. There might be different isoforms of the same miRNA. The miRNA sequence might shift slightly up- or downstream by one or two nucleotides. So if you profile miRNAs on a microarray or with qPCR, you won't be able to detect the ones that have shifted because you aim for high stringency with those technologies. Using deep sequencing, you receive the complete picture because you not only receive the precise levels of miRNA but you also receive the ones that are slightly out of sequence though they might still have a functional cellular role. We are now working on incorporating this information into the next version of miRNAkey.
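The one- or two-nucleotide shifts described here can be pictured with a small Python sketch that compares a read's position in a precursor sequence with the annotated mature coordinates. This is only an illustration of the idea, not part of miRNAkey; the function name, the 0-based half-open coordinates, the `max_shift` parameter, and the toy sequence are assumptions.

```python
def isomir_offsets(read, precursor, mature_start, mature_end, max_shift=2):
    """Return (5' shift, 3' shift) of a read relative to the annotated
    mature miRNA, or None if no exact match lies within max_shift.

    Coordinates are 0-based and half-open on the precursor sequence.
    A shift of 0 means the end matches the annotation; positive values
    mean the end lies downstream, negative values upstream.
    """
    pos = precursor.find(read)
    while pos != -1:
        shift5 = pos - mature_start
        shift3 = (pos + len(read)) - mature_end
        if abs(shift5) <= max_shift and abs(shift3) <= max_shift:
            return shift5, shift3
        pos = precursor.find(read, pos + 1)  # try the next occurrence
    return None


# Made-up precursor; the "annotated" mature form spans positions 5 to 27.
toy_precursor = "GGCCTTAGCATCGATTGCAAGTCCGATAGGCTAACC"
canonical = toy_precursor[5:27]
shifted = toy_precursor[6:28]  # same length, moved one nucleotide downstream
print(isomir_offsets(canonical, toy_precursor, 5, 27))
print(isomir_offsets(shifted, toy_precursor, 5, 27))
```

A read that exactly matches the catalogued sequence reports no shift at either end, while a slightly displaced isoform reports small nonzero offsets and could be counted separately.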