For decades, a biologist could think of a gene as a protein-coding piece of the genome. There were exceptions, although these noncoding genes encoded RNA that tended to have housekeeping or structural roles. It was obvious that the non-protein products were important for the cell, but it was hard — for some of us, anyway — to imagine these RNAs playing many cool regulatory roles in processes like development or the progression of disease. As a result, we could concentrate on protein-coding genes when compiling gene sets, designing microarrays, and doing other large-scale biology.
But with the discovery of small RNAs such as microRNA (miRNA) and Piwi-interacting RNA (piRNA), we've had to expand our thinking. With the increasing evidence for large-scale transcription of things that we don't normally call genes, our regulatory models and systems-level approaches are getting even more complex. Here we discuss some of our experiences with the integration of these noncoding RNAs into our genome-scale analyses. As with others who model natural processes, on one hand we're disappointed to have to increase the complexity of our models, but on the other hand we're pleased to know that these ever more complex models are getting closer to reality.
When the first miRNA (lin-4) was discovered in C. elegans in 1993, it looked like an exceptional observation that could make worms proud. But seven years later, it was clear that lin-4 had some company; let-7 was found in C. elegans, and homologs were found in lots of different creatures, including humans. While researchers were figuring out how these miRNAs got created and did their work, a lot more were discovered, either experimentally or computationally, leading us to realize that in some species, miRNA genes could make up more than 1 percent of all the genes. It also became apparent that many miRNAs could downregulate levels of mRNA or protein in a gene-specific manner. So in less than a decade we gained hundreds of examples of a new type of gene, a novel RNA gene product of about 22 nt, and a whole new mechanism of regulating mRNA and protein.
Just a couple of years ago, a new class of 25- to 30-nt RNAs was discovered and came to be known as piRNAs. These also regulate gene expression, although right now the breadth of their impact is not so clear; they play key roles in germline development and possibly quite varied roles in somatic cells. Other types of recently discovered small RNAs include 21U-RNAs and snoRNAs, and it's pretty easy to imagine other sorts of small RNA genes hiding in our genomes. It's also increasingly obvious that a lot more than what we currently call "genes" gets transcribed, so what's going on with all of those regions of the genome? Are they genes? Whether we want to call them genes or not, what are they doing?
New methods for RNA study
So what's our concern with all of this? First, we started out with the idea that proteins are the key players. But now we find all these other genetic entities that couldn't care less about the genetic code; they play by other rules and must talk in some other sort of code. Second, we tried to get a comprehensive gene set for each species, and we were almost there (being optimistic as we are) with protein-coding genes, but now we have all these new miRNAs and other entities and may even have to go back and reconsider our definition of "gene." Third, we really wanted to be able to quantify the level of every key player (our beloved proteins), but it was just too hard to make antibodies for all of them. Fortunately, we could quantify the next best thing, mRNA, keeping our fingers crossed that transcript abundance is a good approximation of protein abundance. But here come miRNAs that can regulate the levels of mRNA or protein. In the latter case, mRNA abundance can become a poor proxy for protein abundance, and microarrays are in the dark about this whole other level of regulation. And just as bad, we don't even know how much of an issue this is (in other words, the prevalence of each miRNA in our favorite cells) because the wonderful array repositories are full of data that ignores their presence. Lastly, we can't just add more spots to our popular microarrays, as mature miRNAs are too short for even the shortest oligonucleotide probes.
Finally, finding promoters of miRNAs is not trivial, as we can't just go upstream from the mature miRNA or even its stem-loop precursor, since primary miRNA transcripts don't hang around long enough. Fortunately, scientists have designed new gene-finding methods, developed new technology to quantify these short and not-so-short RNAs, and expanded regulatory models. As with identifying protein-coding genes, it's easiest to be confident about novel genes when they're conserved across multiple species, but it can be really tricky to identify species-specific genes. Now we have to be sure that our gene sets include these noncoding genes, so instead of just using RefSeq sequences with NM_* identifiers, we also have to go to miRBase to get the most recent miRNA genes and probably other databases to get other types of noncoding RNAs. Histone modification maps have recently been applied to identify miRNA transcriptional start sites, so now we can get a handle on their promoters. Enough miRNAs have been reliably identified that array manufacturers can design miRNA arrays, and statisticians are taking another look at normalization, since miRNA arrays violate some of the assumptions of the "whole genome" arrays. Perhaps even more exciting is the current development of high-throughput sequencing, which can help identify and quantify these relatively novel (as well as traditional) RNAs without needing a reference gene set for probe design. These methods permit an unprecedented look at the transcriptome without having to resort to huge sets of tiling microarrays. Meanwhile, yeast biologists can feel a little left out, with no miRNAs to call their own. What sorts of fungal-specific methods of regulation will they find hidden in their genomes?
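To make the gene-set point concrete, here is a minimal Python sketch of how a combined set might be assembled. It is only an illustration, not a description of any particular pipeline: it assumes the RefSeq convention that NM_ accessions are mRNAs and NR_ accessions are noncoding RNAs, and it assumes that a list of miRNA names has already been obtained from miRBase by some other means. The function names and the placeholder accessions are hypothetical.

```python
from collections import defaultdict

# Assumption: RefSeq accession prefixes (NM_ = mRNA, NR_ = noncoding RNA).
# Other prefixes exist; anything unrecognized is labeled "other" here.
REFSEQ_PREFIXES = {
    "NM_": "protein-coding mRNA",
    "NR_": "noncoding RNA",
}


def classify_accession(accession):
    """Return a coarse label for a RefSeq accession based on its prefix."""
    for prefix, label in REFSEQ_PREFIXES.items():
        if accession.startswith(prefix):
            return label
    return "other"


def build_gene_set(refseq_accessions, mirna_ids):
    """Group RefSeq accessions by type and add miRNA names as their own class.

    mirna_ids is assumed to come from a separate source such as miRBase;
    this sketch does not download or parse anything.
    """
    gene_set = defaultdict(set)
    for accession in refseq_accessions:
        gene_set[classify_accession(accession)].add(accession)
    gene_set["miRNA"].update(mirna_ids)
    return dict(gene_set)


if __name__ == "__main__":
    # Placeholder accessions, made up for illustration only.
    refseq = ["NM_EXAMPLE_1", "NM_EXAMPLE_2", "NR_EXAMPLE_3"]
    mirnas = ["hsa-let-7a-1", "hsa-mir-21"]
    for label, members in sorted(build_gene_set(refseq, mirnas).items()):
        print(label, sorted(members))
```

Keeping each class of gene under its own label makes it easy to see at a glance how much of a gene set is still protein-coding only.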
What is being regulated by miRNAs and piRNAs? PiRNAs help keep transposable elements from going wild, whereas miRNAs regulate specific mRNAs or proteins. But which ones? Figuring out all the targets of each miRNA is an ongoing area of research, and so far most of the target sites have been identified in the 3' untranslated regions (UTRs) of protein-coding genes. For less well-characterized transcripts, it's pretty hard to simply find the full-length coding region, but finding the full-length UTRs is even harder. Regardless, good UTR annotation is a prerequisite for predicting or explaining miRNA targeting, so complete gene annotation (beyond just the protein-coding bits) is as important as ever. Nevertheless, we can at least start to add this current miRNA target data to networks of regulators like transcription factors and get a more comprehensive look at the control of transcript and protein levels.
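To show why UTR annotation matters for target prediction, here is a deliberately simplified Python sketch. It assumes one common heuristic, which real prediction tools refine considerably: a candidate site is a stretch of the 3' UTR that is perfectly complementary to the miRNA "seed," taken here as nucleotides 2 to 8 of the mature miRNA. The function names and the toy UTR are hypothetical; the let-7a sequence is the mature human let-7a.

```python
# Complement table for RNA bases.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence."""
    return rna.translate(COMPLEMENT)[::-1]


def seed_site(mirna, start=2, end=8):
    """Return the UTR sequence that would pair with the miRNA seed.

    Assumption: the seed is nucleotides start..end (1-based, inclusive)
    of the mature miRNA; 2..8 is one commonly used definition.
    """
    seed = mirna.upper().replace("T", "U")[start - 1:end]
    return reverse_complement(seed)


def find_seed_matches(mirna, utr):
    """Return 0-based start positions of perfect seed matches in a UTR.

    DNA input is accepted; T is converted to U before searching.
    """
    utr = utr.upper().replace("T", "U")
    site = seed_site(mirna)
    positions = []
    index = utr.find(site)
    while index != -1:
        positions.append(index)
        index = utr.find(site, index + 1)
    return positions


if __name__ == "__main__":
    let7a = "UGAGGUAGUAGGUUGUAUAGUU"  # mature human let-7a
    toy_utr = "AAACTACCTCAGGGCTACCTCTT"  # made-up UTR fragment
    print(seed_site(let7a), find_seed_matches(let7a, toy_utr))
```

A search like this can only find sites in sequence that has been annotated as UTR in the first place, which is the practical reason incomplete UTR annotation limits target prediction.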
On the subject of genome annotation, what can we make of the RNA from widespread transcription outside of, or antisense to, known genes? This is a discovery that could have even broader implications for genome function and the regulation of mRNA abundance, whether or not these transcripts end up being "genes." There are plenty of challenges in discovering all of these different types of RNA and figuring out what they do and how they do it.
Even though we have to add other levels of complexity to our models of transcript and protein regulation — and make sure we include these different flavors of noncoding RNAs in our gene sets — getting a handle on all of this new RNA will keep a lot of biologists very interested and busy. Likewise, it looks like our genomes and those of our fellow eukaryotes may still have a lot of quite interesting noncoding RNA bits hidden in them.
Fran Lewitter, PhD, is director of bioinformatics and research computing at Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a senior bioinformatics scientist in Fran's group.