Heralded as a revolutionary tool for transcriptomics when it first came on the scene a few years ago, RNA-seq has been steadily living up to its potential. This deep-sequencing approach to cDNA analysis, together with other gene expression profiling methods, continues to shed light on the complexity of the transcriptome, providing detailed transcript-level and isoform data on everything from bed bugs to solid tumors.
Peter Robinson, a researcher at the Institute for Medical Genetics at Charité-Universitätsmedizin in Berlin, published a paper in BMC Genomics in March that described just how effective RNA-seq is at fleshing out data on model organisms. Prior to that study, only 1,556 sheep genes were publicly available that corresponded to partial or complete mRNA sequences. But with RNA-seq, Robinson and his colleagues were able to produce partial or complete transcript sequences for 13,987 genes — all of which are now publicly available in the National Center for Biotechnology Information's Sequence Read Archive.
Robinson is quick to point out that this is only the beginning of a new era in transcriptomics. "It is not sufficient to study just the transcriptome — integrative experimental and computational strategies will be required to understand the role of transcriptional regulation as a part of cellular networks, and ultimately to understand the role of the transcriptome in health and disease," he says. "However, we can investigate many new parameters — such as transcriptome-wide alternative splicing — for the first time ever using RNA-seq, and many groups are now working on analysis strategies for this kind of data. In general, there is agreement that RNA-seq is a revolutionary technology that is likely to replace microarrays in the near future, but I think we are still learning to interpret all this data."
While some researchers, like Robinson, say that integrating various types of 'omics data with transcriptomic information is the way to exploit advances in cDNA analysis, RNA-seq is facilitating significant discoveries on its own. This is especially the case when it comes to identifying gene fusions and alternative splicing isoforms that contribute to cancer progression.
Last year, Memorial Sloan-Kettering Cancer Center's Michael Berger was part of a study that used RNA-seq to identify 11 novel melanoma gene fusions caused by genomic rearrangements and 12 novel read-through transcripts. The results of their effort could lead to new modes of target discovery as well as a template for transcriptome studies across multiple tumor types. "RNA-seq is the most direct way to get information about gene fusions, and there are several gene fusions that are very important in cancer and in diagnostics. Some examples are the BCR-ABL gene in leukemia and EML4-ALK in lung cancer, which is very important because there's a new ALK inhibitor that is on the verge of FDA approval," Berger says. "Being able to identify, for diagnostic purposes, gene fusions like this will be critically important. By applying RNA-seq systematically, we might be able to identify good targets moving forward."
When the project began in 2008, it was just emerging that gene fusions were prevalent in many solid tumors; historically, it had been accepted that they were only present in hematological malignancies such as leukemias. "A discovery was made five years ago that said gene fusions were common in solid tumor types like prostate cancer and lung cancer, but it had never been shown before that it occurred in melanoma," Berger says. "Once we generated the RNA-seq data, there were all types of other features we were interested in looking at, for instance the occurrence of alternative splicing, sequence variants, and using the RNA-seq data as a measure of expression levels for these genes."
Similar to the informatics challenges that genomics researchers have faced since the advent of next-generation sequencing, there is currently a dearth of efficient bioinformatics tools to help make sense of RNA-seq data, including for gene fusion identification studies. "There are definitely bioinformatics challenges because the alignment of reads to a reference is tricky as there is always the possibility of alternative splicing and all kinds of novel transcripts that aren't conducive to a linear reference sequence," Berger says. "Also, a lot of transcripts are redundant because there are different isoforms that are transcribed from the same gene and it's difficult to handle that with most of the widely [used] sequence aligners."
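The spliced-alignment problem Berger describes can be seen in a toy example: a read that spans an exon-exon junction will not align contiguously to the genome, but matches perfectly once the intron is spliced out. (This is a simplified sketch with made-up sequences; real spliced aligners handle this with split-read mapping and junction models.)

```python
# Toy illustration of why spliced reads defeat naive, contiguous alignment.
# Hypothetical genome layout: exon1 + intron + exon2.
exon1 = "ATGGCCAAA"
intron = "GTAAGT" + "CCCCCC" + "TTTCAG"   # starts GT, ends AG, as real introns do
exon2 = "GGGTTTCTA"
genome = exon1 + intron + exon2
transcript = exon1 + exon2                 # mature mRNA after splicing

# A 12-bp sequencing read spanning the exon1/exon2 junction.
read = transcript[5:17]

# Naive contiguous alignment against the genome fails...
print(read in genome)       # False: the intron interrupts the match
# ...but the same read aligns perfectly to the spliced transcript.
print(read in transcript)   # True
```

The same ambiguity cuts the other way for isoform redundancy: a read from a shared exon matches every isoform containing that exon, which is why junction-aware aligners have to consider more than one placement per read.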
Bioinformatics developers are making a concerted effort to provide the transcriptomics community with new tools. The deFuse algorithm, developed by a team at the Centre for Translational and Applied Genomics at the BC Cancer Agency in Vancouver, analyzes all alignments and possible locations for fusion boundaries — unlike existing methods that use only unique "best-hit" alignments and fusion boundaries at the ends of known exons. In a paper published in the May issue of PLoS Computational Biology, the developers of deFuse describe using the algorithm to identify gene fusions in RNA-seq data from 40 ovarian tumor samples and one ovarian cancer cell line — theirs was the first report of gene fusions in ovarian cancer.
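The distinction the developers draw, scoring every candidate fusion boundary rather than keeping only one "best-hit" alignment per read, can be sketched roughly as follows. The read data and voting scheme here are hypothetical; the published algorithm combines split reads and spanning read pairs in a probabilistic model.

```python
from collections import Counter

# Hypothetical split-read evidence: each read nominates one or more possible
# fusion boundaries, given as (position in gene A, position in gene B).
# A best-hit-only approach keeps a single alignment per read; considering all
# alignments lets ambiguously placed reads still vote for every boundary
# they are compatible with.
read_alignments = [
    [(120, 45)],                 # unambiguous read
    [(120, 45), (300, 45)],      # ambiguous read with two possible boundaries
    [(120, 45)],
    [(118, 47)],                 # read supporting a slightly different boundary
]

votes = Counter()
for alignments in read_alignments:
    weight = 1.0 / len(alignments)   # split each read's vote across its alignments
    for boundary in alignments:
        votes[boundary] += weight

best_boundary, support = votes.most_common(1)[0]
print(best_boundary, support)   # (120, 45) gathers the most support: 2.5 votes
```

Even this crude weighting shows why discarding ambiguous alignments loses signal: the second read is uninformative under a best-hit scheme but still contributes half a vote to the true boundary here.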
When it comes to choosing which cutting-edge gene expression analysis technology is the best fit for a particular experiment, there is no ideal scenario. Errors introduced by protocols or by the source of the RNA can affect both the quality of results and their interpretation. "There is no perfect protocol — no matter what you do you're going to be making sacrifices on something," says John Thompson, senior director of genomic research at Helicos BioSciences. "The short reads are nice and give you lots of quantitative data, but you don't get the connectivity across the whole message. With the longer reads, you get the splicing patterns, but it's a sacrifice of the number of reads you get."
Thompson published a PLoS One paper in May comparing various gene expression technologies, including RNA-seq, digital gene expression, the new high-throughput versions of the serial analysis of gene expression technique, cap-analysis gene expression, and paired-end ditag sequencing. The study concluded that no matter what the sequencing platform or RNA processing method, most are still subject to artifacts and biases. In addition, Thompson and his colleagues caution that just because newer transcriptome analysis technologies produce large amounts of data, researchers should not be lulled into believing they have a more accurate picture of gene expression.
Indeed, in December Thompson and his colleagues published a paper detailing research that explored the so-called "dark matter" RNA in human cells, and how technological limitations may have led some researchers to disregard the importance of this dark matter. By taking a whole-transcriptome shotgun sequencing approach, they found that dark matter RNA, as a percentage of the relative mass of all non-mitochondrial human RNA, is in fact larger than that of protein-encoding transcripts. They also identified the presence of long transcribed regions in intergenic space and illustrated how neoplastic formation is associated with the expression of these regions.
According to Thompson, techniques such as tiling arrays, cDNA tags, and massive cDNA sequencing identified the presence of this dark matter, but only depicted the non-coding RNA as a small percentage of the total RNA. This was because the majority of cDNA sequencing methods, which use polyA+ selected RNA or amplification methods, selectively omit crucial amounts of RNA. "People were saying dark matter RNA is not important. Well, what you do to your RNA will determine what you will get out, and if you have a protocol that does not look at everything, you're not going to see it," Thompson says. "There is a lot of stuff there and it seems to play a big role in development and cancer, but now the problem is that just seeing it doesn't tell you what it does, but what it does do is tell you that it is important because it's regulated, tissue-specific, and developmentally specific."
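Thompson's point about protocol-driven blind spots reduces to simple arithmetic: a selection step applied before sequencing determines what can possibly show up in the data. A minimal sketch, with an entirely hypothetical transcript pool and mass fractions:

```python
# Toy illustration of how polyA+ selection hides non-polyadenylated
# "dark matter" RNA before it is ever sequenced. All numbers are invented.
pool = [
    {"name": "mRNA-1", "polyA": True,  "mass": 30},
    {"name": "mRNA-2", "polyA": True,  "mass": 25},
    {"name": "dark-1", "polyA": False, "mass": 35},
    {"name": "dark-2", "polyA": False, "mass": 10},
]

total_mass = sum(t["mass"] for t in pool)
selected = [t for t in pool if t["polyA"]]   # the polyA+ selection step

# Fraction of RNA mass that is dark matter, before and after selection.
dark_fraction_true = sum(t["mass"] for t in pool if not t["polyA"]) / total_mass
dark_fraction_seen = (
    sum(t["mass"] for t in selected if not t["polyA"])
    / sum(t["mass"] for t in selected)
)

print(dark_fraction_true)   # 0.45: nearly half the input mass is dark matter
print(dark_fraction_seen)   # 0.0: none of it survives polyA+ selection
```

In this sketch nearly half the input RNA mass is non-polyadenylated, yet the sequenced library contains none of it, which is exactly the distortion a whole-transcriptome shotgun approach avoids.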
Researchers studying plant genomics are also looking to capitalize on the power of RNA-seq to explore which genes are expressed at specific points in time. An early proof of principle for RNA-seq in plant studies was a 2010 Nature Genetics paper that analyzed the maize leaf transcriptome using Illumina sequencing. The authors mapped more than 120 million reads to define gene structure and alternative splicing events across the maize leaf, from the stem or bundle sheath where the cells are newest, all the way to the tip of the leaf, where the cells are photosynthetically active. The data revealed a dynamic transcriptome that the authors hope will provide plant geneticists with a solid foundation for a systems biology approach to studying photosynthetic development.
"What we wanted to do was tap this developmental gradient that's present in grasses and most monocotyledons. Using RNA-seq, we could monitor the change in gene expression for essentially every gene that's expressed in that leaf along that developmental gradient," says Thomas Brutnell, an associate professor at Cornell University who co-authored the study. "It was one of the first RNA-seq experiments at the time, so we had to write scripts to do a lot of the processing. Now, of course, there's lots of software available, but at the time we had to figure out what was going to work and we decided that [RNA-seq] technology would be the best one."
For Brutnell's lab, the primary focus is currently on perfecting wet lab protocols as well as streamlining the computational workflow to make sense of their RNA-seq data. This involves hiring postdocs who not only understand the data coming out of the wet lab experiments, but who can also write Perl scripts to do the data processing. "Now that we've set up this RNA-seq pipeline, we've gone back and sampled 15 different points along that leaf for a lot higher resolution," Brutnell says. "We have each gene's expression represented by 15 data points, so there is a much higher resolution to look at the patterns of gene expression and identify cis-regulatory elements and upstream promoter elements that may be enriched in certain clusters, and also to look for transcription factors that may be enriched in certain clusters."
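The clustering step Brutnell describes amounts to grouping genes whose expression profiles across the 15 leaf sections rise and fall together. A rough sketch of that idea, using invented profiles and a plain Pearson correlation rather than any particular pipeline's method:

```python
# Group genes by how well their 15-point expression profiles track each other.
# Profiles are hypothetical read-count-like values from base to tip of the leaf.
import math

profiles = {
    "geneA": [1, 2, 4, 8, 12, 15, 18, 20, 22, 23, 24, 25, 25, 26, 26],  # tip-high
    "geneB": [2, 3, 5, 9, 13, 16, 19, 21, 22, 24, 25, 26, 27, 27, 28],  # tip-high
    "geneC": [26, 25, 24, 22, 20, 18, 15, 12, 8, 6, 4, 3, 2, 1, 1],     # base-high
}

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Genes correlating strongly with geneA land in its cluster; anticorrelated
# genes (high at the base, low at the tip) fall elsewhere.
for name, prof in profiles.items():
    r = pearson(profiles["geneA"], prof)
    print(name, "tip-high cluster" if r > 0.9 else "other", round(r, 2))
```

Once genes are grouped this way, the promoter regions within each cluster can be scanned for shared cis-regulatory elements, which is the enrichment analysis the quote refers to.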
Much in the same way that some agricultural companies currently use DNA sequencing to do high-throughput genotyping of virtually every seed that goes into a nursery, RNA-seq could also be used to enhance selective breeding. "Imagine if you had that capability for monitoring the RNA — you could look to see which genes are expressing and [which] make them more drought tolerant, so people are starting to do this now," he says. "There will be a discovery phase of trying to understand the network of genes that help a plant respond to things like drought, pathogen attack, and salt tolerance, and you could use that information to guide breeding programs."
Older techniques for cDNA sequencing, including cap analysis gene expression, or CAGE — a method for high-throughput sequencing of 5' ends of capped RNAs — are being reborn as higher-throughput incarnations. Developed by Piero Carninci, a team leader at the Omics Science Center at the RIKEN Yokohama Institute, nanoCAGE streamlines a transcriptomics workflow and allows researchers to experiment with smaller fractions of RNA, on the order of 10 nanograms. The original CAGE technique required large amounts of RNA that often cannot be extracted from refined samples, like tissue microdissections, but Carninci has been able to use nanoCAGE to characterize the transcriptome of homogeneously purified neurons.
Carninci and his colleagues plan to continue to refine nanoCAGE to meet some of these challenges and make the tool easily adaptable for users. "We want to push nanoCAGE down to [single-cell] sensitivity because nanoCAGE allows detection of a broader number of RNA transcripts, including non-polyadenylated RNAs, a class that includes a very large number of long non-coding RNAs," he says. "We are also standardizing the method to make libraries in 96-well format for high-throughput preparation of material for the next-generation sequencers. This is part of a broader project to standardize and simplify the CAGE technology to reach many scientists and these protocols will be suitable for the production of kits, which will help to spread the technology to a larger number of laboratories."
From a technology innovator's perspective, Carninci says that despite all the expression analysis tools that have cropped up as a result of next-generation sequencing, the major challenge for transcriptomics is the nature of RNA itself. A few RNAs are very abundant, while others are present at less than one copy per cell. Another issue is that functional RNA can be as short as 18 nucleotides or as long as hundreds of kilobases. It is practically impossible to obtain full-length cDNA molecules for most transcripts longer than 15 kilobases. This often requires investigators to employ multiple protocols, each with its own biases, for one sample — even RNA-seq has limitations in that it captures only fragments of longer RNAs.
These limitations aside, Helicos' Thompson says RNA-seq and related technologies will revolutionize transcriptomics research because they are so much better than array-based systems. But, researchers will likely have to deal with the protocol juggling act for some time. "I think that's why people didn't pay attention to the protocols initially, because no matter what you did it was better than the array data," Thompson says. "The bottom line is you have to be careful, because if you want to compare across labs and models, you have to understand what the differences are when you do different protocols. People did that with arrays over the years and really controlled things and made things so they were really highly comparable and the same thing has to happen with RNA-seq."