RNA-seq Technical Guide

Table of Contents

Letter from the Editor
Index of Experts
Q1: What are the advantages of using RNA-seq?
Q2: What RNA purification method do you use, and why?
Q3: What library preparation method do you use, and why?
Q4: What strategies do you use to improve dynamic range?
Q5: What quality-control steps do you include?
Q6: What tools do you use, and why, to analyze your RNA-seq data?
List of Resources

Letter from the Editor

Sometimes you have no idea what you are looking for in a gene expression study. You just want to see what is going on and don't want to be limited to what's already known; those unknown genes can have interesting activity too. In that case, an RNA-seq experiment might be for you.

For these studies, all the RNA you're interested in is converted to cDNA, fragmented, and sequenced, usually with a next-gen sequencer. This approach, then, doesn't rely on the investigator's prior knowledge or on which probes are available, and it can reach previously unattainable resolution with less background noise than, say, a microarray.

RNA-seq, though, is a fairly new approach and investigators are still working out all the kinks. So that's what we focus on in this installment of Genome Technology's technical guide series. In these pages, our panel of experts answer questions to help you set up and conduct a high-quality, well-controlled RNA-seq experiment. It's chock-full of good advice and ideas — but if at the end you are still stuck or just want more to read, check out our resources page in the back for papers and helpful websites. Happy hunting for those new genes!

— Ciara Curtin

Index of Experts

Many thanks to our experts for taking the time to contribute to this technical guide, which would not be possible without them.

Rui Chen
Baylor College of Medicine

Nicole Cloonan
Institute for Molecular Bioscience
University of Queensland

Hui Jiang
Stanford University

Brian Wilhelm
Institute for Research in Immunology and Cancer
Université de Montréal

Q1: What are the advantages of using RNA-seq?

RNA-seq provides a more comprehensive view of the transcriptome with one experiment. Its advantages over microarrays include: no dependence on prior knowledge, no design work, increased dynamic range and sensitivity due to its 'digital nature', information on splicing variation, lower cost than tiling arrays, scalability in proportion to sequencing depth, and support for downstream applications such as detection of mutations or RNA editing.

— Rui Chen

RNA-seq has several advantages over microarrays for studying gene expression. These include:

(i) the potentially unlimited dynamic range of expression;
(ii) the greater sensitivity of the sequencing data;
(iii) the improved ability to discriminate regions of high sequence identity; and
(iv) the ability to profile transcription without prior assumptions of which genomic regions are expressed.

Additionally, as one has the sequence of the expressed regions, examining sequence content in a genomic context (such as changes from SNPs or RNA editing events, length of UTRs, and alternative splicing) allows a more biologically rich interpretation of the data than simple expression levels.

— Nicole Cloonan

For measuring gene expression, RNA-seq provides more accurate measurements than microarrays and has equal or higher throughput. RNA-seq does not suffer from the complicated probe affinity and cross-hybridization effects that, in microarrays, are extremely hard to understand and model. Because of this, RNA-seq can be used more widely than microarrays, for example in reliably quantifying alternative splicing and allele-specific or isoform-specific gene expression. Besides measuring gene expression, RNA-seq is also useful for discovering novel splicing events, novel exons, or even novel genes, which is almost impossible with microarrays because the genomic regions targeted by the probes have to be known beforehand.

— Hui Jiang

The principal advantage of using RNA-seq compared to using microarrays is the resolution of the data. It would be unfeasible, even for simple eukaryotic model organisms, to design microarrays that have single base-pair resolution. In addition, RNA-seq data is extremely rich, allowing one to look at SNPs and splicing patterns in data originally obtained to look at expression levels. Another important advantage of RNA-seq is that no a priori knowledge of the genome content is required before conducting the experiment. It is therefore possible to sequence the cDNA of an organism without any genome structure or sequence available and build de novo gene models based on the RNA-seq data. This differs from the approach of using RNA hybridized to tiling arrays, which does allow unbiased discovery of transcribed regions, but requires genomic DNA sequence information to design microarray probes.

— Brian Wilhelm

Q2: What RNA purification method do you use, and why?

We use an mRNA purification kit from Invitrogen to purify the mRNA from total RNA. After one round of purification, more than 95 percent of ribosomal RNA can be removed. Additional rounds of purification can be used to achieve higher purity.

— Rui Chen
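
As a rough illustration of why additional rounds help, the figure quoted above (more than 95 percent of rRNA removed per round) can be turned into a small calculation. The sketch below is not part of any kit protocol: the function name, the starting rRNA share of 90 percent, and the assumptions that each round removes the same share and that no mRNA is lost are all hypothetical choices made for illustration.

```python
def rrna_fraction_after_rounds(initial_rrna_fraction, removal_per_round, rounds):
    """Return, for each round, the share of the remaining RNA that is rRNA.

    Simplifying assumptions (not from the guide): every round removes the
    same share of the rRNA still present, and no non-rRNA is lost.
    """
    rrna = initial_rrna_fraction
    other = 1.0 - initial_rrna_fraction
    fractions = []
    for _ in range(rounds):
        rrna *= 1.0 - removal_per_round      # rRNA left after this round
        fractions.append(rrna / (rrna + other))
    return fractions


# Hypothetical input: 90 percent of total RNA is rRNA, 95 percent removed per round.
for n, frac in enumerate(rrna_fraction_after_rounds(0.90, 0.95, 3), start=1):
    print(f"after round {n}: {frac:.1%} of the remaining RNA is rRNA")
```

Under these assumed inputs, roughly 31 percent of the RNA left after one round is still rRNA, falling to roughly 2 percent after a second round. That is one way to see why a second pass may be worthwhile when higher purity is needed.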

The purification method chosen should reflect the experimental design and be compatible with the library-making process. For example, if you want to sequence small RNAs, then the purification method should allow this. Standard column purifications typically lose RNAs smaller than 200 nucleotides, and other, more specialized preparation methods might be needed. We prefer to use column-based cleanups as they provide superior-quality RNA suitable for the downstream enzymatic reactions.

— Nicole Cloonan

The method for RNA purification we use involves a first isolation step using Trizol (Invitrogen) followed by a cleanup step using a column (RNeasy, Qiagen). In order to ensure complete removal of any contaminating gDNA in the RNA, a DNase digestion is performed before further processing. It is critical, however, that the RNA sample not have trace amounts of organics left that might reduce the DNase activity, and a column-based purification step consistently yields high-quality RNA in our hands.

— Brian Wilhelm

Q3: What library preparation method do you use, and why?

We use the Illumina mRNA-Seq sample prep method to prepare the library. Using this protocol, 1 μg to 10 μg of total RNA is sufficient to construct the library. Because the protocol includes a fragmentation step, most mRNA is sheared to a narrow size range, which reduces bias and improves yield.

— Rui Chen

Given the wealth of antisense transcription, overlapping genes, and other novel features of the transcriptome, generating strand-specific information greatly increases the utility of the data, and therefore we prefer to use library preparation methods that capture the strand of origin (such as SQRL or LEGenD). The downside to this is that strand-specific protocols typically require more starting material, which may not be possible with some samples. In these cases, amplifying the small amount of starting material before proceeding will lose stranded information, but will allow RNA-seq to be performed on limiting amounts of material.

— Nicole Cloonan

For library preparation, after an rRNA removal step (Ribominus, Invitrogen) we use the Whole Transcriptome Analysis Kit from Applied Biosystems for sequencing on the SOLiD machine. This kit allows us to maintain the orientation of the reads and to work using small amounts of starting material. The use of a kit where everything is QCed for the same procedure also removes some concerns about mixing and matching reagents from different companies.

— Brian Wilhelm

Q4: What strategies do you use to improve dynamic range?

Due to the 'digital nature' of RNA-seq, there is no theoretical limit on dynamic range. The dynamic range of one experiment depends on the depth of the sequencing and the complexity of the library. If one aims to detect very low-level expression, one should start with a relatively large amount of RNA and increase the sequencing coverage.

— Rui Chen

Depletion of abundant RNA species is the major tool we use to increase the dynamic range of RNA-seq, the most common being rRNA depletion. Sometimes multiple rounds of depletion are needed to completely remove the rRNA, and this should be evaluated by Bioanalyzer or similar. However, sub-cellular fractionation, sucrose fractionation of polysome-associated RNAs, or other methods of RNA fractionation can further improve the dynamic range of the specific population of RNAs you are interested in.

— Nicole Cloonan

In general, we don't take any particular steps to improve the dynamic range of results as this is rarely a limiting factor. Because the dynamic range of the sequencing data is dependent only on depth of library sequencing, it is useful to have a specific limit set for the depth of sequencing required to properly address the scientific question of a study.

— Brian Wilhelm
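
The answers above tie dynamic range to sequencing depth. One hedged way to put a number on "the depth of sequencing required" is a simple Poisson model: assume each read independently comes from a given transcript with a fixed probability, then ask how many reads are needed to see that transcript at least k times. The model ignores transcript length, mapping efficiency, and library complexity, and the function names, thresholds, and example abundance are illustrative assumptions rather than anything the experts prescribe.

```python
import math


def prob_at_least_k_reads(depth, transcript_fraction, k=1):
    """P(X >= k) for X ~ Poisson(depth * transcript_fraction)."""
    lam = depth * transcript_fraction
    # Sum P(X = i) for i < k, then take the complement.
    below = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below


def min_depth_for_detection(transcript_fraction, k=10, target_prob=0.95,
                            step=1_000_000):
    """Smallest depth, in multiples of `step`, that reaches `target_prob`."""
    if transcript_fraction <= 0:
        raise ValueError("transcript_fraction must be positive")
    depth = step
    while prob_at_least_k_reads(depth, transcript_fraction, k) < target_prob:
        depth += step
    return depth


# Hypothetical target: a transcript contributing one read per million.
depth = min_depth_for_detection(transcript_fraction=1e-6, k=10, target_prob=0.95)
print(f"reads needed under this model: {depth:,}")
```

With these assumed settings the search stops at 16 million reads. Changing k, the target probability, or the assumed abundance shows how quickly the required depth grows for rarer transcripts.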

Q5: What quality-control steps do you include?

We use an Agilent RNA chip to check the quality of the total RNA and of the mRNA after purification. The size distribution of the sequencing library is determined by gel electrophoresis. Both PicoGreen and qPCR are used to measure the quantity of the library before sequencing.

— Rui Chen

Quality control at almost every step is crucial for good library preparation, and the most important QC step is the first one, checking that the underlying biology is right. We often assay RNA destined for RNA-seq libraries on microarray chips first, to verify that the biology is correct before proceeding with costly sequencing. Bioanalyzer (or similar) plots to assess initial RNA quality, depletion or enrichment success, fragmentation sizes, and final library sizes are very important. If it is a new library method, then cloning and capillary sequencing of the library fragments is important to determine whether the method has worked before committing to large-scale sequencing runs.

— Nicole Cloonan

In our lab, we apply several quality-control steps in RNA-seq data analysis. The percentage of reads that are mapped to annotated genes is the most important indicator of whether an experiment is successful. Usually, if fewer than 30 percent of the reads are mapped, it is very likely that something is wrong. There could be several reasons: unsuccessful ribosomal RNA removal, too many sequencing linkers, unsuccessful machine calibration, or something else. Each of these problems requires some additional troubleshooting steps to confirm. Besides the percentage of mapped reads, we also check the average sequencing error rate at each position in the reads, the nucleotide composition at each position in the reads, and many other possible criteria depending on the experiment.

— Hui Jiang

The most important factor in an RNA-seq experiment is the quality of the initial starting material. The use of high-quality RNA is essential, given the sensitivity of the technique, so we verify the quantity (Nanodrop) and quality (Bioanalyzer Chips, Agilent) of our starting material, and check again at various stages during library preparation. The total RNA is also treated with RNase-free DNase I to remove any contaminating genomic DNA prior to reverse transcription. We also try to relate results from downstream analysis back to the sample preparation and machine operation values in order to track problems that are systematic.

— Brian Wilhelm

Q6: What tools do you use, and why, to analyze your RNA-seq data?

We use an in-house pipeline based on blat alignments. Five to six percent of our RNA-seq reads (75 base pairs in length) contain a splice junction event. Blat natively supports intron mapping, and other available software was not adequate.

— Rui Chen

We use RNA-MATE to recursively align color-space tags to a reference genome and custom exon-junction libraries (we wrote it, so that's not surprising). Post-mapping data processing, such as assigning tags to different gene models, can be done using Galaxy (http://main.g2.bx.psu.edu/). After tags are assigned to gene models, the data can be viewed and analyzed in all of the standard ways we are used to for microarray experiments (such as GenePattern, Bioconductor, or GeneSpring). The output of RNA-MATE can be used to view the genomic context of gene expression in the UCSC Genome Browser. Other analyses are often done by custom bioinformatics on an as-needed basis.

— Nicole Cloonan

For read mapping, we use ELAND written by Illumina and seqMap written by our lab. For quality control and gene expression calculation, we use rSEQ, a tool written by our lab. For data visualization, we use the UCSC Genome Browser written by UCSC and the CisGenome browser written by our lab. We also use TopHat written by the University of Maryland and our in-house tool SpliceMap to detect novel splicing junctions. For all other possible analyses depending on the experiment, we write programs using Linux shell scripts, Python, Matlab, R, and C++. We use our own tools mostly because they are easier to control and also because right now there are not many tools for RNA-seq analysis out there.

— Hui Jiang
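
To make "gene expression calculation" concrete, here is a hedged sketch of one widely used normalization, RPKM (reads per kilobase of transcript per million mapped reads), as introduced in the Mortazavi et al. paper listed in the resources. This is not a description of how rSEQ works internally; the function name, gene names, and counts are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    RPKM = 1e9 * C / (N * L), where C is the number of reads mapped to the
    gene, N the total number of mapped reads, and L the gene length in bp.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return 1e9 * read_count / (gene_length_bp * total_mapped_reads)


# Hypothetical counts: gene -> (reads mapped to gene, gene length in bp).
genes = {"geneA": (500, 2_000), "geneB": (500, 10_000)}
total_mapped = 10_000_000

for name, (count, length) in genes.items():
    print(name, round(rpkm(count, length, total_mapped), 2))
```

The two toy genes receive the same number of reads, but the shorter one gets the higher value. Correcting for that length effect is the point of the normalization.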

We use a variety of tools including both published open-source algorithms and programs developed in-house. For read mapping, we use both MAQ and BWA, which perform well with colorspace data. For expression analysis and visualization, we rely entirely on scripts and tools developed in-house. Although there are some commercial software packages already available to deal with some of these aspects, because RNA-seq data is so rich, it will be difficult for a single software package to cover every possible type of analysis that researchers might want to do. Developing an in-house analysis pipeline also has the advantage that individual algorithms can be replaced when better ones are developed, or additional (possibly project-specific) software can be seamlessly integrated.

— Brian Wilhelm
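
The point about replaceable algorithms can be sketched as a pipeline whose steps are passed in as functions. Everything below is a toy under stated assumptions: the two "mappers" do naive substring matching and stand in for real aligners such as MAQ or BWA (they do not call them), and all names and sequences are hypothetical.

```python
from collections import Counter


def exact_match_mapper(reads, reference):
    """Toy mapper: assign each read to the first gene that contains it."""
    hits = []
    for read in reads:
        for gene, seq in reference.items():
            if read in seq:
                hits.append(gene)
                break
    return hits


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def both_strand_mapper(reads, reference):
    """Toy replacement mapper that also tries the reverse complement."""
    hits = []
    for read in reads:
        for gene, seq in reference.items():
            if read in seq or revcomp(read) in seq:
                hits.append(gene)
                break
    return hits


def run_pipeline(reads, reference, mapper=exact_match_mapper, summarize=Counter):
    """Map reads, then summarize hits; each step can be swapped independently."""
    return summarize(mapper(reads, reference))


# Hypothetical reference and reads.
reference = {"geneA": "ACGTACGTTT", "geneB": "GGGCCCAAAT"}
reads = ["ACGT", "CCCA", "ATTT", "GTAC"]

print(run_pipeline(reads, reference))                             # default mapper
print(run_pipeline(reads, reference, mapper=both_strand_mapper))  # swapped mapper
```

Swapping the mapper changes the counts without touching the rest of the pipeline, which is the kind of flexibility the answer above describes.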

List of Resources

If you're still stuck, be sure to check out some of the resources below to see if they have the answer to your question.

Publications

Cloonan N, Forrest ARR, Kolle G, Gardiner BAB, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM. (2008). Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nature Methods. 5: 613-619.

Cloonan N, Xu Q, Faulkner GJ, Taylor DF, Tang DT, Kolle G, Grimmond SM. (2009). RNA-MATE: a recursive mapping strategy for high-throughput RNA-sequencing data. Bioinformatics. 25(19): 2615-2616.

Lister R, O'Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, Ecker JR. (2008). Highly Integrated Single-Base Resolution Maps of the Epigenome in Arabidopsis. Cell. 133(3): 523-536.

Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. (2008). RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research.

Morin R, Bainbridge M, Fejes A, Hirst M, Krzywinski M, Pugh T, McDonald H, Varhol R, Jones S, Marra M. (2008). Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. Biotechniques. 45: 81-94.

Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 5: 621-628.

Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. (2008). The Transcriptional Landscape of the Yeast Genome Defined by RNA Sequencing. Science. 320: 1344-1349.

Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, Schmidt D, O'Keeffe S, Haas S, Vingron M, Lehrach H, Yaspo ML. (2008). A Global View of Gene Activity and Alternative Splicing by Deep Sequencing of the Human Transcriptome. Science. 321: 956-960.

Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bähler J. (2008). Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature. 453: 1239-1243.


Websites

Bioconductor
http://www.bioconductor.org/

BWA
http://bio-bwa.sourceforge.net/bwa.shtml

CisGenome
http://www.biostat.jhsph.edu/~hji/cisgenome/index.htm

Galaxy
http://main.g2.bx.psu.edu/

GenePattern
http://www.broadinstitute.org/cancer/software/genepattern/

MAQ
http://maq.sourceforge.net/

UCSC Genome Browser
http://genome.ucsc.edu/