New Ways To Track Transcripts


For a while, to peek into a cell's transcripts, researchers mainly relied on microarrays. But with the rise of transcriptomics, the failings of arrays — the biases of using probes, potential for cross-hybridization, and background noise — are thrown into sharp relief as researchers begin to look globally at transcripts and want to have even more information about them. "The big challenge is to map out all the isoforms and go a lot deeper than anyone's ever gone before," says Yale University's Michael Snyder.

Instead of relying on arrays, researchers are turning to sequencing-based methods to study gene expression. In particular, next-generation sequencing technologies are opening up transcriptomics to move beyond just gene expression to include alternative splicing, different isoforms, low abundance transcripts, and more.

"Ever since high-throughput sequencing became a viable technology, I think one of the obvious applications has been to study gene expression levels — and particularly it offers a lot of potential to use over arrays," says the University of Chicago's John Marioni. "For example, particularly for genes that are only moderately expressed, that's difficult to discriminate the signal from the background in an array. You might be able to identify those accurately using sequencing technology."

In a spate of papers, researchers are developing next-gen sequencing-based serial analysis of gene expression and RNA-seq, and comparing both to microarrays. So far, it seems the next-gen-based techniques are holding their own, or even slightly surpassing arrays, for delving into the transcriptome.


One method that has benefited from the advent of next-gen sequencing technologies is serial analysis of gene expression. In SAGE, the mRNA is converted to cDNA and a short sequence tag is cut from each transcript at a restriction site of choice near its poly-A tail. The tags can be short or long, ranging in length from 10 to 17 base pairs. All the tags making up a SAGE library can then be sequenced. Thanks to new platforms, such libraries can now be sequenced quite rapidly.

"These next-generation technologies make long SAGE a much more cost-effective method, where previously we were doing everything by Sanger sequencing. It was more expensive to do long SAGE effectively," says Isobel Parkin at Saskatoon Research Centre. She adds, "I think it's a fantastic technology."

Unlike arrays, SAGE is independent of what is already known about the transcript being studied — though having genomic information can make the analysis much richer — and SAGE can be quantitative.

At the Université de Lyon, Olivier Gandrillon uses SAGE to study gene expression differences between chicken cells undergoing self-renewal and ones that are undergoing differentiation to determine what signals the process of differentiation. "After discussing it with a couple of colleagues, we thought SAGE was much more adapted towards what we were looking for because we had no idea what genes might be involved in [differentiation] because it is a very transient state," Gandrillon says. "We wanted to have the most extensive covering of our transcriptome. We thought SAGE was ideal for that." For his work, microarrays weren't even an option (there are no chicken-based microarrays available), but SAGE can be used on non-mainstream organisms. In his chicken studies, Gandrillon uses the method to see which genes have their expression levels change between differentiating and self-renewing cells, and he says it's reliable. "It really gives you a good idea of what's inside the cell."

"You can use it on any species, even where you have no sequence data available," says Parkin, who studies low temperature responsive genes in Arabidopsis, but is also interested in applying that work to crop plants such as Canola, for which genomic data may not be available.

But having the Arabidopsis genome at hand for her analyses allows her to use SAGE data in a more effective way, she says. In her study of low temperature responsive genes in Arabidopsis, Parkin compared her SAGE data to what had already been uncovered using microarrays. Of the low temperature responsive genes she found, 40 percent had already been identified. The rest were novel. When she compared the microarray analyses with one another, she found wide variation in the genes they identified. "I think SAGE, if you have enough depth, it's very good for looking at both low abundant and high abundant transcripts," Parkin says. "Microarrays tend to saturate the high abundant transcripts so it is more difficult to try to tell differences in expression."

In addition to identifying transcripts, SAGE can also be quantitative. "Every time you capture a tag, you are effectively counting the number of times that transcript is present. In that way, you can work out the abundance of that particular transcript within an RNA pool," Parkin notes.
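The counting Parkin describes can be sketched in a few lines of Python. This is a minimal illustration, not any group's actual pipeline: the tag sequences are made up, and scaling to tags per million is just one common way to compare libraries of different sizes.

```python
from collections import Counter

def tag_abundance(tags):
    """Return {tag: (raw_count, tags_per_million)} for a list of sequenced tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    # Each observation of a tag counts one captured transcript; scale to per-million.
    return {tag: (n, n * 1_000_000 / total) for tag, n in counts.items()}

# Hypothetical SAGE library: ten sequenced tags from three transcripts.
library = (
    ["CATGAAACCCGGGTTTAAACC"] * 6
    + ["CATGTTTGGGCCCAAATTTGG"] * 3
    + ["CATGACGTACGTACGTACGTA"]
)

abundance = tag_abundance(library)
for tag, (n, per_million) in sorted(abundance.items(), key=lambda kv: -kv[1][0]):
    print(tag, n, round(per_million))
```

In a real library the total would run to hundreds of thousands or millions of tags, which is what makes the rare transcripts countable at all.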

One downside to using SAGE might be that it is teetering on the edge of being high throughput. Creating the libraries can be time-consuming and problematic, Parkin says. "[When] we first started, the library construction was a little bit tricky and, at the time, there was no kit available for developing these libraries. But we got quite efficient at that, so that was fine," she says.

The number of steps involved in SAGE, however, has been reduced. Gandrillon remembers the way it was before next-gen sequencing was used, when SAGE had the extra step of creating concatemers that were cloned in bacteria before being sequenced. "I would say those cons are now completely relieved by Solexa sequencing technology because the construction is much lighter," he says. "You can reproduce your experiments within [a single sequencing run], if you can parallelize, which we are trying to do."

Like other next-gen sequencing-based approaches, SAGE spits out a lot of data. Parkin and her group wrote their own software to visualize the data they were generating, especially since they wanted to look at it a bit differently than other SAGE users generally do. "In SAGE people tend to look at a particular tag and say, 'OK, that tag has changed in expression,' but you'll find that the same gene may produce more than one tag because you can get alternative transcripts and anti-sense, things like that. We wanted to see, for a particular gene, all the tags for that particular gene and what each of those tags was doing," she says. "It was a challenge, but in a way it was fun and we got a lot more out of the data by doing that."
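The gene-centered view Parkin describes, in which every tag for a gene is shown side by side, might look something like the sketch below. It is not her group's software. The tag-to-gene mapping, the gene names, and the counts are all hypothetical, and in practice the mapping would come from matching tags against an annotated genome.

```python
from collections import defaultdict

def tags_by_gene(tag_to_gene, counts_a, counts_b):
    """Group tags under their gene, with each tag's count in two libraries."""
    grouped = defaultdict(list)
    for tag, gene in tag_to_gene.items():
        # A tag absent from a library is reported with a count of zero.
        grouped[gene].append((tag, counts_a.get(tag, 0), counts_b.get(tag, 0)))
    return dict(grouped)

# Hypothetical mapping: geneA yields two tags (for example, alternative transcripts).
tag_to_gene = {
    "CATGAAACCC": "geneA",
    "CATGTTTGGG": "geneA",
    "CATGACGTAC": "geneB",
}
control = {"CATGAAACCC": 10, "CATGTTTGGG": 9}
cold = {"CATGAAACCC": 40, "CATGTTTGGG": 2, "CATGACGTAC": 7}

grouped = tags_by_gene(tag_to_gene, control, cold)
for gene, rows in grouped.items():
    for tag, n_control, n_cold in rows:
        print(gene, tag, n_control, n_cold)
```

Here the two geneA tags move in opposite directions between libraries, the kind of pattern a tag-by-tag view would scatter across two unrelated rows.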


The newest method for studying the transcriptome takes a sweeping approach by casting a net for all of the RNA. With RNA-seq, all the RNA is turned into cDNA and sequenced — it can be fragmented at either the RNA or the cDNA step. "The bottom line is you take RNA, drop it into cDNA, and sequence that very, very deeply," Yale's Snyder says.

This method only came about due to the influx of next-gen machines. "With the new sequencing technologies that came along, we thought we'd give that a try and see … what we can see from the sequencing," Snyder says. His group tried RNA-seq on yeast and defined the transcribed regions at the highest resolution yet seen, and it showed something unexpected about yeast genes: They overlap and have heterogeneity at their 3' ends.

RNA-seq has the potential to get down into the transcriptome and characterize its structure. "It really has huge potential to be extremely useful, particularly for identifying regions of spliced boundaries, identifying novelly transcribed regions, potentially for exon-specific expression and so on and so forth," says John Marioni at Chicago. "I think the potential of the technology is really huge."

As with SAGE, RNA-seq does not rely on the investigator having any prior knowledge of what is contained in the transcriptome. "You aren't restricted to identifying regions of the genomes for which probes are present," says Marioni, who was part of a team that compared RNA-seq to arrays for technical reproducibility, an effort that was published in September's Genome Research. "We were looking at expression on the whole gene level. We just compared that with state-of-the-art Affymetrix arrays to try and see, does it perform reasonably well? And as we discovered in the paper, it seemed to be the case that the sequencing did perform pretty well for that purpose."

RNA-seq also covers the transcriptome well enough that low-expressed transcripts can be identified without worrying about how much they're expressed compared to background noise. "Microarrays don't really detect low abundance very well because of those cross-hybridization problems, whereas RNA-seq is just linear down as far as you can sequence," Snyder says.

"In terms of identifying differentially expressed genes, the technology also seemed to perform pretty well," Marioni adds.

The Max Planck Institute for Molecular Genetics' Marie-Laure Yaspo created transcriptomes of human embryonic kidney and B cell lines using RNA-seq. The outcome was that she could see very rarely expressed genes. "You can really go very, very low — which you could never access with arrays," Yaspo says.

From her data, she says that she could also study alternative splicing from reads covering exon-exon junctions. She is currently following up on that to study the differential number of reads that come from alternative splicing.
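As a toy illustration of how reads covering exon-exon junctions reveal splicing, the Python sketch below joins pairs of exon sequences and looks for reads that straddle the join. Everything in it is an assumption made for illustration: the exon and read sequences are invented, the `min_overlap` parameter is hypothetical, and real analyses rely on spliced alignment against a genome rather than exact string matching.

```python
def junction_reads(reads, exons, min_overlap=4):
    """Map each exon pair (i, j), i < j, to the reads that span its junction."""
    hits = {}
    for i in range(len(exons)):
        for j in range(i + 1, len(exons)):
            joined = exons[i] + exons[j]
            boundary = len(exons[i])  # index where exon j begins in `joined`
            for read in reads:
                start = joined.find(read)
                while start != -1:
                    end = start + len(read)
                    # Require at least min_overlap bases on each side of the junction.
                    if start <= boundary - min_overlap and end >= boundary + min_overlap:
                        hits.setdefault((i, j), []).append(read)
                        break
                    start = joined.find(read, start + 1)
    return hits

# Invented exons and reads.
exons = ["AAAACCCCGG", "TTGGTTCCAA", "GATTACAGAT"]
reads = [
    "CCGGTTGG",  # spans exon 0 and exon 1 (adjacent exons)
    "CCGGGATT",  # spans exon 0 and exon 2 (exon 1 skipped)
    "GATTACAG",  # lies entirely within exon 2
]

hits = junction_reads(reads, exons)
print(hits)
```

Counting the reads collected for each exon pair gives a rough measure of how often each splice form is used, which is the differential read count Yaspo describes following up on.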

"It's going to be a lot easier to use for identifying splice forms than [the] array-based approach because you can find reads that cross the boundary between adjacent or indeed non-adjacent exons, and that might give you information about what splice forms are present," Marioni says.

Furthermore, RNA-seq is quantitative. "You really have genes which have only one sequence read and genes which would have 2,000 sequence reads. You see that the more sequence reads, the more of the transcript you cover, naturally," Yaspo says.

"RNA-seq is just much more quantitative than microarrays," Snyder adds. "It has an 8,000-fold dynamic range. Microarrays can be about, at best, 100-fold."

Despite its strengths, RNA-seq does have its weaknesses: the technology is still in its infancy, and might turn out to be less quantitative than it looks.

Harvard's Jonathan Seidman thinks RNA-seq might not be the best tool for quantitative work. There are, he says, an average of 300,000 RNAs per cell and the average mRNA is 3,000 base pairs long. If the most abundant RNA is expressed 100,000 times more than the least abundant RNA, then you'd have to sequence at least 10 million nucleotides, and do that at some depth, to make sure you see everything. "That would be good for looking at splicing, but it's not going to be so good for making distributional, different numbers of RNA in a cell," Seidman says.
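The sampling problem behind Seidman's concern can be made concrete with a back-of-the-envelope calculation. The sketch below does not attempt to reproduce his own estimate. It assumes, purely for illustration, that every read is an independent draw from the RNA pool, that the rarest transcript is present at one copy among the 300,000 RNAs he cites, and that transcript length plays no role.

```python
import math

def prob_detect(n_reads, fraction):
    """Chance that at least one of n_reads comes from a transcript
    making up `fraction` of the RNA pool (independent draws assumed)."""
    return 1.0 - (1.0 - fraction) ** n_reads

def reads_for_detection(fraction, confidence=0.95):
    """Smallest read count with at least `confidence` chance of one hit or more."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction))

# Assumption: the rarest transcript is 1 copy out of 300,000 RNAs in the cell.
rare_fraction = 1 / 300_000

needed = reads_for_detection(rare_fraction)
print(needed, prob_detect(needed, rare_fraction))
print(prob_detect(100_000, rare_fraction))
```

Under these assumptions, merely seeing the rarest transcript once takes on the order of a million reads, and seeing it often enough to measure its level reliably takes many times more.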

Since RNA-seq is a young method, Marioni says that any inherent biases haven't yet been explored. "I think it's probably similar to the early days of microarrays when, for example, all sorts of biases were discovered over the first few years microarrays were developed. The technologies were adapted accordingly and we're probably at a very similar stage with sequencing data," he says. "So it'll take a little while until we really understand all of the biases that are going on and I suspect that experimental protocols … will continue to change." Marioni is particularly interested in how uniformly the transcripts are covered across the sequence.

Yale's Snyder is hoping for some longer reads. Short reads, he says, can tell you where a single splice junction is, but cannot give you the entire splicing pattern. "It would be nice to be able to do more novel discovery. It's hard to do with short reads," he says. "Longer reads will help us find … junctions with novel transcribed regions better."

Meanwhile, the promise of RNA-seq is unlikely to be an immediate threat to arrays for transcriptomics work. Today, Lyon's Olivier Gandrillon says, most projects start out using arrays and move on from there. Eventually, though, "there's going to be less and less arrays and more and more sequencing-based transcriptomics," he says.

Others think that there will always be a spot for microarrays — they're currently cheaper, for one thing. "They are not going to disappear. I think they are relatively easy to use and they don't use much tissue," Seidman says. "For some purposes, they are absolutely great."

The change

The field of transcriptomics is just starting to take off. "The field is transitioning and is going to really map out transcriptomes at a resolution that no one's ever seen before and at an accuracy that no one's ever seen before," Snyder says.

As that happens, researchers can explore the complexity of the transcriptome even further. "We really have tools to look within the cell or the tissue. We can see, for instance, what is the complexity of the transcriptome in the cell, and I think that is very important," Yaspo says. "We apply that to different contexts. For instance, we work on Down syndrome expression profiling, also on several strains of mice of genetically regulated exons which can promote differential alternative splicing — so a slew of projects for which this is essential."

But as researchers dive further into the complexity of the transcriptome, more and more data will pile up. "I think we are going to have mammoth amounts of data because of all this next-generation sequencing technology. That seems to be moving so fast ... they are getting more and more reads per run and you are getting longer tags per run," Parkin says. "Then the question is having enough computing power to be able to … store your data, archive it, and analyze it, which is going to become the problem."

All this data, Yaspo says, will help researchers properly annotate the genome, or at least help them learn more about regulatory regions and transcribed regions. "I believe there will be a lot of data that will accumulate on context: cell and organisms, man and mouse — and from that, a lot of work to put together [how] exactly the genome is decoded into a transcriptome. That's what I see for the next coming years."


Transcriptomics Possibilities Lure Sequencing Vendors

Sequencing vendors have seen the transcriptomics trend and are eager to get in on the act. Both Illumina and Applied Biosystems now have kits that look at the transcriptome based on their sequencing platforms.

In early October, Illumina announced a new product, called mRNA-Seq, to produce whole transcriptome data. The product, says Shawn Baker, a senior product manager at Illumina, is basically full-length cDNA sequencing in a single-read kit. First, poly-A plus RNA is isolated, then fragmented and converted to cDNA with adaptors added. That cDNA library is then sequenced using Illumina's Genome Analyzer. "It's a really exciting thing to do. The reason is, it actually offers the most comprehensive, unbiased view of the transcriptome and that's because it's based on sequencing technology. It doesn't require any kind of probe design and it also doesn't suffer from any hybridization-based artifacts like arrays do," Baker says.

Illumina is working on improving the kit's cost, throughput, and data analysis capabilities. Baker says that as the sequencing technology improves, the cost — currently, a single lane costs about $800 to go from RNA to data — should drop. The company is also beta-testing a new software suite called GenomeStudio for sequencing analysis capabilities.

This month ABI will start an early-access program for its recently developed SOLiD Whole Transcriptome Expression Kit. With this kit, researchers can look at either total RNA or just mRNA from their samples. (An rRNA removal step can be included.) The RNA is then fragmented, adaptors are added to the ends, and the RNA is converted into a cDNA library that can be sequenced on the company's SOLiD system. "It's not only the expression levels of those. It's also existence, the location, and if they are alternatively spliced, we get that too [from] the reads. If they contain mutations, we would see that, too," says Roland Wicki, director of SOLiD strategy.

It's still an early-days technology. "The products that are on the market to support that reflect that right now. That's certainly something that we're working on," says Tom Bittick, the product manager of consumables at Ambion. In particular, ABI is developing data analysis capabilities in conjunction with its academic and commercial partners. The company plans to have freely available, open source software similar to what it provides for small RNAs, but for whole transcriptome sequence data.
