Researchers from the US and Australia have developed a method called RNA CaptureSeq that combines RNA capture and sequencing-to-saturation in order to mine the depths of the transcriptome — a strategy described online recently in Nature Biotechnology.
"The key is really the depth," the study's co-corresponding author John Rinn, a Harvard University stem cell and regenerative medicine researcher, told In Sequence. He likened the approach to a submarine diving into the furthest reaches of specific sites in the transcriptome.
He and his colleagues used custom tiling arrays to target transcripts produced from a subset of the genome, including regions suspected of harboring more transcript diversity than has been documented before. After reverse transcribing RNA and hybridizing complementary DNA to the arrays, they used very deep sequencing to characterize as many transcripts as possible from each region, helping to uncover protein-coding isoforms and long non-coding RNAs, or lncRNAs, that are missed by standard RNA sequencing.
By enriching for nearly 2,300 regions of interest from fetal foot fibroblast cells prior to sequencing, for instance, the investigators showed that they could get nearly 400 times as many aligned transcript reads in regions of interest as they could by doing RNA sequencing without enrichment.
In the process, they detected previously unknown transcript isoforms for protein-coding genes, including well-characterized genes such as p53 and HOX. They also identified a host of long non-coding RNAs and hundreds of rare transcripts corresponding to intergenic regions of the genome.
Beyond the insights it may offer into the nature and extent of transcriptional diversity, the approach also has promise for interpreting genetic data from disease studies and delving into the details of host-pathogen interactions, its developers said.
Though it does not provide a genome-wide view of transcription, the CaptureSeq method offers a clearer view of coding and non-coding transcripts from specific sites in the genome that are expressed infrequently, at low levels, or in a sub-population of cells.
While Rinn said it is difficult to directly compare the cost of RNA CaptureSeq with that of conventional RNA-seq, he explained that the targeted approach makes it possible to get coverage that would not be feasible with transcriptome-wide sequencing approaches.
"There are those users who are interested in a very specific region of the genome or a very specific question about the genome and those who are interested in its entirety," he said. "The former are the ones who benefit from this, because you get this incredible, cost-effective information on a specific region of the genome."
The method stemmed from efforts to get deep sequence data for certain genes of interest, he said, a goal that prompted Rinn and his colleagues to team up with investigators at Roche NimbleGen in 2008 to design suitable tiling arrays.
The researchers have since expanded the scope of their studies, he added, looking at transcriptome patterns in both protein-coding and non-coding regions of the genome. They are now designing arrays to capture and distinguish between RNA from humans and pathogens such as the malaria-causing Plasmodium falciparum.
RNA capture has been combined with sequencing in the past. For example, early this year, Agilent Technologies announced that it was offering SureSelect kits to target genes of interest for gene expression analyses (IS 2/8/2011).
But by coupling this capture to ultra-deep sequencing, Rinn and his colleagues hoped not only to get expression insights for known transcripts, but also to find rarer isoforms and as yet undetected transcripts.
"Although this ability to isolate and target RNA has been used in genetic analysis for some time," they explained in their paper, "here we combine this ability with deep-sequencing technology to provide saturating coverage and permit the robust assembly of rare and unannotated transcripts."
For the current study, the researchers used custom NimbleGen Titanium Optimized Sequence Capture 385 tiling arrays that were designed to capture 770,000 bases of RNA from 2,265 regions of the genome. Included in the targeted sites were roughly 50 known protein-coding genes and lncRNAs, along with a host of intergenic regions previously shown to have little or no evidence of transcription.
Despite the dearth of RNA sequences associated with these intergenic regions in the past, they contain chromatin modifications that are associated with actively transcribed genes. In 2009, Rinn, Broad Institute founding director Eric Lander, and co-authors published a paper in Nature showing that it was possible to track down conserved, non-coding RNA using such chromatin information.
"The way DNA is packaged has a unique signature for actively transcribed genes," Rinn explained. "So we took some of these regions and tiled them with multiple probes to see what was underneath that chromatin signature."
The researchers used both Illumina and Roche 454 platforms for the RNA CaptureSeq experiments described in the new study, though Rinn called the Illumina short-read platform the "workhorse" behind the method.
Because it offers much deeper coverage of the targeted sequences, Illumina short-read data provides a look at the breadth of the transcriptome and the abundance of various isoforms, he explained.
Longer-read platforms such as Roche 454 or the Pacific Biosciences RS are useful for getting more detailed information about splice sites, genetic variation, and gene editing, he noted, but come at the cost of sequence depth.
"The longer the reads, the more confident you are in a given mutation," Rinn said. "So PacBio, 454, and other companies that are going to give us long-read information can really help us home in on where the genetic variants are or where the sequence alterations are."
By doing sequence alignments using reads from each platform independently and in combination, the team was able to explore the transcript diversity and splicing patterns in the targeted regions of the human fetal foot fibroblasts tested.
From their non-enriched, pre-capture RNA-sequencing sample, the researchers reported generating 48,091 transcript sequences with multiple exons. Even so, the proportion of reads that aligned to the targeted regions was much lower using this standard RNA sequencing strategy than it was in experiments employing array-based sequence enrichment.
Whereas 0.21 percent of the 20.4 million standard RNA-sequencing reads aligned to the targeted areas, more than 80 percent of the 25.8 million paired-end reads generated in the CaptureSeq experiments aligned to these regions, providing a mean sequence depth of 4,607-fold.
"Given that RNA CaptureSeq achieved [an approximately] 380-fold enrichment for alignment coverage across targeted regions of the transcriptome," the study authors noted, "we extrapolate that [approximately] 10 billion aligned sequence reads from a single sample by conventional RNA-seq would be required to achieve an equivalent coverage depth across this targeted transcriptional region."
Using the CaptureSeq data, researchers found hundreds of new transcript isoforms for 55 protein-coding genes in the targeted regions, including the oft-studied gene p53. They also uncovered alternatively spliced versions of the lncRNA HOTAIR, numerous rare transcripts from intergenic regions tested, and 163 lncRNAs that either neighbored protein-coding genes or were expressed from the antisense strand of genes.
Nearly a quarter of the newly detected transcripts were not found by RNA sequencing alone, they noted, while another 10 percent were represented by a lone read in the non-enriched RNA sequence data.
The researchers got shallower coverage of the targeted regions when they did the same RNA CaptureSeq experiment using the Roche 454 GS FLX Titanium platform, with roughly 315,000 of the 454 reads aligning to the genome.
The longer read data helped to verify almost 65 percent of the transcripts detected by Illumina sequencing and provided a more detailed look at splicing patterns in the rare, previously unannotated transcripts, including those from the intergenic regions tested.
Even after ignoring reads without evidence of post-transcriptional splicing, the researchers saw hundreds of new splice junctions and multiexon transcripts in intergenic regions. Most had not been reported before and could not be detected in non-enriched RNA-seq data, likely because they are expressed at low levels and/or in just a subset of cells.
Based on such findings, researchers suspect that the transcriptional differences between cells from various developmental stages and tissue types — and even within a population of cells — may be more extensive than previously appreciated.
"Single cell population studies are critical now," Rinn said. "We need single-cell population studies to understand whether this is a sporadic event or if there are microniches or sub-populations of cells that are expressing whopping amounts of these [rare transcripts]."
He and his colleagues are continuing to collaborate with NimbleGen on deep sequencing studies of targeted RNA, but are now taking the RNA CaptureSeq method into other arenas, including studies of host-pathogen interactions.
"We want to use [CaptureSeq] for more and more difficult applications — things like separating out species and perhaps even getting it into the field where you can start collecting blood and hybridrizing it out in the field," Rinn said.
For example, he explained, the team is using the approach in malaria studies to help distinguish parasite sequences from human sequences using arrays designed to capture transcripts from each species.
"Really we're using it now as almost a nucleic acid separation methodology between species," Rinn explained. "Something like this can be used to understand what's going on in vivo in a parasite."
The approach also has potential for interpreting results from genome-wide association studies, the study authors noted, particularly for disease-associated variants that fall within parts of the genome that lack protein-coding genes.
RNA CaptureSeq "provides considerable value because it allows one to focus on and comprehensively interrogate regions of interest," the team wrote. "For example, it can comprehensively profile haplotype blocks identified by genome-wide association studies to be associated with complex disease or phenotypes, many of which occur outside of coding genes so as to identify all gene products produced from these regions as the next step in determining causality."
In cancer studies, researchers believe that deep sequencing of targeted transcripts could provide a better understanding of how transcripts expressed from amplified regions of cancer genomes compare to transcripts from the same region of matched normal samples.
"You can think of it as a molecular census," Rinn said "Instead of doing a census where you survey one in every 100 houses, this surveys every single house."