By Monica Heger
Researchers from the Riken Yokohama Institute in Japan have refined a previously developed method for identifying transcription start sites in the genome, improving the resolution of the approach by more than a thousand-fold.
The methods, called nanoCAGE and CAGEscan and described in a paper published this week in Nature Methods, use sequencing to identify the 5' ends of transcripts with as little as 10 nanograms of starting RNA and could have "applications ranging from drug screening, biopsy analysis, and whole-transcriptome association studies," according to the authors of the paper.
The methods build upon cap analysis gene expression, or CAGE, an approach for capturing the capped 5' ends of RNA that Riken published in 2003 and that requires 50 micrograms of starting RNA.
The new methods yield a more complete picture of transcription than typical RNA-seq protocols, enabling researchers to identify tissue-specific promoters and alternative promoters, as well as to determine the functional consequences of the promoters, said Piero Carninci, senior author of the study and leader of the functional genomics technology team at Riken.
The two methods work together to provide additional transcription information. The nanoCAGE approach dramatically reduces the amount of RNA required to identify the 5' ends of transcripts, while CAGEscan links transcription start sites to RNA products.
Similar to other RNA-seq protocols, the Riken researchers first use reverse transcriptase to convert the RNA to cDNA. However, using the template-switching property of reverse transcriptase, they are also able to select the 5' ends of the capped transcripts, so that the resulting strand of cDNA contains several cytosines that "correspond to the cap structure." The template switching does not require any purification steps, which allows the use of less starting material.
The second advance, which also reduces the amount of starting material needed, is their amplification step. Before amplifying, the researchers used random primers to select for both coding and noncoding RNA. They then developed a method called semi-suppressive PCR, which selectively amplified fragments about 300 bases or longer. In this method, the linkers at the 5' and 3' ends of the cDNA had complementary sequences, so short fragments would self-anneal and not be amplified, while longer fragments would anneal to independent primers. This step filtered out artifacts and non-oriented cDNAs.
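As a rough illustration of the size selection described above, the following Python sketch treats semi-suppressive PCR as a simple length filter. The 300-base figure comes from the description above; the function names, the example fragment lengths, and the hard cutoff are illustrative assumptions, since real suppression is likely gradual rather than all-or-nothing.

```python
# Toy model of semi-suppressive PCR as a length filter.
# Assumption: a hard cutoff at about 300 bases; this is a simplification,
# not the researchers' actual protocol.

MIN_AMPLIFIED_LENGTH = 300  # bases, approximate figure from the article


def survives_amplification(fragment_length, cutoff=MIN_AMPLIFIED_LENGTH):
    """Return True if a cDNA fragment is long enough to be amplified.

    Shorter fragments are modeled as self-annealing through their
    complementary end linkers, which keeps them from being amplified.
    """
    return fragment_length >= cutoff


def select_amplified(fragment_lengths, cutoff=MIN_AMPLIFIED_LENGTH):
    """Keep only the fragment lengths that would be amplified."""
    return [n for n in fragment_lengths if survives_amplification(n, cutoff)]


if __name__ == "__main__":
    # Hypothetical cDNA fragment lengths, in bases.
    lengths = [80, 150, 299, 300, 450, 1200]
    print(select_amplified(lengths))
```

In this simplified picture, only the fragments at or above the cutoff carry through to library preparation.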
Finally, the sequence libraries were prepared by cleaving the cDNA into 25-base fragments with barcodes to enable pooling. The researchers then sequenced libraries prepared from 10, 50, 250, and 1,250 nanograms of total RNA on the Illumina Genome Analyzer and aligned the tags to the human genome. They found that 10 nanograms of RNA and 1,250 nanograms of RNA produced similar pictures of the transcriptome.
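The pooling step can be pictured with a short Python sketch that splits barcoded reads back into their libraries and counts the 25-base tags in each. The 25-base tag length comes from the description above; the barcode sequences, their three-base length, their position at the start of each read, and the library labels are all hypothetical, chosen only to keep the example small.

```python
from collections import Counter, defaultdict

TAG_LENGTH = 25  # bases, as stated in the article

# Hypothetical barcodes: the real sequences and their length are not given.
BARCODES = {"ACG": "10ng", "TGC": "1250ng"}
BARCODE_LENGTH = 3


def demultiplex(reads, barcodes=BARCODES):
    """Split pooled reads by barcode and count the 25-base tags per library.

    Assumes each read starts with its barcode, followed by the tag.
    """
    counts = defaultdict(Counter)
    for read in reads:
        barcode = read[:BARCODE_LENGTH]
        tag = read[BARCODE_LENGTH:BARCODE_LENGTH + TAG_LENGTH]
        library = barcodes.get(barcode)
        # Skip reads with an unknown barcode or a truncated tag.
        if library is None or len(tag) < TAG_LENGTH:
            continue
        counts[library][tag] += 1
    return counts


if __name__ == "__main__":
    tag_a = "A" * 25
    tag_b = "ACGT" * 6 + "A"
    pooled = ["ACG" + tag_a, "ACG" + tag_a, "TGC" + tag_a, "TGC" + tag_b]
    for library, tags in demultiplex(pooled).items():
        print(library, dict(tags))
```

Comparing the per-library tag counts is one simple way to check whether a low-input library and a high-input library give a similar picture.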
The authors of the paper note that the new methods help account for the "emerging view" in research, which "suggests that most genes have multiple [transcription start sites] differing by multiple bases and driven by various core promoters." Carninci said that the advantage of nanoCAGE over RNA-seq is that it allows researchers to identify all these different transcription start sites, giving a more complete picture of the transcriptome.
"RNA-seq is not so good at identifying the real 5' end of the gene," he said. "It is very ambiguous."
Furthermore, the CAGEscan step allows researchers to connect a specific promoter to RNA structure, Carninci said, which is especially important for non-coding RNA. "Many times we don't know if a non-coding RNA is overlapping a coding sequence, or if it is an independent transcript, or if it is promoter of a known protein coding RNA," he said. "But, with nanoCAGE and CAGEscan, we can reconstruct a map."
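One way to picture the kind of map Carninci describes is the Python sketch below, which links each 5' start position to the annotated regions reached by reads from the same molecule. It assumes that every sequenced molecule yields a start position paired with one downstream position, which is not spelled out above; the region names, coordinates, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical annotated regions on one strand: (name, start, end), inclusive.
REGIONS = [("geneA_exon", 1200, 1800), ("ncRNA_X", 5000, 5600)]


def region_for(position, regions=REGIONS):
    """Return the name of the first region containing position, else None."""
    for name, start, end in regions:
        if start <= position <= end:
            return name
    return None


def link_start_sites(pairs, regions=REGIONS):
    """Map each 5' start position to the regions its downstream reads hit."""
    links = defaultdict(set)
    for start_site, downstream in pairs:
        links[start_site].add(region_for(downstream, regions) or "unannotated")
    return dict(links)


if __name__ == "__main__":
    # Hypothetical (start site, downstream position) pairs.
    pairs = [(1000, 1300), (1000, 1750), (4900, 5100), (4900, 9000)]
    for start_site, hits in link_start_sites(pairs).items():
        print(start_site, sorted(hits))
```

A start site whose downstream reads land only in a known coding exon would look like a promoter of that gene, while one whose reads fall in unannotated sequence would point to an independent transcript.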
Transcription start sites are also tissue-specific, with genes having different start sites depending on the tissue they are located in. The new methods can identify the specific 5' end in different tissue types, said Manolis Dermitzakis, professor of genetic medicine at the University of Geneva Medical School, who uses RNA-seq to study gene expression and who was not involved with the study.
"When it comes to asking questions like where the regulatory elements are, it's important to know the 5' end of the gene," he said. Not knowing the 5' end is like "designing a house and not seeing where the door is. If you don't know where the door is, you lose track of the whole dynamic of the house."
He added that some genes can have up to 20 different start sites, and being able to accurately identify them is important. With RNA-seq, researchers are more likely to identify only the most common start site and may misclassify the alternative start sites as enhancers.
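To make the idea of multiple start sites per gene concrete, the following Python sketch groups the 5' end positions of aligned tags into candidate start-site clusters. It is a minimal illustration rather than the researchers' analysis pipeline: the 20-base gap threshold, the function names, and the example positions are assumptions, and all positions are taken to lie on one strand of one chromosome.

```python
def cluster_start_sites(positions, max_gap=20):
    """Group 5'-end positions into clusters.

    A new cluster starts whenever the next position lies more than
    max_gap bases past the previous one. max_gap is an arbitrary
    illustrative threshold.
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def summarize(clusters):
    """Return (start, end, tag_count) for each cluster."""
    return [(c[0], c[-1], len(c)) for c in clusters]


if __name__ == "__main__":
    # Hypothetical 5'-end positions of aligned tags.
    positions = [1000, 1002, 1002, 1005, 1500, 1503, 4000]
    for start, end, count in summarize(cluster_start_sites(positions)):
        print(start, end, count)
```

The cluster with the most tags would correspond to the most common start site, while the smaller clusters are the alternative start sites that a 5'-end method is meant to resolve.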
Carninci said that his group is using the technology in collaboration with the National Human Genome Research Institute's ENCODE project, which aims to identify all the regulatory elements of the human genome.
The team is also collaborating with neurobiologists to study Parkinson's disease and neurons involved in brain plasticity. The method will be particularly useful because obtaining large amounts of RNA from neurons is difficult, Carninci said. In addition, neurons involved in brain plasticity make up less than 1 percent of the neurons in the cortex. "It is very difficult to isolate more than 50 ng, even more than 10 ng of RNA from those neurons, so those kinds of analyses have been impossible."
Carninci added that while his team has used the method on the Illumina platform, it would be easily adaptable to any other sequencing platform.
In terms of commercializing the protocol, he said that Riken already has a patent, and that a spinoff company may offer the technique as a service, although he declined to give details on the company. He also said that the protocol was simple enough that many researchers would be able to reproduce it in their own labs.