Stanford University researchers have demonstrated the potential of using Pacific Biosciences' single-molecule, real-time, long-read sequencing to assess the repertoire of complete RNA molecules present in the human transcriptome.
"The advantage of the long reads is that you're able to see the complete picture and map out all the isoforms, which is not trivial," Michael Snyder, genetics chair at Stanford and director of the Stanford Center for Genomics and Personalized Medicine, told In Sequence.
Snyder and his team pooled samples from 20 organ or tissue types before sequencing the polyadenylated portion of the RNA using an amplification- and fragmentation-free PacBio sequencing method that reads complementary DNA inserts within circular molecules. Results of the analysis, published in Nature Biotechnology earlier this month, suggest the method can reliably glean full transcript sequences for RNA molecules stretching out to around 1,500 bases — or longer, in some cases.
Using the single-molecule, long-read approach, for instance, the investigators detected thousands of transcript isoforms already described by members of the GENCODE consortium, along with splice site combinations not reported previously. That tally of new isoforms is expected to continue increasing as additional long-read transcriptome studies are done at deeper sequence coverage.
For their part, Snyder and his colleagues are intent on applying the single-molecule, long-read approach to individual human tissue types, rather than pooled samples, in an effort to profile and quantify tissue-specific transcript isoforms and long intergenic non-coding RNAs, or lncRNAs.
He and his team members have started applying newer versions of the PacBio technology and chemistry, which are expected to help in generating longer reads that boost the accuracy of circular consensus sequencing and make it possible to tackle samples with fewer sequencing runs than described in the current study.
Based on results so far, the group is optimistic about the prospect of using PacBio or other long-read methods to get a more refined look at RNA sequences and structure than has been possible using short-read, fragmentation-based approaches.
"We really need to try to move beyond the short-read sequencing for transcriptomics, because we found in the paper that you miss a lot of novel transcripts when you don't interrogate very long reads," co-first author Donald Sharon, a graduate student in Snyder's Stanford lab, told IS.
That's because existing methods for sequencing transcriptomes with short-read platforms typically involve fragmenting cDNA into pieces that are just a few hundred bases long prior to sequencing, he noted.
Such fragmentation makes it tricky to reassemble sequences into full transcripts after sequencing, obscuring information about the full suite of isoforms that may represent a given gene, for example.
"The object we're really interested in — which is the RNA — we fragment it and tear it to pieces, analyze the pieces, and try to make sense of those pieces," noted co-first author Hagen Tilgner, a post-doctoral researcher in Snyder's lab.
"The best way to know what something looked like before you broke it is to not break it in the first place," Sharon said.
So while fragment-based, short-read approaches make it possible to track gene expression, identify sequence variants (IS 12/13/2011), and see which exons are adjacent to particular splice junctions, for example, they don't provide detailed information about each transcript as a whole.
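To make that limitation concrete, here is a minimal sketch in Python. It assumes a hypothetical five-exon gene in which exons 2 and 4 are each optional, and it treats every short read as evidence for a single splice junction at most. The exon numbering and the two isoform mixtures are illustrative assumptions, not data from the study, and the sketch ignores abundance information.

```python
def junctions(isoform):
    """Return the set of adjacent-exon pairs (splice junctions) in one isoform."""
    return {(a, b) for a, b in zip(isoform, isoform[1:])}


def junction_evidence(isoforms):
    """Union of junctions across isoforms: what junction-spanning short reads could reveal."""
    evidence = set()
    for isoform in isoforms:
        evidence |= junctions(isoform)
    return evidence


# Hypothetical five-exon gene in which exons 2 and 4 are each optional.
mixture_a = [(1, 2, 3, 4, 5), (1, 3, 5)]
mixture_b = [(1, 2, 3, 5), (1, 3, 4, 5)]

same_junctions = junction_evidence(mixture_a) == junction_evidence(mixture_b)
same_isoforms = set(mixture_a) == set(mixture_b)

print("identical junction evidence:", same_junctions)
print("identical isoform sets:", same_isoforms)
```

Both mixtures produce the same set of junctions even though they contain different transcripts, so junction-level evidence alone cannot tell them apart. A read that spans the whole molecule reports the exon chain directly.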
"We are good at seeing the general quantity of how much message is coming from each gene," Snyder said. "But the nature of the transcripts and such is really poorly defined because we're only looking at fragments."
Moreover, data available so far suggest each gene is represented by five different transcript isoforms, on average, he pointed out. And that number is expected to continue climbing as the human transcriptome is characterized in more detail.
Transcriptome sequencing has taken off over the past few years, from early transcript sequencing studies using Illumina platforms (see IS 5/6/2008, for example) to more recent efforts that relied on Roche's 454 reads to stretch out the length of RNA transcripts that could be scrutinized (IS 3/9/2010).
The 454 platform made it possible to generate reads that were more than 500 bases long and to see complete sequences for around 26 percent of human transcripts, Tilgner said, representing isoforms at the shorter end of the spectrum. But even those relatively pricey reads didn't come close to covering complete transcript sequences for the majority of genes.
The introduction of longer and longer PacBio reads has provided another option for those interested in profiling full, or nearly full-length, transcripts, Snyder said.
Early versions of the PacBio technology produced reads on the order of 900 bases or so, with fairly pronounced error rates, he noted. But those read lengths have since stretched out significantly, making it possible to read around most inserts multiple times using the so-called circular consensus sequencing approach.
In particular, because PacBio error profiles are typically random, it's possible to come up with fairly accurate consensus sequences for a given insert by generating circular consensus, or CCS, reads that cover each insert two or more times.
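The reasoning behind consensus calling can be illustrated with a toy simulation. The sketch below is not PacBio's CCS algorithm: it assumes substitution-only errors, passes that are already aligned and of equal length, and a simple per-position majority vote. The 15 percent error rate, the nine passes, and the 1,000-base random template are hypothetical values chosen for illustration. Real reads also contain insertions and deletions, so real consensus calling needs an alignment step.

```python
import random
from collections import Counter

BASES = "ACGT"


def noisy_pass(template, error_rate, rng):
    """Copy the template, replacing each base with a different one at error_rate."""
    out = []
    for base in template:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def consensus(passes):
    """Per-position majority vote across equal-length, pre-aligned passes."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


def accuracy(seq, template):
    """Fraction of positions at which seq matches the template."""
    return sum(a == b for a, b in zip(seq, template)) / len(template)


rng = random.Random(0)  # fixed seed so the run is repeatable
template = "".join(rng.choice(BASES) for _ in range(1000))
passes = [noisy_pass(template, 0.15, rng) for _ in range(9)]

single_pass_accuracy = accuracy(passes[0], template)
consensus_accuracy = accuracy(consensus(passes), template)
print(f"single pass: {single_pass_accuracy:.3f}")
print(f"consensus of {len(passes)} passes: {consensus_accuracy:.3f}")
```

Because the simulated errors fall at independent positions in each pass, a wrong base at a given position is usually outvoted by correct bases from the other passes.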
"The reads are now so long that you can read around the same insert multiple times and get a much more accurate read," Snyder explained, "and that works well even for full-length transcripts."
To take a crack at using that technology to interrogate human transcripts, he and his colleagues pooled RNA from 20 organ and tissue types in their current proof-of-principle foray into single-molecule, long-read transcriptome sequencing.
The researchers decided to forgo an amplification step in the current study since they were able to get sufficient starting material. Though amplification is often necessary, Tilgner explained, it can also introduce biases and obscure RNA quantification.
"The price we paid for that is that very lowly expressed transcripts, we simply don't see," he said. "On top of that, we need lots of input material."
The library preparation was also fragmentation-free, meaning the circular consensus molecules contained inserts spanning a broad range of RNA sizes.
Within the 476,000 CCS reads generated from the pooled starting RNA material, the researchers identified 5.1 million sub-reads, representing inserts within each circular consensus molecule considered. Sub-read length averaged around 1,000 bases, with initial cDNA insert sizes apparently keeping many reads shorter than 1,500 bases.
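As a rough back-of-the-envelope check on those figures, the short Python calculation below divides the reported sub-read count by the reported CCS read count. It assumes the rounded numbers quoted here and treats the roughly 1,000-base average as exact, so the results are approximations rather than values taken from the paper.

```python
ccs_reads = 476_000           # CCS reads reported in the article
sub_reads = 5_100_000         # sub-reads reported in the article
mean_sub_read_length = 1_000  # approximate average length, in bases

# Average number of sub-reads contributing to each CCS read.
sub_reads_per_ccs = sub_reads / ccs_reads

# Approximate total bases across all sub-reads.
approx_sub_read_bases = sub_reads * mean_sub_read_length

print(f"sub-reads per CCS read: {sub_reads_per_ccs:.1f}")
print(f"approximate sub-read bases: {approx_sub_read_bases:.2e}")
```

On these numbers, each CCS read draws on roughly 10 to 11 sub-reads on average.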
The researchers found that nearly 99 percent of their CCS reads mapped to the human reference genome at least once and appeared to correspond to cDNA sequences some 85 percent of the time.
Comparisons with existing human transcriptome annotations suggested that the approach produced complete transcripts for most genes under 1,500 bases long. The researchers further validated those results through follow-up PacBio sequencing experiments using an External RNA Control Consortium reference library composed of 92 transcripts of defined lengths and quantities.
For transcripts longer than 2,500 bases or so, the researchers estimated that they generated sequences spanning full-length RNA transcripts roughly half the time.
Using information on CCS reads with split mapping patterns, meanwhile, they were able to get a glimpse at splice sites present in the original pooled RNA sample. That analysis revealed isoforms representing more than 14,000 annotated GENCODE genes, the team reported. Of those, an estimated 10 percent appear to be isoforms containing newly described splice site combinations.
"It's not yet perfect, because there are genes that are very, very long," Tilgner noted. "But it was a very significant step forward compared to 454 [transcriptome sequencing]."
The technology has improved since the current study was published, he and his colleagues noted, prompting enthusiasm about interrogating complete transcripts that are 5,000 bases or longer with CCS reads.
"This technology, especially with latest versions coming out, will be very, very useful for defining the whole transcriptome," Snyder said.
Whereas the team had to use dozens of SMRT cells on the RSI instrument for the current study, the availability of an RSII instrument and improved chemistry is expected to make it possible to generate longer (and, consequently, more accurate) CCS reads at a higher read density of around 40,000 CCS reads per SMRT cell.
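For a sense of scale, the Python snippet below estimates how many SMRT cells it would take to match the current study's 476,000 CCS reads at the quoted density. It assumes the approximate figure of 40,000 CCS reads per cell holds for every cell, which is a simplification, since actual yields vary from run to run.

```python
import math

target_ccs_reads = 476_000   # CCS reads generated in the current study
ccs_reads_per_cell = 40_000  # approximate RSII density quoted in the article

# Round up, since a partial SMRT cell cannot be run.
cells_needed = math.ceil(target_ccs_reads / ccs_reads_per_cell)
print("SMRT cells needed at the quoted density:", cells_needed)
```

Under that assumption, about a dozen cells would match the read count that required dozens of cells on the earlier instrument.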
While it's still not clear how much sequence coverage is needed to find isoforms present in minuscule amounts in human tissues, results from the study support the notion that deeper and deeper sequencing can dig up low-level isoforms missed at shallow coverage.
Going forward, additional work will also be needed to understand the biological relevance of those scantly represented transcripts, if any — investigations that are expected to be aided by the ability to define each transcript's sequence and splice site repertoire, along with its abundance in a given sample.
"Whether it's PacBio or any of the other long-read technologies … it's going to be important to both capture the whole transcript and the quantification — ideally at the same time and in as unbiased a way as possible," Snyder said.
It's possible that fragmentation-based approaches being developed could eventually reveal the same sorts of transcript information if investigators can come up with newer and more accurate strategies for piecing fragmented sequences back together, Tilgner noted, but that remains to be seen.
As they look ahead to studies focused on specific tissue types — including both healthy tissues and tumor samples — the researchers are now considering ways of doing single-molecule, long-read sequencing on RNA from small somatic tissue samples.
There is also interest in looking at whether any of the newly detected lncRNAs in more refined transcriptome studies will coincide with sites identified in genome-wide association studies of disease, Sharon said.