NEW YORK (GenomeWeb) – Researchers from Stanford University have modified Illumina's TruSeq Synthetic Long Read library prep method to work with RNA sequencing and demonstrated that the technique can sequence full-length cDNAs, including previously undescribed isoforms, with high accuracy.
The researchers call their method SLR-RNA-seq because it is based on the Moleculo synthetic long-read sequencing technology, which Illumina acquired in 2013 and now markets as its TruSeq Synthetic Long Read kit. The technique was published today in Nature Biotechnology.
"The transcriptome is extremely incomplete using present day RNA-seq," senior author Mike Snyder, told GenomeWeb. "It's great for getting overall gene expression, but when you break it into bits, you really don't assemble it back properly," he said.
For instance, most genes have multiple transcripts, and without long reads it's hard to know which exons go with which transcripts. Long reads are critical both for piecing together transcript structure and for preserving allele and edit information, Snyder said.
Previously, the researchers had tested transcriptome sequencing with longer reads on Roche's 454 platform and on Pacific Biosciences technology. But the 454 platform "lacked the ability to provide full-length sequences for each mRNA molecule" and, using PacBio, "it was difficult to generate large enough sequences," the authors wrote, "so that comprehensive transcript diversity could not be deduced with statistical confidence, particularly for transcripts of low abundance."
In the Nature Biotech study, the researchers slightly modified the standard Moleculo protocol, first preparing single-stranded cDNA molecules with PCR-primer sites on each one. They then added those to a 384-well plate and amplified each molecule. Next, the researchers fragmented and barcoded the resulting double-stranded molecules and sequenced them on the Illumina HiSeq 2000 with 125-base paired-end reads. The reads were then grouped by well and assembled into synthetic long reads.
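The grouping step described above can be illustrated with a minimal Python sketch. It assumes each short read has already been reduced to a pair of well barcode and sequence; that layout, the function name `group_reads_by_well`, and the example barcodes are hypothetical and not taken from the study or from Illumina's software. The assembly of each group into a synthetic long read is not shown.

```python
from collections import defaultdict


def group_reads_by_well(reads):
    """Group short reads by the well barcode they carry.

    `reads` is an iterable of (well_barcode, sequence) pairs. This input
    layout is an assumption made for illustration only.
    """
    wells = defaultdict(list)
    for barcode, sequence in reads:
        wells[barcode].append(sequence)
    return dict(wells)


# Hypothetical toy input: three short reads from two wells.
reads = [
    ("well_001", "ACGTAC"),
    ("well_002", "GGCATT"),
    ("well_001", "TACGGA"),
]
grouped = group_reads_by_well(reads)
```

Each value in `grouped` holds the reads that would be handed to an assembler for that well.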
Hagen Tilgner, lead author of the study, told GenomeWeb that the team essentially made two modifications to the standard Illumina protocol, slightly adjusting it to ensure that multiple RNA molecules from the same gene ended up in the same well, and designing cDNA molecules with PCR primers on the ends.
To validate the method, the researchers first tested it using the External RNA Controls Consortium (ERCC) control RNAs, which range up to 2 kb in length. The mixture was spiked into a mouse brain sample and the team produced libraries containing 3.7 million mouse brain and 19,000 ERCC synthetic long reads. They mapped the SLRs to the mouse genome and the known ERCC sequences and compared the results to the original ERCC sequences and also to ERCC reads they previously sequenced using PacBio's circular consensus sequencing (CCS).
They found that the SLRs had fewer indels than the PacBio reads — 96 percent of SLRs were indel-free, compared to 5.5 percent of PacBio reads. However, the SLRs had more missing nucleotides on the 5' and 3' ends than the PacBio reads.
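One way to compute an indel-free percentage of this kind is to inspect each read's alignment string. The sketch below assumes alignments are available as SAM-style CIGAR strings and counts a read as indel-free if it has no insertion (`I`) or deletion (`D`) operation; the function names and example alignments are hypothetical, and the study's exact criterion may differ.

```python
import re

# SAM-style CIGAR operations: a length followed by a one-letter op code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_indel_free(cigar):
    """Return True if the alignment has no insertion (I) or deletion (D)."""
    return all(op not in "ID" for _, op in CIGAR_OP.findall(cigar))


def indel_free_fraction(cigars):
    """Fraction of alignments without indels; 0.0 for an empty input."""
    cigars = list(cigars)
    if not cigars:
        return 0.0
    return sum(is_indel_free(c) for c in cigars) / len(cigars)


# Hypothetical alignments. "N" marks a skipped intron, not an indel.
example = ["500M", "200M1I299M", "100M1500N400M", "250M2D250M"]
fraction = indel_free_fraction(example)
```

Spliced alignments are deliberately not penalized here, since an intron gap in a cDNA-to-genome alignment is expected.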
Next, the team wanted to test the protocol on even longer, more complex RNA molecules. They generated 3.7 million SLRs from the mouse brain and 5.2 million SLRs from human brain and compared them to PacBio-CCS reads from human organs and a human cell line, work that they published last year in the Proceedings of the National Academy of Sciences.
They generated SLRs with an average length of 1,907 bp for the human brain sample and 1,849 bp for the mouse brain sample, both longer than the PacBio-CCS reads they generated, which averaged 1,289 bp. The technologies were comparable in the percentage of molecules that represented full-length transcripts, both averaging between 61 percent and 64 percent.
The SLR method also identified a large number of novel isoforms — around 14.5 percent of the human brain spliced reads, corresponding to 13,800 genes, and around 18 percent of the mouse reads, corresponding to 8,600 genes, had novel splice-site combinations.
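As a rough illustration of what a novel splice-site combination means, the sketch below reduces each read to its chain of introns and checks whether that exact chain appears in an annotation. The coordinates, function names, and the exact-match rule are assumptions for illustration; the study's own criteria are not described here and are likely more involved (this simplification would, for example, flag a truncated read as novel).

```python
def intron_chain(exons):
    """Return the introns implied by a sorted list of (start, end) exons.

    Each intron is a (donor, acceptor) pair: the end of one exon and the
    start of the next.
    """
    return tuple(
        (exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)
    )


def is_novel(read_exons, annotated_chains):
    """True if a spliced read's intron chain is absent from the annotation."""
    chain = intron_chain(read_exons)
    return len(chain) > 0 and chain not in annotated_chains


# Hypothetical annotation with a single three-exon transcript.
annotated = {intron_chain([(100, 200), (300, 400), (500, 600)])}

known_read = [(120, 200), (300, 400), (500, 580)]  # same splice sites
skipping_read = [(100, 200), (500, 600)]           # middle exon skipped
unspliced_read = [(100, 200)]                      # no introns at all
```

Here `known_read` matches the annotation even though its ends differ, because only splice sites are compared, while `skipping_read` would be counted as novel.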
Novel isoforms were found more often in pseudogenes and lncRNAs than in protein-coding genes. The majority of detected spliced genes in all three categories had at least one novel isoform, including 86 percent of protein-coding genes, 86 percent of lncRNAs, and 91 percent of pseudogenes from the human sample.
In addition, the researchers were able to use the SLR data to evaluate the relationship between alternative exons as well as between the human and mouse data. For instance, they found that nine distant molecularly associated exon pairs were shared between the mouse and human data.
"We can deduce all sorts of variation that occurs along the molecules," Tilgner said. For instance, "everything that varies in RNA molecules, we can now connect to the other sites on the RNA molecules and ask whether that is dependent or independent."
Transcriptome sequencing with long reads has a huge advantage over sequencing with short reads, Snyder said, not only because it enables the sequencing of full-length transcripts, but also because it enables transcript quantification.
Snyder added that compared to the PacBio technology, the main advantage of the SLR technique is throughput. "This can be done on one flowcell of a HiSeq," he said.
Tilgner said that getting 50 to 100 reads per gene is important to enable statistical analysis of each gene. The necessary throughput is possible using PacBio's targeted Iso-Seq strategy, and indeed the group used PacBio to validate many of its findings in this study, but it is still not high enough for a whole human transcriptome, he said.
Snyder said that the team now plans to continue to use the SLR method to match transcripts with their start sites and poly-A sites, and to map allele information — "basically every variation you can think of."
"We now have pretty much full-length transcripts and we have high-quality sequence," Tilgen said, "so, we can deduce all sorts of variation that occurs along the molecules."