Pacific Biosciences is expanding its technology into the realms of transcriptome sequencing and human genome sequencing, the company and its customers demonstrated at the Advances in Genome Biology and Technology meeting held in Marco Island, Fla., earlier this month.
As the read lengths and throughput of the PacBio RS II continue to increase, users are finding utility beyond microbial sequencing. Sean McGrath from the Genome Institute at Washington University presented data on using the PacBio system for RNA-seq to predict genes and identify novel isoforms. During a company workshop, PacBio's Chief Scientific Officer Jonas Korlach presented data on the system's ability to identify novel isoforms and alternative splice sites. Additionally, the company's next software release, scheduled for the second quarter, will include Iso-Seq for analyzing RNA-seq data.
In addition, the technology is becoming increasingly useful for larger genomes. PacBio recently demonstrated this by releasing data from the de novo sequencing and assembly of a human genome to 54x coverage using only reads generated on its RS system. The company also sequenced a well-characterized human cell line that is being used as part of a National Institutes of Health project to generate an alternate reference genome. The sequencing generated over 21 million reads with an average read length of 7,680 bp, and the longest read stretched to over 42 kb.
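As a rough sanity check, the read count and average read length quoted above can be turned into an approximate fold coverage. The sketch below is only back-of-the-envelope arithmetic: the genome size of 3.1 Gb is an assumption not taken from the article, and the read count is treated as a lower bound because the article says "over 21 million."

```python
# Back-of-the-envelope coverage estimate from the figures quoted in the article.
READS = 21_000_000            # "over 21 million reads" (treated as a lower bound)
MEAN_READ_LENGTH_BP = 7_680   # average read length reported in the article
GENOME_SIZE_BP = 3_100_000_000  # ASSUMPTION: approximate human genome size


def fold_coverage(n_reads, mean_read_length_bp, genome_size_bp):
    """Total sequenced bases divided by genome size."""
    return n_reads * mean_read_length_bp / genome_size_bp


coverage = fold_coverage(READS, MEAN_READ_LENGTH_BP, GENOME_SIZE_BP)
print(f"approximate coverage: {coverage:.1f}x")
```

Under these assumptions the estimate lands near 52x, which is in line with the reported 54x given that the read count is a lower bound and the genome size is approximate.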
The de novo assembly produced a contig N50 of 4.38 Mb, with the longest contig reaching 44 Mb. By comparison, the most recent reference-guided assembly of the same sample, which used Illumina sequencing and BAC-clone finishing, had a contig N50 of 144 kb. Even the 2007 Sanger-based human reference is much more fragmented than the PacBio assembly, with a contig N50 of 107 kb.
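For readers unfamiliar with the metric, contig N50 is the length of the contig at which, going from longest to shortest, the running total first reaches half of the assembly's total length. A minimal sketch of that calculation follows; the function name and the toy contig lengths are illustrative and are not drawn from the assemblies discussed here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy, made-up contig lengths (total = 300, so the halfway point is 150).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 80 + 70 = 150 reaches the halfway point, so N50 is 70
```

A higher N50 means that more of the assembly sits in long, unbroken contigs, which is why the figures above are used to compare fragmentation.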
RNA-seq identifies many novel isoforms
At AGBT, Korlach also highlighted the PacBio technology's ability to do transcriptome sequencing and said that the next software release, SMRT Analysis 2.2, which is scheduled for release in the second quarter, would include the Iso-Seq bioinformatics pipeline for analyzing transcriptome data.
Korlach told In Sequence that standard RNA-seq sample-prep protocols can easily be adapted for transcriptome sequencing on the PacBio system. The company recommends starting with polyA RNA and using the Clontech SMARTer PCR cDNA synthesis kit. Users can also use an Invitrogen protocol, Korlach said, but that requires more input than the 10 ng needed for the Clontech protocol.
After creating the cDNA, users can do an optional size selection step, either by running gels or by using Sage Science's BluePippin to select for the longer molecules. Korlach said the company recommends three size selection bins: a 1-kb to 2-kb bin, a 2-kb to 3-kb bin, and a bin for cDNAs above 3 kb. After sample prep, the cDNAs are loaded onto the sequencer. The bioinformatics pipeline is similar to the company's HGAP and self-correction algorithms. The reads are clustered in such a way that similar isoforms are grouped together, and then a consensus is built, Korlach said. The method is currently available on PacBio's DevNet site and will be included in the next software release.
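The three recommended size-selection bins can be expressed as a simple lookup. The sketch below is illustrative only: the function names are hypothetical, and the handling of exact boundaries (which bin gets a 2-kb or 3-kb molecule) is an assumption, since the article does not specify it.

```python
from collections import defaultdict


def size_bin(length_bp):
    """Assign a cDNA length to one of the three recommended bins.

    ASSUMPTION: bins are half-open intervals, so exactly 2,000 bp goes to
    the 2-3 kb bin and exactly 3,000 bp goes to the top bin.
    """
    if length_bp < 1_000:
        return None  # shorter than the smallest bin mentioned in the article
    if length_bp < 2_000:
        return "1-2 kb"
    if length_bp < 3_000:
        return "2-3 kb"
    return "3 kb+"


def group_by_bin(lengths_bp):
    """Group cDNA lengths by bin, dropping molecules below 1 kb."""
    bins = defaultdict(list)
    for length in lengths_bp:
        label = size_bin(length)
        if label is not None:
            bins[label].append(length)
    return dict(bins)


# Toy, made-up cDNA lengths in bp.
print(group_by_bin([800, 1500, 2500, 4200, 1200]))
```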
Similar to HGAP, the method uses the longest reads as backbone reads, and the shorter reads are aligned against that backbone to create consensus and error correct the isoforms.
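The backbone idea can be illustrated with a deliberately simplified toy. This is not PacBio's algorithm. It assumes the reads in a cluster differ only by substitutions (no insertions or deletions), places each shorter read at the ungapped offset with the most matching bases, and takes a per-position majority vote. All function names and sequences are made up for illustration.

```python
from collections import Counter


def best_offset(backbone, read):
    """Ungapped placement of `read` on `backbone` with the most matching bases."""
    best, best_score = 0, -1
    for offset in range(len(backbone) - len(read) + 1):
        score = sum(a == b for a, b in zip(backbone[offset:], read))
        if score > best_score:
            best, best_score = offset, score
    return best


def consensus(reads):
    """Use the longest read as the backbone and majority-vote each position."""
    if not reads:
        raise ValueError("need at least one read")
    ordered = sorted(reads, key=len, reverse=True)
    backbone, shorter_reads = ordered[0], ordered[1:]
    # One vote per position from the backbone itself.
    columns = [Counter(base) for base in backbone]
    for read in shorter_reads:
        offset = best_offset(backbone, read)
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(column.most_common(1)[0][0] for column in columns)


# Toy cluster: the longest read carries one substitution error (T at index 4),
# which the three shorter reads outvote.
cluster = ["ACGTTCGTAC", "GTACGT", "CGTACGTA", "TACGTAC"]
print(consensus(cluster))  # ACGTACGTAC
```

Real long-read data contains insertions and deletions, so production pipelines rely on gapped alignment rather than the fixed-offset placement used in this sketch.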
Korlach said that PacBio transcriptome sequencing would have two major applications. First, the technique can be used to create "much better reference transcriptomes," he said. Expression studies that evaluate abundance are "remapping to what you think is there," he said. "But if that reference is incomplete, then you're missing a lot of biology."
A second application of RNA-seq on the PacBio platform will be to identify alternative splice sites and alternative splice junctions that affect disease, he said.
During a presentation at AGBT, Wash U's McGrath described how his group first tested the protocol on a human metastatic prostate cancer cell line. Not only was the group able to sequence full-length isoforms, but they also identified splice variants and novel isoforms, despite testing the technique on a well-characterized cell line. "We were happily surprised that we were able to identify [isoforms] in that genome that hadn't been identified with Illumina and 454 data," McGrath said.
In a follow-up interview with In Sequence, Vince Magrini, a research assistant professor at Wash U's Genome Institute, said that doing RNA-seq on the PacBio system has a "huge advantage" over short-read sequencing systems in that it "maintains phasing for haplotype information," which is especially important for compound mutations.
He added that the technology would not replace short-read RNA-seq experiments; rather, the two technologies are complementary.
Moving forward, McGrath said that the Wash U team is continuing to refine the RNA-seq protocols for the PacBio system, including improving the sample prep protocols.
For instance, while the Clontech SMARTer kit enables great resolution at the 3' end of the transcript, resolution declines at the 5' end, particularly for long transcripts, McGrath said. Additionally, he said that his group is interested in doing targeted RNA-seq, so it has been working to modify the SMARTer protocol to use targeted primers.
During a presentation at AGBT, Korlach highlighted other customers' use of the PacBio for transcriptome sequencing. For instance, a group from Stanford University, led by Wing Wong, used both the PacBio RS and Illumina technology to sequence the transcriptome of a well-characterized human embryonic stem cell line.
The group detected 8,084 full-length annotated isoforms and predicted an additional 5,459 isoforms through statistical methods. More than one-third of the more than 13,000 isoforms were novel. In addition, they discovered 273 RNAs from gene loci that have not previously been identified. The Stanford group published their results in the Proceedings of the National Academy of Sciences.
In another study, led by Mike Snyder from Stanford University and published in Nature Biotechnology, transcriptome sequencing of 20 pooled human organ and tissue samples identified around 14,000 isoforms, over 10 percent of which were novel. Korlach said that because of PacBio's long reads, the researchers were in many cases able to sequence through an entire RNA molecule and could readily identify alternative splice sites.
"The number of different isoforms far exceeded the existing annotation for those particular genes," Korlach said.