NEW YORK – Riding the wave of recent technological advancements in long-read sequencing, single-cell RNA isoform analysis has not only become feasible but is snowballing into a thriving field.
The approach, based on the ability of long-read platforms to capture full-length transcriptomes, allows scientists to take a closer look at mRNA isoforms at the single-cell level, which has largely been impossible with short-read sequencing. By doing so, researchers are hoping to better grasp cell type-specific isoform diversity and the role of isoforms in disease and biology.
"Given the improved productivity of both Oxford Nanopore and PacBio, I think we're entering the stage where we're going to be able to get a lot of isoform data from single-cell or spatial sequencing, which is going to change the game," said Winston Timp, a biomedical engineering professor at Johns Hopkins University whose lab has been using long reads to investigate single-cell isoforms in neurons.
Derived from the same gene but differing in sequence, mRNA isoforms often serve as a way for a gene to diversify its protein-coding capacities, tapping mechanisms such as alternative splicing and intron retention.
"Pretty much every gene can generate multiple isoforms," said Hagen Tilgner, a professor at Weill Cornell Medicine who is a pioneer in the single-cell isoform analysis field.
Borrowing the analogy of cooking, Tilgner said mRNA isoforms can be considered as variations of a recipe to prepare different proteins. While some tweaks can be minor, such as adding a pinch more salt, he explained, other changes may be more substantial, such as substituting beef with tofu.
Because there is currently "a big debate in the field about how many of these distinct isoforms are actually functionally different," he added, it is important to first drill down on how many isoforms are out there and figure out which ones are implicated in diseases such as cancer or Alzheimer's.
Previously developed tools have allowed researchers to extrapolate isoform information from short-read data. However, these approaches, which typically rely on computational algorithms, often come with limitations. Since short reads generally cannot traverse multiple splice boundaries, for example, it is very hard to confidently say which isoform a particular read came from.
"It's always easier if you could just see it directly," Timp said. "If you could just sequence the whole transcript, that is always going to be better than having to do complex inference and modeling from the short-read data."
He noted that the 10x Genomics platform already generates full-length cDNA molecules for single-cell transcriptome analysis before chopping them up to make short-read sequencing libraries. Therefore, it is a natural fit to deploy Pacific Biosciences or Oxford Nanopore Technologies platforms to study 10x single-cell libraries, he said.
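The short-read ambiguity described above can be made concrete with a minimal Python sketch. It assumes a toy gene with hypothetical exon labels (e1 to e4) and three hypothetical isoforms; the function name compatible_isoforms is illustrative and does not come from any published tool. A read is treated as an ordered run of exons, and an isoform is "compatible" if it contains that run contiguously.

```python
from typing import Dict, List, Sequence


def compatible_isoforms(read_exons: Sequence[str],
                        isoforms: Dict[str, List[str]]) -> List[str]:
    """Return the isoforms that contain the read's exons as a contiguous run."""
    n = len(read_exons)
    target = list(read_exons)
    hits = []
    for name, exons in isoforms.items():
        # Slide a window of the read's length along the isoform's exon chain.
        if any(exons[i:i + n] == target for i in range(len(exons) - n + 1)):
            hits.append(name)
    return hits


# Hypothetical annotation: three isoforms of one gene (toy data, not real).
ISOFORMS = {
    "iso_A": ["e1", "e2", "e3", "e4"],
    "iso_B": ["e1", "e2", "e4"],
    "iso_C": ["e1", "e3", "e4"],
}

# A short read that spans a single splice junction is ambiguous.
short_read_hits = compatible_isoforms(["e1", "e2"], ISOFORMS)

# A long read that spans every junction of the transcript is not.
long_read_hits = compatible_isoforms(["e1", "e2", "e4"], ISOFORMS)
```

In this toy case the short read is consistent with two isoforms, while the read covering the whole exon chain matches only one, which is the kind of direct observation Timp describes.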
Pushing the throughput boundary for PacBio
Even though PacBio has been offering an Iso-Seq workflow for years that is capable of capturing full-length RNA isoforms, the throughput of the company’s platform had been a bottleneck for pushing the application into single cells.
"PacBio's yield was too low previously," Timp said. "Even with the Sequel IIe, you would get on the order of a couple million reads at most."
However, Timp noted that one way PacBio overcame this problem was by adopting a method named Multiplexed Arrays Sequencing (MAS-Seq), a cDNA concatenation-based approach that was originally developed by Aziz Al'Khafaji's team at the Broad Institute of Harvard and MIT.
"For things like RNA sequencing, you need quite a number of reads," said Al'Khafaji, who helps lead genomics technology development at the Broad Institute. "As time went on, [PacBio's] read throughput started increasing, but it was still pretty far away from where it needed to be."
Even so, compared with short-read sequencing's "pretty steep barrier" to achieving full-length isoform analysis, Al'Khafaji said, he believes it was "much more tractable to fix the throughput problem on long-read sequencers, particularly PacBio."
His group developed MAS-Seq, which concatenates cDNA molecules from single-cell platforms for long-read sequencing, producing data that can be bioinformatically converted to the original molecules. The approach effectively results in substantially higher throughput while reducing the amount of sequencing needed.
MAS-Seq takes advantage of PacBio's "sweet spot for sequencing 15-kB to 20-kB [DNA] molecules," Al'Khafaji said. "It turns out that the lion's share of isoforms falls below 3 kB and 4 kB," he explained. By concatenating cDNAs into arrays of molecules 15 kB to 20 kB in size, "you essentially boost the output of your sequencing run by the number of cDNAs that you can stitch into the array," he said.
PacBio has licensed the MAS-Seq technology from the Broad Institute and commercialized it into a kit, which is compatible with the 3' single-cell cDNA library generated using the 10x Chromium platform. According to PacBio, the MAS-Seq kit can deliver a 16-fold increase in sequencing throughput and can be used on its Sequel II systems as well as the new Revio platform. Additionally, MAS-Seq single-cell data can be analyzed using the firm's SMRT Link analysis software.
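The arithmetic behind the concatenation approach can be sketched in a few lines of Python. This is a back-of-the-envelope illustration only: the 16-kb array length and 1-kb mean cDNA length are assumed values chosen to reproduce the 16-fold figure, the 2 million reads stand in for Timp's "a couple million," and the calculation ignores adapter sequence and incomplete arrays. The function names are hypothetical.

```python
def cdnas_per_array(array_length_bp: int, mean_cdna_length_bp: int) -> int:
    """Whole cDNAs that fit in one concatenated array (adapters ignored)."""
    if array_length_bp <= 0 or mean_cdna_length_bp <= 0:
        raise ValueError("lengths must be positive")
    return array_length_bp // mean_cdna_length_bp


def effective_cdna_reads(array_reads: int, fold_gain: int) -> int:
    """cDNA reads recovered once each array read is split back into its parts."""
    return array_reads * fold_gain


# Assumed inputs: a 16-kb array built from cDNAs averaging 1 kb.
fold = cdnas_per_array(16_000, 1_000)

# Assumed run yield of 2 million array reads.
total_cdna_reads = effective_cdna_reads(2_000_000, fold)
```

Under these assumptions, a run that would otherwise yield about 2 million transcript reads yields about 32 million after the arrays are split. Longer cDNAs reduce the gain, since fewer of them fit into each array.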
Al'Khafaji noted MAS-Seq is also compatible with PacBio's recently launched Revio sequencer, which promises to improve the HiFi read throughput by more than an order of magnitude. "Revio doesn't decrease the utility of MAS-Seq, it only opens more doors," he said. "People will run more samples that they never would have before."
Improving sequencing accuracy for Oxford Nanopore Technologies
While PacBio sequencing has been hampered by throughput concerns, sequencing accuracy has historically posed a hurdle for Oxford Nanopore Technologies when it comes to scRNA-seq data.
"The problem with nanopore and 10x data has been, up to this point, that the barcodes and the [unique molecular identifiers] are designed for Illumina sequencing," Timp noted. "They're designed for very accurate reads," so demultiplexing has been difficult with nanopore data.
However, with Oxford Nanopore rolling out its newest Q20+ chemistry, Timp said, "the latest accuracy for nanopore is now good enough that you could do things" using appropriate tools.
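Why read accuracy matters for demultiplexing can be shown with a simplified Python sketch. It assumes a hypothetical three-entry barcode whitelist and compares barcodes by Hamming distance, which counts substitutions only. Nanopore errors also include insertions and deletions, so real tools generally need more than this, and the function names here are illustrative rather than taken from any specific package.

```python
from typing import Optional, Sequence


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(observed: str, whitelist: Sequence[str],
                   max_mismatches: int = 1) -> Optional[str]:
    """Return the unique closest whitelist barcode, or None if the
    observed barcode is too far away or equally close to several."""
    scored = [(hamming(observed, bc), bc)
              for bc in whitelist if len(bc) == len(observed)]
    if not scored:
        return None
    best = min(dist for dist, _ in scored)
    if best > max_mismatches:
        return None
    winners = [bc for dist, bc in scored if dist == best]
    return winners[0] if len(winners) == 1 else None


# Hypothetical whitelist; two of the entries differ at a single base.
WHITELIST = ["ACGTACGT", "TTGCAAGC", "ACGTACGA"]

exact = assign_barcode("ACGTACGT", WHITELIST)      # error-free read
rescued = assign_barcode("TTGCAAGG", WHITELIST)    # one error, still unique
ambiguous = assign_barcode("ACGTACGC", WHITELIST)  # one error, two candidates
```

The third case is the problem in miniature: when barcodes sit close together, as they can when designed for highly accurate reads, a single sequencing error can leave a read equally near two cells' barcodes, and the read cannot be assigned.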
"ONT has been working on the accuracy quite a bit," agreed Christopher Vollmers, a biomolecular engineering professor at the University of California, Santa Cruz whose lab has been developing tools for single-cell isoform analysis. "The new R10 flow cells that they are selling now are actually really good."
An early-access customer for Oxford Nanopore, Vollmers said even though the company’s latest product does not "quite get a median accuracy of 99 percent" in his lab, it is "getting close."
Median accuracy has been 98.7 percent in his lab, he said, which is "definitely more accurate than the R9 flow cells were, and the throughput doesn't seem to be taking a hit."
Previously, to compensate for the shortfall of single-read accuracy of the Oxford Nanopore platforms, Vollmers’s team developed a method dubbed Rolling Circle Amplification to Concatemeric Consensus (R2C2).
A well-known method within the community, R2C2 circularizes cDNA molecules using Gibson assembly and amplifies them by rolling circle amplification. The resulting libraries are then sequenced, leveraging the long-read capability of nanopore sequencing to generate consensus reads with increased base accuracy. This, in turn, enables researchers to carry out single-cell isoform analysis accurately and cost-effectively using the Oxford Nanopore platforms.
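The idea of building a consensus from repeated copies of one molecule can be illustrated with a toy Python sketch. This is not the R2C2 pipeline itself: it assumes the copies have already been separated and aligned to equal length, and it uses a simple per-position majority vote on made-up sequences. The function name is hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_consensus(copies: Sequence[str]) -> str:
    """Per-position majority vote over pre-aligned, equal-length copies."""
    if not copies:
        raise ValueError("need at least one copy")
    length = len(copies[0])
    if any(len(c) != length for c in copies):
        raise ValueError("copies must be aligned to equal length")
    consensus = []
    for column in zip(*copies):
        # most_common(1) returns the most frequent base in this column;
        # ties resolve to the base that appeared first.
        consensus.append(Counter(column).most_common(1)[0][0])
    return "".join(consensus)


# Five toy copies of one molecule, three of which carry a single
# error each, at different positions.
COPIES = [
    "ACGTTGCA",
    "ACGATGCA",  # error at position 4
    "ACGTTGCA",
    "TCGTTGCA",  # error at position 1
    "ACGTTGGA",  # error at position 7
]

consensus = majority_consensus(COPIES)
```

Because the errors in this example fall at different positions in different copies, each one is outvoted. Errors that recur at the same position across copies would survive a vote like this.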
As Oxford Nanopore's technology continues to improve, Vollmers noted, the quality of R2C2 reads has been getting better and better. "With nanopore raw accuracy going up over the years, so has our R2C2 accuracy," he said. "That means, these days, our R2C2 method actually is as or more accurate than Iso-Seq reads."
In addition, while the accuracy of nanopore sequencing is "probably good enough to do really good isoform analysis," Vollmers said, the platform is still "heavily biased towards the shorter molecules." As a result, the molecules analyzed by the sequencer are not necessarily representative of the sample, he noted, but R2C2 can help mitigate this shortcoming by "hiding the actual length of the cDNA from the nanopore sequencer."
Despite R2C2's capability to effectively transform an Oxford Nanopore platform into a consensus-style sequencer — similar to PacBio — for cDNA analysis, Vollmers said the method, which is supported by his academic lab but has not been turned into a kit so far, can still be "a bit rough around the edges" compared with a fully commercialized product such as MAS-Seq.
"It's still mostly me just trying to support people adopting [R2C2]," he said. "Having a company behind [a product] and customer support, and the ability to return a kit if it doesn't work for you, I think it's worth a lot."
Downstream data analysis
While emerging single-cell isoform sequencing workflows unlock new insights, scientists also face the task of developing analysis pipelines that can accurately interpret these data.
To that end, Vollmers's group has developed an isoform identification tool named Mandalorion, which can process single-cell RNA-seq data from both R2C2 libraries sequenced on nanopore instruments and Iso-Seq data produced by PacBio.
Still, as the field continues to expand, Vollmers said, data analysis will likely become a computational challenge, especially given that researchers are now dealing with a new type of data.
"The promise of long-read sequencing, in a sense, was that we really just sequence the molecule straight as it is …, so everybody thought [the analysis] was going to be easy," Weill Cornell's Tilgner said. "But it turns out that is actually not the case."
"The fundamental problem is that we never really had a ground truth to know what the algorithms should be achieving," he added.
In January, Tilgner's team published in Nature Biotechnology an analysis tool called IsoQuant, which uses intron graphs to help determine isoforms from long-read RNA-seq data.
The method "turned out to be working very well" and has already demonstrated it can process data at the single-cell level, said Andrey Prjibelski, first author on the paper and Tilgner's collaborator at the University of Helsinki in Finland.
For novel transcript discovery, the researchers showed in their study, IsoQuant can reduce the false-positive rate for Oxford Nanopore data fivefold when paired with reference genome annotation and 2.5-fold in the reference-free mode. Similarly, the algorithm boosted the performance of PacBio data, the authors observed.
The IsoQuant method was built upon knowledge attained from the team's previous study, which explored platform-specific error patterns of both PacBio and Oxford Nanopore instruments when sequencing individual cDNA molecules.
"If you have absolutely random error, maybe you can just avoid it statistically," Prjibelski noted. "But if you have a certain bias at some positions of the genome where you have a certain motif, which complicates the alignment, then it becomes more tricky."
"My take on it was that we really needed something that would put emphasis on precision," Tilgner said. "By precision, I mean to make sure that you don't report every sequencing error as a new isoform."
Figuring out potential pitfalls
Beyond platform-specific sequencing errors, there is also a need to systematically study other potential pitfalls of single-cell isoform analysis.
"We have to answer serious questions about biases that might be creeping in when using the single-molecule sequencing tools for single cells," Timp said. "I don't think that people have done thorough controlled studies to see … what we might be missing or what other biases might be happening."
"This is something we have been dealing with," said Vollmers. "If long-read sequencing, PacBio or ONT, is the only way to sequence a full-length transcript, how would you validate that that transcript is real?"
Efforts in the community are underway, however, to help bridge this knowledge gap. For instance, a group of investigators, including Vollmers, has formed the Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) consortium.
Led by UC Santa Cruz professor Angela Brooks, LRGASP hopes to conquer the challenges of long-read sequencing for transcriptomics analysis by looking at different parts of the workflow, such as library preparation, sequencing platform, and computational analysis tools. Preliminary results are currently available on the Nature Methods server as a registered report.
By proceeding cautiously and methodically, researchers who are early adopters of single-cell RNA isoform analysis are hoping to lay the groundwork for uptake by the broader community as the field continues to thrive.
"We have established that this works," Vollmers said. "Once the technology is stable enough, then labs can just pick it up – that's when it gets really exciting."