NEW YORK – A new long-read RNA sequencing benchmarking study sheds light on the strengths and weaknesses of different transcriptome analysis workflows involving various library preparation protocols, sequencing platforms, and analysis tools.
Led by an international coalition of RNA researchers, the benchmarking initiative, dubbed the Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium, generated more than 427 million long-read sequences using both the Pacific Biosciences and Oxford Nanopore Technologies platforms. In addition, it systematically evaluated computational methods submitted by more than a dozen tool developers.
The results of the consortium's analysis, which aims to establish best practices for long-read RNA-seq studies, appeared as a preprint on bioRxiv this summer.
"Many people want to use long-read RNA-seq, but they don’t know the best practices," said Kin Fai Au, a professor of computational medicine and bioinformatics at the University of Michigan and one of the organizers of the LRGASP Consortium. "While different companies promote their sequencing platforms as the best…, scientists need a fairer comparison of the technologies for [their] specific research goals."
Modeled after the RNA-Seq Genome Annotation Assessment Project (RGASP), a previous large-scale benchmarking effort for short-read RNA sequencing, LRGASP proposed three main challenges for the research community to tackle: transcript isoform detection for a well-curated eukaryotic genome, transcript isoform quantification, and de novo transcript isoform identification without a high-quality annotated genome in non-model organisms.
Through these challenges, the LRGASP organizers hoped to identify the combinations of wet lab and bioinformatics strategies that perform best in each research scenario.
The challenges represent the most popular applications of long-read RNA-seq among the research community, Au pointed out. "That's why we set these three challenges as our primary goals."
The organizers generated the RNA data for each of the challenges using a variety of sample types and sequencing approaches. For the first two challenges, they employed human and mouse cell lines with extensive available chromatin-level functional data from the Encyclopedia of DNA Elements (ENCODE) project. The samples were cultured as biological triplicates and spiked with 5’-capped RNA variants produced by Lexogen as controls.
For challenge three, which involved de novo isoform identification, the researchers used a single pooled whole-blood sample from the manatee, whose genome has not yet been well characterized.
Library prep methods included cDNA, direct RNA, Rolling Circle Amplification to Concatemeric Consensus (R2C2), which improves the accuracy of nanopore sequencing, as well as CapTrap, a cDNA library preparation method designed to detect 5’-capped, full-length transcripts. The libraries were then sequenced on PacBio Sequel II and Oxford Nanopore MinION platforms, as well as on Illumina sequencers as a control.
"I think what is unique to this benchmarking effort, compared to other ones, is that there were different sequencing platforms as well as different library prep methods involved," said Angela Brooks, a biomolecular engineering professor at the University of California, Santa Cruz who is another organizer of LRGASP. "There is a huge logistical effort for making all the data available as well as making all the challenges very clear [to the participants]."
To promote participation in LRGASP, the initiative was announced to the broader research community through word of mouth, social media, and the website of GENCODE, which is part of the scaled-up ENCODE project. The goal for the organizers was to recruit as many tool developers into the study as possible, Brooks said.
To ensure fairness and transparency, LRGASP was set up so that participants would submit their own predictions for the different challenges using data produced by the organizers. Having tool developers run their own tools meant they would likely use the best-performing parameters for each one, Brooks noted, making for a fairer comparison than if the organizers had run the tools themselves.
In the end, 14 labs participated in LRGASP, submitting 141, 143, and 25 transcriptome analysis predictions for challenges one, two, and three, respectively.
To evaluate the submissions, LRGASP organizers employed a cocktail of bioinformatic and experimental approaches. They used SQANTI3, a tool specifically developed for quality control using long-read RNA-seq data, and computed performance metrics based on RNA spike-in controls, simulated data, as well as an undisclosed, manually annotated transcript dataset curated by GENCODE.
Overall, the study unveiled "significant differences" between sequencing platforms and between library preparation methods in terms of the number and quality of reads, according to the researchers. While Oxford Nanopore cDNA sequencing of CapTrap libraries produced 10 times more reads than other platform-protocol combinations, PacBio cDNA sequencing and the R2C2 method using Oxford Nanopore provided the longest and most accurate reads.
"Interestingly, more reads did not consistently lead to more transcripts, indicating that read quality and length are important factors for transcript identification," the study authors wrote.
Additionally, they observed a noticeable influence of the analysis tools on the results and identified "fundamental differences" in the strategies of different algorithms.
Based on these results, the LRGASP organizers came up with several recommendations to improve transcriptome analysis using long-read RNA-seq.
For one, they recommended prioritizing longer and more accurate sequencing reads over a greater number of reads when it comes to transcript identification. PacBio cDNA sequencing and Oxford Nanopore sequencing using R2C2 libraries are the best options for that task, they said. However, if the goal is quantification, especially when the analysis is based on an annotated reference, cDNA nanopore sequencing would be the best choice.
"This has actually helped in my own research," Brooks said. "Moving forward, when we want to do differential expression analysis, we will try to balance between getting more reads and getting the longer, more accurate sequences."
As for choosing an appropriate bioinformatics analysis tool, the researchers concluded that it is crucial to consider the study’s objective.
For instance, if the goal is to profile sample-specific transcriptomes using a well-annotated genome, especially when only minimal novel transcripts are expected, Bambu, IsoQuant, and FLAIR are the most effective tools, the researchers noted.
Meanwhile, if a study aims to detect lowly expressed or rare transcripts, Mandalorion and FLAIR, combined with short reads, are likely to be the best options. If quantification is essential, the study recommended IsoQuant, IsoTools, and FLAIR as the ideal picks.
In addition, the researchers said that the accuracy of the transcript calls can be improved by combining multiple analysis tools.
"There is no method that you can just say, ‘I'm going to use that one, and that will find me everything,’" said Adam Frankish, who leads the manual genome annotation team at the European Bioinformatics Institute that produced the GENCODE dataset the LRGASP Consortium used as a reference. "If you want to get a comprehensive annotation, you might have to use different strategies to achieve that."
During the project, the researchers also noticed that many tools detected novel transcripts outside of the annotated reference. "What seems to be possible is that all the methods are giving a view on some set of transcripts outside the reference annotation," Frankish said, "and these things are [experimentally] validated at a reasonable rate."
Meanwhile, Ana Conesa, a research professor at the Spanish National Research Council whose lab developed SQANTI3 and helped carry out the benchmarking efforts for LRGASP, said she would like to put out "a big warning message" to the community about these novel transcripts.
"For me, one of the most interesting follow-ups of this project is how we characterize those transcripts that are rare," she said. "Yeah, for sure, there are many novel things to discover, but it's not that straightforward that the reads will give you this answer if you don't apply the proper analysis with many different layers of control."
Despite their best efforts to recruit researchers to participate, the LRGASP organizers noted that they were unable to evaluate all existing tools in their project. Additionally, with long-read sequencing technologies evolving rapidly, the study's findings may not fully reflect the latest versions of these platforms.
In fact, both PacBio and Oxford Nanopore have highlighted technological advances they have made since the study was conducted.
"We welcome the positive recommendations from the researchers, and as the latest upgraded chemistry and flow cells for our cDNA and direct RNA sequencing kits become more broadly available, we look forward to seeing further enhancements," an Oxford Nanopore spokesperson wrote in an email.
"The study mentions lower PacBio throughput, [but] this is now addressed with Revio and bulk MAS-seq," PacBio CSO Jonas Korlach said in an email, referring to the company’s newest sequencing platform and to Multiplexed Arrays Sequencing, a cDNA concatenation-based approach that was originally developed by Aziz Al'Khafaji's team at the Broad Institute.
With the new Oxford Nanopore R10 flow cell and PacBio Revio with the MAS-seq kit, "at this point, the accuracies and the throughput of both technologies are getting much closer together," said Christopher Vollmers, a biomolecular engineering professor at the University of California, Santa Cruz whose lab developed the R2C2 technology, and an organizer of the LRGASP Consortium. "As a tool developer myself, it was already tricky picking up technology back then, and it has only gotten harder."
In addition to generating the R2C2 data for the consortium, Vollmers and his team also submitted their own analysis tool, Mandalorion, for evaluation, but he did not participate in the assessment process.
The results have helped him improve the software. "We actually went ahead and used the LRGASP data to keep developing the tool," Vollmers said. "We had a benchmark with the other tools [by participating in the LRGASP], we knew where we were standing, so we had that resource to come back to."
"This is a needed study," said Winston Timp, a biomedical engineering professor at Johns Hopkins University. Timp’s lab has been using long-read RNA-seq to investigate isoforms in neurons but was not involved in the LRGASP project.
"Until this point, it was the wild west [for long-read RNA-seq studies], and there is a new sheriff in town from LRGASP, trying to come to grips with all the tools that are out there," he said. "This is a good primer on how to think about the strengths and weaknesses of the field as it stands."