
JHU Researchers Develop New RNA-Seq Assembly Method that Offers More Complete, Accurate Transcript Reconstructions


NEW YORK (GenomeWeb) – Researchers from Johns Hopkins University have developed a new computational tool for reconstructing transcripts from RNA sequencing that they say is faster and generates more complete transcriptome assemblies than is possible with existing programs.

According to a paper published in Nature Biotechnology, the so-called StringTie software combines techniques from optimization theory with concepts from de novo genome assembly to reassemble reads from transcriptome sequencing experiments. Its approach to assembly results in more complete and accurate reconstructions of genes compared to other transcript assembly programs such as Cufflinks, IsoLasso, Scripture, and Traph, the researchers wrote. In tests involving multiple simulated and real datasets, they report that the software correctly identified 36-60 percent more transcripts than the second best performing assembler and produced expression levels that were much closer to the actual values. 

StringTie was created in the laboratory of Steven Salzberg, a JHU professor of biomedical engineering, computer science, and biostatistics. It was developed by Mihaela Pertea, a computer scientist by training and an assistant professor in JHU's Institute of Genetic Medicine. She is the first author on the paper. Members of Salzberg's lab are also responsible for developing popular RNA-seq tools such as Cufflinks, which was created in collaboration with researchers at the University of California Berkeley and is probably the most widely used software for assembling transcripts and estimating their abundances. Other well-known tools from the JHU lab include Bowtie, which is a fast, memory-efficient short read aligner, and TopHat, which is a splice junction mapper for RNA-seq reads. 

StringTie — named to fit in with the clothing theme and also a reference to the name given to a sequence of characters in computer science — uses a network flow algorithm and concepts from de novo genome assembly to improve transcript assembly. Specifically, it "groups the reads into clusters, then creates a splice graph for each cluster from which it identifies transcripts, and then for each transcript it creates a separate flow network to estimate its expression level using a maximum flow algorithm," according to the paper.

One way to think about StringTie is to picture a network of water pipes with multiple bends and forks, Salzberg explained in an interview with GenomeWeb. It's possible to look at the structure and compute the largest quantity of water that can be forced through the pipes, keeping in mind the constraint that the same amount of water that goes in one end of a pipe has to come out at the other. Bringing the analogy home to StringTie, "We think of these exons like little pipes, and we want to assign as many reads as we can to each one ... in order to maximize the flow," he said.
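The pipe analogy corresponds to the classic maximum flow problem. The following is a minimal sketch of that general idea using the Edmonds-Karp algorithm on a toy network, not StringTie's actual flow network construction. The node names (`exon1` and so on) and the capacities are hypothetical values chosen purely for illustration.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow on a dict-of-dicts capacity graph."""
    # Build residual capacities, adding zero-capacity reverse edges.
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + cap
            residual[v].setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search for a shortest path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path is left

        # Find the narrowest "pipe" along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            u = parent[v]
            bottleneck = min(bottleneck, residual[u][v])
            v = u

        # Push that much flow along the path; what enters a node leaves it.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical toy network: capacities stand in for read coverage.
toy_network = {
    "source": {"exon1": 10},
    "exon1": {"exon2": 6, "exon3": 5},
    "exon2": {"exon4": 8},
    "exon3": {"exon4": 3},
    "exon4": {"sink": 10},
}

toy_flow = max_flow(toy_network, "source", "sink")
print(toy_flow)
```

In this toy network the flow is limited by the `exon1` to `exon2` and `exon3` to `exon4` pipes together, which is the kind of bottleneck the water analogy describes.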

Key to its improved performance is StringTie's ability to take into account the depth of coverage of each isoform during the assembly process, Salzberg said. Methods like Cufflinks split this step, first figuring out which transcripts are present and then estimating their expression levels separately — but performing those two steps simultaneously results in much better assemblies. As the researchers note in the paper, "When assembling a whole genome, coverage is a crucial parameter that must be used to constrain the algorithm; otherwise an assembler may incorrectly collapse repetitive sequences. Similarly, when assembling a transcript, each exon within an isoform should have similar coverage, and ignoring this parameter may produce sets of transcripts that are parsimonious but wrong."

Like other transcript assembly programs, StringTie takes in spliced read alignments as its primary input, but it also takes in additional sets of  pre-aligned contigs — referred to as super-reads in the paper — which are assembled from shorter read pairs. These super-reads can be assembled from shorter reads that come from gene fragments that don't contain repetitive sequences and have no alternatively spliced portions. So, for example, if a pair of reads each 100 bases long were sequenced from opposite ends of an RNA fragment that is 300 base pairs long, barring any splice variants in the sequence between the two shorter reads, they can be merged into a longer read and used instead of the two shorter reads. 
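The 300-base example above can be written out as simple interval arithmetic. This is only an illustrative sketch that assumes both reads are already placed on the same unspliced fragment. It is not the procedure the paper uses to build super-reads, and the function name `merge_read_pair` and its return fields are hypothetical.

```python
def merge_read_pair(left, right):
    """Merge two reads given as half-open (start, end) intervals.

    Assumes both reads lie on the same fragment and that no splice
    variant falls between them (the condition stated in the text).
    """
    left_start, left_end = left
    right_start, right_end = right
    if left_start > right_start:
        raise ValueError("left read must start at or before the right read")

    # Bases between the two reads that the merged read would also cover.
    gap = max(0, right_start - left_end)
    span_end = max(left_end, right_end)
    return {
        "span": (left_start, span_end),
        "length": span_end - left_start,
        "gap_filled": gap,
    }


# Two 100-base reads from opposite ends of a 300-base fragment.
merged = merge_read_pair((0, 100), (200, 300))
print(merged)
```

Here the two 100-base reads plus the 100 bases between them yield a single 300-base read that can stand in for the pair.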

According to the paper, using the super-reads offers only "a modest additional improvement in accuracy." That is because, according to Salzberg, the researchers adopted a conservative approach to merging the reads to avoid mis-assemblies, essentially only generating a super-read if the read pairs in question were within the same exon or in different exons that were included in the same isoform. The team is working on expanding the software to make use of much longer super-reads, he said.

Compared with other transcript assembly methods, StringTie proved "substantially more accurate at both assembly and quantitation of gene transcripts, recovering more expressed transcripts while demonstrating higher precision," according to the Nature Biotech paper. The paper includes the results of comparison tests between StringTie and multiple transcript assembly packages on three human RNA-sequencing datasets from the ENCODE project and an internally generated sample from a kidney cell line; and on two simulated datasets.

In tests using 90 million reads from a human sample, StringTie correctly assembled nearly 11,000 transcripts, compared with just over 7,000 transcripts in the next best assembly, generated by Cufflinks, a 53 percent increase in transcripts assembled. On the simulated datasets, StringTie assembled over 7,500 transcripts compared with roughly 6,300 for Cufflinks, its nearest competitor, or about 20 percent more. The tests with simulated data also showed that the expression levels that StringTie generated were much closer to the true expression levels than those generated by competing solutions.

According to the developers, StringTie also returns results faster than competing transcript assembly solutions like Cufflinks. StringTie required less than 30 minutes to assemble the two simulated datasets, while the other four programs tested as part of the study required between 81 minutes and 48 hours to complete their assemblies. On the real datasets, StringTie required between 35 and 76 minutes to complete its analysis, over three times faster than the closest competing program. StringTie also had some of the smallest memory requirements of all the programs tested, needing between 1.6GB and 12GB of memory compared to between 6.4GB and 26.6GB required by Cufflinks, IsoLasso, and Scripture, according to the paper.
