By Monica Heger
This article was originally published Aug. 15.
Assembling transcriptomes without a reference can be tricky, but some researchers are now turning to de novo assembly from RNA-seq data to better understand the biology of organisms while avoiding the higher costs of whole-genome sequencing.
Researchers from Aviv Regev's lab at the Broad Institute and Nir Friedman's lab at the Hebrew University in Jerusalem recently developed a method for de novo assembly from RNA-seq reads, which they published in Nature Biotechnology.
Researchers from the University of Maryland and elsewhere are using the method to understand biological diversity among closely related species without reference genomes.
The technique, dubbed Trinity, differs from transcriptome reconstruction methods such as Scripture and Cufflinks that "heavily rely on mapping reads to a reference and then doing an assembly," said Moran Yassour, a co-lead author of the Nature Biotechnology study.
Unlike such mapping-first techniques, Trinity takes an assembly-first approach. Mapping-first techniques are simpler because researchers need only align sequences to a reference, but they are biased toward the reference genome and are therefore more likely to miss alternative splice sites and differential and rare transcripts. They also will not work when there is no reference genome, or no high-quality one.
In cases where there is not a reference, "you can't do much with the single [RNA-seq] reads," Yassour said. "But, if you assemble the reads and get a longer transcript, you can then map those to a closely related reference or an unfinished reference."
The technique can also be applied to transcriptome sequencing of cancer samples, allowing researchers to identify "aberrant transcripts," Yassour added.
Trinity has three main components. The first takes the initial RNA-seq reads and creates longer, linear sequences.
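One common way to turn short reads into longer linear sequences is greedy k-mer extension: count every k-mer in the reads, start from the most frequent one, and repeatedly append the best-supported overlapping k-mer. The Python sketch below illustrates that general idea only; the article does not describe how Trinity's first component works internally, and the function names, the k value, and the toy transcript are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def greedy_contig(counts, k):
    """Grow one contig from the most frequent k-mer, first rightward, then leftward."""
    # Sorting first makes the choice among equally frequent seeds deterministic.
    seed = max(sorted(counts), key=lambda kmer: counts[kmer])
    used = {seed}
    contig = seed

    def best_neighbor(overlap, to_right):
        # Try all four bases and keep the unused k-mer with the highest count.
        options = []
        for base in "ACGT":
            kmer = overlap + base if to_right else base + overlap
            if kmer in counts and kmer not in used:
                options.append((counts[kmer], base, kmer))
        return max(options) if options else None

    while (hit := best_neighbor(contig[-(k - 1):], True)) is not None:
        _, base, kmer = hit
        used.add(kmer)
        contig += base
    while (hit := best_neighbor(contig[:k - 1], False)) is not None:
        _, base, kmer = hit
        used.add(kmer)
        contig = base + contig
    return contig


# Toy data (hypothetical): overlapping 8-base "reads" tiled across a made-up transcript.
transcript = "ATGGCGTTCACCTAG"
reads = [transcript[i:i + 8] for i in range(len(transcript) - 7)]
contig = greedy_contig(count_kmers(reads, 4), 4)
print(contig)
```

On this error-free toy input the contig spans the whole made-up transcript. Real reads contain sequencing errors and repeats, which is where a purely greedy approach falls short and graph-based steps become useful.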
The second step incorporates a de Bruijn graph, the same approach used in whole-genome assemblers such as Velvet. The de Bruijn graph is a computational method that "enumerates all possible solutions," the authors wrote in the study. "For transcriptome assembly, each path in the graph represents a possible transcript."
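To make the idea of paths as candidate transcripts concrete, here is a minimal Python sketch of a de Bruijn graph built from reads, followed by an enumeration of every path between two nodes. It is an illustration under stated assumptions, not Trinity's implementation: the function names, the choice of k = 4, and the two toy isoforms (which share their first and last segments) are all hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def enumerate_paths(graph, start, end):
    """Spell out every simple path from start to end as a sequence."""
    results = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            # First node in full, then the last base of each following node.
            results.append(path[0] + "".join(n[-1] for n in path[1:]))
            continue
        for nxt in graph.get(node, ()):
            if nxt not in path:  # never revisit a node, so cycles are skipped
                stack.append((nxt, path + [nxt]))
    return sorted(results)


def shred(sequence, read_len):
    """Tile overlapping error-free 'reads' across a sequence (toy data only)."""
    return [sequence[i:i + read_len] for i in range(len(sequence) - read_len + 1)]


# Two made-up isoforms: the long one carries a middle segment the short one skips.
long_isoform = "ATGGCGTTCACCTAG"
short_isoform = "ATGGCCCTAG"
reads = shred(long_isoform, 8) + shred(short_isoform, 8)

graph = build_de_bruijn(reads, 4)
paths = enumerate_paths(graph, "ATG", "TAG")
print(paths)
```

In this toy case the graph branches where the isoforms diverge and rejoins where they converge, so the two enumerated paths spell out the two isoforms. With real data the number of possible paths grows quickly, which is why a later step is needed to separate plausible transcripts from the rest.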
Finally, the third component, an algorithm called Butterfly, takes all the different graphs and reconstructs them into transcripts, said Yassour. A scoring scheme is applied to compute the "plausible" transcripts and discard the "nonsensical solutions," according to the study.
This step extracts the longer sequences that correspond to transcripts, finds the splice isoforms, and is able to give an "estimation about the abundance of each isoform," Yassour explained.
Additionally, if there is not a reference genome, "when we do the assembly, different paralogs will assemble together because they have regions in the sequence that are similar and regions that are different. So, two different genes could assemble into the same graph," Yassour explained. This third tool within Trinity is where those graphs are disentangled.
The main challenge of the assembly-first method is in discerning whether a variant is a true variant or a sequencing error. "If you have the reference genome, you can say that if it doesn't match, it's not real," Yassour said. But, on the other hand, "you don't want to be biased toward a reference genome."
To get around this problem, Trinity incorporates a model that estimates the likelihood that a variant is real by examining variants from the same graph.
Currently, the team is using Trinity for transcriptome sequencing experiments run on the Illumina Genome Analyzer. They've tested it with 76-base-pair and 51-base-pair reads. Longer reads and paired-end reads tend to yield better results, said Yassour.
When the team tested the method on yeast, they found that from 50 million paired-end reads, they were able to fully reconstruct 86 percent of the annotated transcripts. Of the 276 transcripts that were not fully reconstructed, around one-third were reconstructed to over 90 percent of their length, and 64 percent were reconstructed to at least half their length. The authors said that Trinity outperformed all assembly-first and mapping-first methods in reconstructing the yeast transcriptome.
For the mouse transcriptome, Trinity reconstructed significantly more transcripts to full length than other de novo methods, such as ABySS, Trans-ABySS, and SOAPdenovo. Trinity reconstructed 8,185 transcripts, compared to 7,025 by Trans-ABySS, 5,561 by ABySS, and 761 by SOAPdenovo. The mapping-first approaches were more sensitive, however, as Cufflinks reconstructed 9,010 transcripts and Scripture reconstructed 9,086.
Charles Delwiche, a professor of cell biology and molecular genetics at the University of Maryland, is using transcriptome sequencing and assembly with Trinity to understand biological diversity in protists, and in particular, algae.
"We're working on organisms that are not model organisms and are poorly understood," he told In Sequence.
For those organisms, such as dinoflagellates, whole-genome sequencing doesn't always make sense, nor is there always the funding to do it.
"We can't have a multimillion-dollar genome project on every little obscure bug that might happen to swim by in the ocean," he noted. Instead, Delwiche's team uses transcriptome sequencing, which "limits our study to a subset of the genome, but one that has a lot of biological interest, the expressed genes."
His group has tested transcriptome sequencing and assembly using Sanger sequencing, Roche's 454 GS FLX, and the Illumina Genome Analyzer, and presented data from a National Science Foundation-funded study at a conference hosted by BGI this June.
Focusing on an algal species called Polarella glacialis, Delwiche's team used both 454 and Illumina to sequence the transcriptome. While 454 sequencing produced longer reads, making assembly less complicated, Illumina produced much more data and was able to deal with homopolymeric repeats better than the 454, he said.
Using 75-base single-end reads on the Illumina, the team generated around 4.9 gigabases of data, compared to 48.2 megabases using the 454. After sequencing, the team tried to assemble the individual transcripts using the Trinity assembler, which is "designed to work well with transcriptome sequence," Delwiche said.
The deep sequencing from the Illumina GA plus the Trinity assembler allowed the team to assemble a number of large transcripts, including a 7.6-kilobase transcript for the protein dynein, which acts as a molecular motor. "Dynein is encoded by a very large gene," said Delwiche. "I was surprised by what a large transcript we were able to see with the Illumina data. … We actually got one assembly for dynein that was around 11 kilobases long, from 75-base reads."
Moving forward, Delwiche said that he will continue to do de novo transcriptome sequencing and assembly using Illumina and Trinity, and will also continue to evaluate other technologies, including the Pacific Biosciences RS.