When assembling the transcriptomes of non-model organisms, researchers usually work without a safety net. Often there is no reference genome to guide the assembly, so they must piece together coding sequences from short reads while contending with gene duplications, transcriptional noise, and other complications.
The problem, says University of Helsinki researcher Christopher Wheat, is "huge." Though many use it, Wheat does not like the term "reference-free assembly" because "no assemblers really use a reference." In fact, he adds, "very few researchers ever quantitatively assess their assembly by comparing it back to a set of genes that should have been assembled. Most people in this situation are working with species that have no assembled transcriptome."
In 2008, Wheat and his colleagues published in Molecular Ecology one of the first papers on transcriptome assembly without a reference genome, presenting a de novo assembly of the Glanville fritillary butterfly transcriptome based on 454 pyrosequencing. Since then, others have devised their own methods to address the challenge. Wheat's current preferred method is the Trinity transcriptome assembler, which was developed by researchers at the Broad Institute and the Hebrew University of Jerusalem and published in Nature Biotechnology in May 2011.
At the end of April, researchers in France and the UK published their own solution to the problem in Molecular Ecology Resources. "Normally, assembling genomes is known to be complex because genomes are long and highly repetitive, and transcriptomes are supposed to be much less of a problem because unique sequences are more common," says the study's senior author Nicolas Galtier of the Université Montpellier. But transcriptome researchers face specific problems of their own, he adds, such as alternative splicing and the need to cover both high- and low-expression genes.
For their new study, Galtier says, "we benchmarked existing methods that have been developed for genome assembly, and we just asked, 'Does this method, and this method, and this method perform well when applied to RNA?'" His team assembled the transcriptomes of five diverse non-model animals (ant, hare, oyster, tunicate, and turtle) from newly generated 454 and Illumina reads, and obtained the highest-quality assemblies when the two data types were combined. The team analyzed the mixed 454 and Illumina reads with the Abyss and Cap3 assembly programs, and the mixed 454 and Illumina contigs with Cap3. They also tested six other assembly programs, including Celera, Trinity, and SOAPdenovo-Trans. In the end, however, they concluded that Abyss and Cap3 were the best-performing assemblers, both qualitatively and quantitatively, of all the methods they tested.
"When you use very high-throughput data, some programs cannot handle big files, so you have to do it in two steps," Galtier says. The first step is to run a program that can begin assembling the transcriptome; the second is to feed that output into another assembly program "which is more accurate, more memory-demanding," he adds. "The two steps is the way we suggest we should proceed."
Of course, the team notes that this method is not suitable for all assemblies. "Our method was to get to the specific aim of correctly assembling high-expression genes; we didn't try to identify low-expression genes," Galtier says. "We just focused on what we think is the core transcriptome." The method is useful, for example, for identifying SNPs or population genetic markers in a given gene across many species in the same group, or for a transcriptome about which the researcher knows nothing in advance, Galtier adds.