Developers of the Cufflinks RNA-seq analysis software package have released a new version of the Cuffdiff module that can perform differential expression analysis at both the gene level and the transcript level in a single workflow — ultimately helping researchers gain a better understanding of gene regulation from RNA-seq studies.
The developers, from Harvard University, the Broad Institute, Massachusetts Institute of Technology, and the University of California, Berkeley, describe Cuffdiff 2 in a paper published earlier this month in Nature Biotechnology.
In the paper, the developers explained that the new algorithm gives a much clearer picture of gene and transcript expression than other expression analysis packages. It does so by handling the statistical challenges of deriving gene and isoform expression values from sequence reads while also accounting for sources of variability in measurements taken from biological replicates.
RNA-seq offers the ability to study expression at the isoform level as well as measure expression changes between conditions in different experiments, but until now, these two “themes” have not been “married well,” Lior Pachter, a computational biologist at UC Berkeley and a co-author on the Nature Biotech paper, explained to BioInform.
Tools such as DESeq and edgeR take a "count-based" approach to differential analysis that assumes that input RNA-seq data "are the number of perfectly and unambiguously mapped fragments that originate from each gene or transcript in each library," the authors wrote. However, such methods don't account for uncertainties that may arise from ambiguous reads or in cases where genes have many isoforms, they said.
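To make the ambiguity problem concrete, the toy sketch below shows one generic way that fragments compatible with more than one isoform can be split fractionally, using a plain expectation-maximization loop. It is an illustration only and is not taken from Cuffdiff, DESeq, or edgeR. The function name `em_fragment_counts`, the isoform labels, and the fragment numbers are hypothetical, and the sketch ignores transcript length and sequencing bias, which a real tool would need to model.

```python
def em_fragment_counts(fragments, n_iter=200):
    """Estimate per-transcript fragment counts from compatibility sets.

    fragments: list of sets; each set names the transcripts a fragment
    could have come from (hypothetical input format for this sketch).
    """
    transcripts = sorted(set().union(*fragments))
    # Start from equal relative abundances.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = {t: 0.0 for t in transcripts}
    for _ in range(n_iter):
        # E-step: split each fragment across its compatible transcripts
        # in proportion to the current abundance estimates.
        counts = {t: 0.0 for t in transcripts}
        for compatible in fragments:
            norm = sum(theta[t] for t in compatible)
            for t in compatible:
                counts[t] += theta[t] / norm
        # M-step: re-estimate relative abundances from the expected counts.
        total = sum(counts.values())
        theta = {t: counts[t] / total for t in transcripts}
    return counts


# Hypothetical gene with two isoforms, A and B:
# 60 fragments map only to A, 20 only to B, and 20 fit both.
fragments = [{"A"}] * 60 + [{"B"}] * 20 + [{"A", "B"}] * 20
counts = em_fragment_counts(fragments)
print({t: round(c, 2) for t, c in counts.items()})
```

A purely count-based input would have to either discard the 20 shared fragments or assign them arbitrarily. The fractional estimate instead carries uncertainty that grows as more of a gene's fragments are shared among its isoforms.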
Cuffdiff 2 takes a step beyond these methods by modeling variability in the number of fragments generated by each transcript across replicates. "This enables it to dynamically control for uncertainty in highly complex or insufficiently sequenced genes," the authors explained.
By way of illustration, Pachter said that a researcher studying two disease conditions would start by running experiments on multiple biological replicates for each condition, generating a separate set of RNA-seq reads for each replicate.
“The procedure we employ is … to take those replicates and assess what the variability in expression is for transcripts as a function of their abundance,” or in other words, “assess the extent of the variability from the replicates themselves,” he explained.
Then “we take the reads from each of these experiments and assess for each transcript how many reads we estimate came from that transcript,” taking into account ambiguous mapping.
Finally, “we employ a statistical procedure, which is in effect to mix together these sources of uncertainty in a statistically sound way to produce an overall estimate of how much variability we would expect to see in the reads if we are doing replicates and have uncertain features in mapping,” Pachter said.
“This gives us an authority to compare different conditions because we have an understanding of how much variability we expect in each condition so that we can say whether these two conditions look like they’ve changed by a statistically significant amount,” he said.
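The three steps Pachter describes can be sketched, very loosely, as follows. This is a simplified illustration under stated assumptions and not the model used in Cuffdiff 2: it assumes that cross-replicate variability follows an overdispersed form (mean plus dispersion times mean squared), that uncertainty from ambiguous read assignment can simply be added to it, and that a normal approximation on the log fold change is adequate. The function names and all numbers are hypothetical.

```python
import math


def total_variance(mean_count, dispersion, assignment_var):
    """Combine two sources of uncertainty for one transcript (toy model).

    mean_count + dispersion * mean_count**2 stands in for variability
    across biological replicates as a function of abundance;
    assignment_var stands in for uncertainty from ambiguous mapping.
    """
    return mean_count + dispersion * mean_count ** 2 + assignment_var


def log_fold_change_test(mean_a, var_a, mean_b, var_b):
    """Two-sided z-test on the log fold change (normal approximation)."""
    lfc = math.log(mean_b / mean_a)
    # Delta method: Var(log X) is approximately Var(X) / mean(X)**2.
    se = math.sqrt(var_a / mean_a ** 2 + var_b / mean_b ** 2)
    z = lfc / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return lfc, z, p_value


# Same twofold change in both scenarios (hypothetical numbers).
mean_a, mean_b, dispersion = 100.0, 200.0, 0.01

# Scenario 1: every fragment maps unambiguously.
_, _, p_certain = log_fold_change_test(
    mean_a, total_variance(mean_a, dispersion, 0.0),
    mean_b, total_variance(mean_b, dispersion, 0.0),
)

# Scenario 2: heavy isoform ambiguity inflates the variance.
_, _, p_uncertain = log_fold_change_test(
    mean_a, total_variance(mean_a, dispersion, 2000.0),
    mean_b, total_variance(mean_b, dispersion, 4000.0),
)

print(f"unambiguous: p = {p_certain:.4f}")
print(f"ambiguous:   p = {p_uncertain:.4f}")
```

In this toy setup the same twofold change is called significant when fragments map cleanly but not when assignment uncertainty is large. That is the behavior the quoted description implies: the expected variability in each condition determines whether an observed change counts as statistically significant.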
The developers expect the new algorithm, which is available in the Cufflinks software package (BI 5/21/2010), to gain significant traction in the research community as a result of the increasing use of RNA-seq technologies for differential expression.
“RNA-seq is becoming a really widely used assay … partly because it just provides finer resolution than microarrays,” and also because it is now cost-competitive with arrays, Pachter told BioInform.
Currently “there are many, many studies where expression is a good proxy for the condition or state of a cell” and with Cuffdiff 2, researchers now have a tool to “assess whether expression has changed not just in genes but for individual transcripts to understand better how both transcription and post-transcriptional regulation is affecting your system,” he said.
The Nature Biotech paper notes that some approaches used to measure isoform-level expression “ignore the variability across biological replicates, leading to over-prediction of differentially abundant transcripts and high false-positive rates.”
Meanwhile, methods that try to “control for variability in gene expression across replicates” have focused “mainly on controlling for variability in the raw read data, but they miss key aspects of accurately transforming reads into gene expression values.”
In addition, alternative splicing and repetitive regions “introduce uncertainty into gene expression measurements,” and if that isn’t controlled for, “this uncertainty can introduce errors during differential analysis.”
Cuffdiff 2 improves upon the first version of Cuffdiff, which could perform differential analysis at the isoform level but “[did not] take into account the variability you could assess from doing multiple replicates of each condition,” information that Pachter described as “really crucial.”
In the Nature Biotech paper, the researchers describe how they used Cuffdiff 2 to assess the gene expression response to the knockdown of HOXA1, a transcription factor that is associated with cell development, in adult lung fibroblasts. They chose this particular gene because it is expressed in adult cells, where its function is not yet known.
They found that knocking down the gene “perturbs the expression of thousands of genes, alters the isoform selection of key cell cycle regulators and causes disruption of the cell cycle leading to cell death,” the authors note, adding that “further experiments will be required to determine the nature and mechanism of the disruption and to identify the direct targets of HOXA1.”