By Monica Heger
RNA-seq experiments can yield variable results even when the exact same protocol is followed for a replicate of the same sample, researchers from the University of Florida, the University of Southern California, and the Mayo Clinic have found.
Reporting this month in BMC Genomics, the team found that technical variability is "too high to ignore," and results in inconsistent detection of exons at coverage lower than five reads per nucleotide. Even when coverage is high, however, the estimate of the relative abundance of a particular transcript can "substantially disagree" between sequencing experiments, the researchers found.
"It's something you need to take into account when designing RNA-seq experiments," said Lauren McIntyre, lead author of the paper and a professor of molecular genetics and microbiology at the University of Florida.
McIntyre said she decided to test how much variability there is in RNA-seq experiments after RNA-seq data from her lab seemed to suggest that there might be more technical variability than previously thought.
She and her team ran three separate RNA-seq experiments on the Illumina Genome Analyzer. In the first experiment, the researchers evaluated three different samples from the heads of Drosophila melanogaster. For each sample, they ran the same sequencing library on two lanes of a flow cell. In the second experiment, the researchers followed the same protocol but used samples from the heads of male D. simulans. And in the third experiment, the researchers ran one sample of a D. melanogaster cell line on five lanes. For each experiment, the researchers used a paired-end sequencing approach with 36-base reads.
In general, the researchers found that the lower the coverage for a particular exon, the more variability there was between technical replicates. However, even with higher coverage, there was still variability. On average, for each pair of replicates, around 36,000 to 49,000 exons were shared. However, several thousand exons were found in one replicate but not the other. The comparison with the least discrepancy had around 3,600 exons missing in one of the two technical replicates, but many had more than 5,000 missing and several had nearly 7,000 exons missing.
Discrepancies in detection were largely due to differences in coverage, with exons detected fewer than five times being the most likely to show discrepancy between samples. However, because differences were detected even in well-covered exons, coverage was not the only reason for the variability.
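To make the comparison concrete, here is a minimal sketch of how detection might be tallied between two technical replicates. It is not the authors' code: the function name, the coverage values, and the choice to classify an exon by the higher of its two coverages are all illustrative assumptions. Only the five-reads-per-nucleotide cutoff comes from the study as reported.

```python
def detection_summary(cov_a, cov_b, threshold=5.0):
    """Classify exons by detection in two technical replicates.

    cov_a, cov_b: dicts mapping exon id -> average reads per nucleotide.
    An exon counts as detected when its coverage is above zero.
    Exons seen in only one replicate are split by whether the coverage
    in that replicate is below the threshold (hypothetical rule).
    """
    summary = {"shared": 0, "one_only_low": 0, "one_only_high": 0}
    for exon in set(cov_a) | set(cov_b):
        a = cov_a.get(exon, 0.0)
        b = cov_b.get(exon, 0.0)
        if a > 0 and b > 0:
            summary["shared"] += 1
        elif a > 0 or b > 0:
            key = "one_only_low" if max(a, b) < threshold else "one_only_high"
            summary[key] += 1
    return summary


# Made-up coverage values, for illustration only.
rep1 = {"exon1": 12.0, "exon2": 0.8, "exon3": 0.0, "exon4": 6.5}
rep2 = {"exon1": 10.5, "exon2": 0.0, "exon3": 1.2, "exon4": 0.0}
summary = detection_summary(rep1, rep2)
print(summary)
```

In this toy input, an exon such as exon4, which is well covered in one replicate and absent in the other, is the kind of case that coverage alone would not explain.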
Library prep is typically cited as a step that introduces error in sequencing experiments, but the researchers ran the same sequencing library for each set of technical replicates, which ruled out that step as a cause of variability in the study.
Likewise, they eliminated normalization because they found that the results were "virtually identical" regardless of the normalization method they used. They also found the normalization constant to be "close to 1 among technical replicates, indicating that these results cannot be explained by improper normalization."
One factor that may be responsible for the discrepancy is the dilution step in the RNA-seq protocol. Uneven distribution of the molecules in the library at the dilution step could lead to differences in what molecules are loaded onto the sequencing lane, for instance. Additionally, the authors of the study note that RNA molecules form intra- and intermolecular interactions, which could cause clustering and lead to a non-random distribution of molecules onto the sequencing lane.
Sampling is another contributing factor, said McIntyre. One sequencing lane might generate around 30 million reads, which represents less than 1 percent of the starting library, she said. Because of random sampling, some variability is expected in the measured abundance of every transcript.
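The sampling effect McIntyre describes can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not a model of the study: the library below is hypothetical, with 500 rare transcripts expected to yield about one read per lane and 500 abundant transcripts expected to yield about 50, and each read is drawn independently in proportion to abundance.

```python
import random
from collections import Counter


def sequence_lane(weights, n_reads, rng):
    """Draw n_reads reads; transcript i is chosen with probability
    proportional to weights[i]. Returns read counts per transcript."""
    ids = range(len(weights))
    return Counter(rng.choices(ids, weights=weights, k=n_reads))


def discordant(counts_a, counts_b, ids):
    """Count transcripts with at least one read in exactly one of two lanes."""
    return sum((counts_a[i] > 0) != (counts_b[i] > 0) for i in ids)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical library: relative abundance 1 for rare, 50 for abundant.
weights = [1] * 500 + [50] * 500
n_reads = sum(weights)  # gives roughly 1 and 50 expected reads per lane

lane1 = sequence_lane(weights, n_reads, rng)
lane2 = sequence_lane(weights, n_reads, rng)

rare_discordant = discordant(lane1, lane2, range(500))
abundant_discordant = discordant(lane1, lane2, range(500, 1000))
print("rare transcripts seen in only one lane:", rare_discordant)
print("abundant transcripts seen in only one lane:", abundant_discordant)
```

Even though both simulated lanes draw from the identical library, many of the rare transcripts show up in one lane and not the other, while the abundant ones are detected in both. Sampling alone reproduces that part of the pattern; it does not by itself account for the disagreement the researchers saw in well-covered exons.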
This type of variability is likely to remain, even as the technology improves.
"The issue is not a problem with the technology," said McIntyre. "We're taking a sample, and it's a small fraction of the total, so even as the technology improves, it will still be a small sample of the total."
Additionally, it is possible that there are differences between the sequencing lanes themselves. One way to account for variation between sequencing lanes is to multiplex samples, said McIntyre. For example, if a researcher has eight samples, the samples can be multiplexed and run in multiple lanes, as opposed to running one sample in each lane, in order to account for variation among the lanes.
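The difference between the two layouts can be sketched in a few lines. The sample names, lane count, and function names here are hypothetical; the point is only that when each sample occupies its own lane, no two samples ever share a lane, so a lane effect cannot be separated from a sample effect.

```python
from itertools import combinations


def one_sample_per_lane(samples):
    """Each sample gets its own lane (lanes numbered from 1)."""
    return {lane: [s] for lane, s in enumerate(samples, start=1)}


def multiplexed(samples, n_lanes):
    """Every barcoded sample is pooled and run in every lane."""
    return {lane: list(samples) for lane in range(1, n_lanes + 1)}


def pairs_never_sharing_a_lane(design):
    """Return sample pairs that are never sequenced in the same lane.
    For such pairs, lane differences and sample differences are confounded."""
    samples = sorted({s for lane_samples in design.values() for s in lane_samples})
    return [
        (a, b)
        for a, b in combinations(samples, 2)
        if not any(a in ls and b in ls for ls in design.values())
    ]


samples = [f"sample{i}" for i in range(1, 9)]  # eight hypothetical samples
separate = one_sample_per_lane(samples)
pooled = multiplexed(samples, n_lanes=8)

print(len(pairs_never_sharing_a_lane(separate)))  # every pair is confounded
print(len(pairs_never_sharing_a_lane(pooled)))    # no pair is confounded
```

In the pooled layout each sample receives only a fraction of every lane's reads but is measured across all lanes, which is the trade-off behind the multiplexing suggestion.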
The authors conclude that technical variation "cannot be ignored and should be accounted for in the study design," though they note that "the optimal experimental design strategy will depend on the objectives of the study."
Although the authors ruled out normalization as a factor in the variability, Xiangqin Cui, a biostatistician at the University of Alabama at Birmingham, said that the study suggested that better normalization algorithms are needed. The fact that there is variation between technical replicates is not too surprising, she said, but it does offer evidence that could serve as a jumping-off point for developing better normalization methods.
For instance, there have been a number of studies on how to normalize data from microarrays, Cui said, and now there is data suggesting that researchers need similar normalization methods for RNA-seq studies.
The paper suggests that "technical replicates can be used as a criterion to develop better normalization methods that are appropriate for RNA-seq," she said.
Have topics you'd like to see covered by In Sequence? Contact the editor at mheger [at] genomeweb [.] com.