As the study of biology becomes more quantitative, statistics become increasingly important. Measuring confidence of our quantitative results can often be done using statistical methods, whether we are focused on our favorite pathway or performing genome-scale experiments. For biologists who are (or once were) not so statistics-savvy, it's easy to think that statistics will provide us with answers to questions that are much too messy otherwise. We've invariably been encouraged to include p-values in presentations and publications. Through classes, colleagues, or simply by reading and exploring, we can start to get a handle on the selection of statistical tests to address different questions, and the software needed to run them.
Statistics software is becoming easier to use, so we don't need a statistician to run many of our analyses. Then all we need to do is interpret the results, right? How hard can that be? Whereas all steps of a quantitative analysis are more complicated than they initially appear, in our experience it is this final interpretation step that is most difficult for many researchers. This is especially pronounced for genome-scale experiments where we have so much data that we cannot possibly look at all of it in a meaningful way. Here we discuss some of our experiences applying statistics across a wide range of biology experiments, and the limits of what we can learn from them.
Most experiments start with a plan — a design describing how we are going to do every task. This is a key step, since the design will determine what questions we can ask and how rigorously we can answer them. Besides having a design that minimizes any biases or confounding variables, we will want to be thinking about sample size — how many biological replicates we need to address our questions. This also has a large impact on how strong a conclusion we can make about our experiment.
No matter the type of experiment, it's worth looking at sample size calculations, which are available in all statistics programs and even as Web tools. If nothing else, these formulas can give us a good idea of the relationship between alpha and beta (the p-value threshold and the false-negative rate, where 1 - beta is the power to find a true difference), the smallest difference we want to identify, the variability of whatever we're measuring, and the minimum sample size. On the other hand, the formulas can be abused and shown to support almost any sample size. We've also been told that some modern technologies like high-throughput sequencing are so accurate that they require no replicates; it's like having a magical perfect ruler. We could use this ruler to unequivocally show that Katie is taller than Tom, but it'd be hard to cruise through our discussion and generalize that women are taller than men. Unless we use biological replication (not just measuring the same item multiple times), our conclusions may be limited to the specific samples we've studied, rather than a broader, more interesting phenomenon. Similarly, we've been told that some modern technologies have no batch effects, allowing us to bypass conventional experimental design recommendations. Sometime later we shouldn't be surprised, however, to find a study detailing some of these supposedly non-existent shortcomings.
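To make the relationship among these quantities concrete, here is a minimal sketch of one common textbook approximation for comparing the means of two groups, assuming roughly normal data with equal variance in both groups. The function name and parameters are ours for illustration; real statistics packages typically use more exact methods (for example, based on the t distribution) and may give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate replicates per group for a two-sided, two-group comparison.

    delta: smallest difference in means we want to detect
    sigma: standard deviation of the measurement (assumed equal in both groups)
    alpha: p-value threshold
    power: 1 - beta, the chance of detecting a true difference of size delta
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


# Halving the difference we want to detect roughly quadruples the sample size.
print(samples_per_group(delta=1.0, sigma=1.0))  # 16
print(samples_per_group(delta=0.5, sigma=1.0))  # 63
```

Because the result depends so strongly on our guesses for delta and sigma, it is easy to see how the same formula can be made to justify almost any sample size.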
After collecting our data, we are ready for analysis. But how, exactly, do we extract biological discoveries from a bunch of numbers? Are all of the data interesting, or are the outliers more revealing? When in doubt, we draw figures that show all the data. Of course, we need to use the best possible controls and statistical tests, especially if we're mining others' data to address questions that weren't the original aim of their experiments. However, we may have to make do with what we have, acknowledging that our conclusions could have alternative interpretations. The last thing we want to do is perform 20 variations of an analysis, find one method that provides a p-value less than 0.05 and then conveniently ignore that we had looked at 19 other comparisons. For hypothesis-driven experiments, we're encouraged to select thresholds before performing the experiment, but for exploratory projects, choosing thresholds in advance is hard to do.
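A quick back-of-the-envelope calculation shows why the 20-variations scenario is a problem. The sketch below assumes the 20 tests are independent and that there is no real effect in any of them; variations of one analysis on the same data are usually correlated, so the true figure will differ, but the direction of the problem is the same. The function names are ours for illustration.

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests gives p < alpha."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the overall false-positive rate near alpha."""
    return alpha / n_tests


print(round(prob_any_false_positive(20), 2))  # 0.64
print(bonferroni_threshold(20))               # 0.0025
```

Under these assumptions, trying 20 analyses gives us roughly a two-in-three chance of seeing at least one p-value below 0.05 purely by chance.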
Why are we using statistics in the first place? We may have quite different answers to this question, but many of them may indicate our need to help translate numbers into conclusions of biological significance. For sure, this is what we'd like statistics to do, but often it's just wishful thinking. Our group recently tried, for example, to find the best statistical methods to identify differentially expressed genes from RNA-seq experiments, using data generated by the Microarray Quality Control project. We soon ran into a problem: What does "differentially expressed" really mean?
This brings up perhaps the biggest limitation of statistics: Statistical significance and biological relevance can be quite different. Even if we choose a valid statistical test and correct for multiple hypothesis testing, we need to choose a reasonable p-value threshold and perhaps an effect size. Should we, for example, filter with our p-value threshold and then sort by fold change, or use these metrics in some other way? How can we choose the most biologically relevant set of differentially expressed genes? Statistics can help, but cannot do it alone.
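As a small illustration of how much the choice of ranking matters, the sketch below applies two of the strategies mentioned above to the same list. The gene names and numbers are made up purely for illustration, and the 0.05 threshold is an arbitrary assumption.

```python
# Made-up values for illustration only:
# (gene, adjusted p-value, log2 fold change)
results = [
    ("geneA", 1e-8, 0.2),
    ("geneB", 0.03, 3.1),
    ("geneC", 0.20, 4.0),
    ("geneD", 1e-4, -1.5),
]


def filter_then_sort_by_effect(rows, p_max=0.05):
    """Keep rows with p < p_max, then rank by absolute fold change."""
    kept = [r for r in rows if r[1] < p_max]
    return sorted(kept, key=lambda r: abs(r[2]), reverse=True)


def filter_then_sort_by_p(rows, p_max=0.05):
    """Keep rows with p < p_max, then rank by p-value alone."""
    kept = [r for r in rows if r[1] < p_max]
    return sorted(kept, key=lambda r: r[1])


print([r[0] for r in filter_then_sort_by_effect(results)])  # ['geneB', 'geneD', 'geneA']
print([r[0] for r in filter_then_sort_by_p(results)])       # ['geneA', 'geneD', 'geneB']
```

The two rankings are exact opposites here, and the gene with the largest change (geneC) is dropped by both. Neither ordering is "correct"; which one is more useful depends on the biological question.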
True or false?
For many studies, trying to identify optimal statistical thresholds ends up being an important and nontrivial step. One problem is that we would like our questions to have clear true-or-false answers, whereas much of what we analyze is in reality a continuum. Even if we collect extremely accurate measurements, with minimal biological variation, separating outcomes would be a difficult task, one prone to a high prevalence of false positives and negatives. For one thing, we probably want to carefully weigh the relative penalty of false positives versus false negatives, which is largely a function of why we are doing the experiment in the first place. On top of that, we often have to consider the level of biological variation and sample sizes that are too small to capture the accurate extent of this variation. In contrast, the outcome of count-based statistics (such as Fisher's exact test), when used on high-throughput sequencing, can appear to have amazing confidence, with p-values less than 10^-100 even for experiments performed just in duplicate.
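To see how count-based tests reach such extreme p-values, here is a minimal sketch of a one-sided Fisher's exact test computed in log space so that tiny probabilities do not underflow. The read counts are invented for illustration, and the function names are ours. Note that the test only sees the counts: it has no information about biological variation between replicates.

```python
from math import exp, lgamma, log, log10


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_less_log10(a, b, c, d):
    """log10 of the one-sided Fisher p-value, P(X <= a), for the 2x2 table
        [[a, b],
         [c, d]]
    where a and c are reads for one gene in samples 1 and 2,
    and b and d are all other reads in those samples.
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_denom = log_comb(row1 + row2, col1)
    smallest = max(0, col1 - row2)
    # Hypergeometric log-probabilities for every table at least as extreme.
    logs = [
        log_comb(row1, x) + log_comb(row2, col1 - x) - log_denom
        for x in range(smallest, a + 1)
    ]
    peak = max(logs)
    log_p = peak + log(sum(exp(v - peak) for v in logs))  # log-sum-exp
    return log_p / log(10)


# Invented counts: 1,000 vs. 2,000 reads for one gene, out of 1,000,000 each.
print(fisher_less_log10(1000, 999000, 2000, 998000))
# Sanity check on a tiny table, where the exact answer is 101/184756.
print(10 ** fisher_less_log10(1, 9, 9, 1))
```

With a single sample per condition, a twofold difference in these invented counts yields a p-value many orders of magnitude below any conventional threshold, even though we have learned nothing about how much the gene varies from one biological sample to the next.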
Something seems strange here. "But the statistics say so" may not be a very convincing reason. In our experience, even though a p-value has a very specific meaning, we may have trouble taking it literally. Blindly selecting a p-value threshold of 0.05, even after correction for multiple tests, can lead us to conclusions that contradict our biological understanding. Perhaps our biological understanding is wrong, or perhaps our statistics are not accurately representing the system. Looking at confidence intervals may help, since a metric that consistently exhibits a very small change may be statistically significant, but not so biologically interesting.
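The point about confidence intervals can be illustrated with a minimal sketch using a normal approximation. The summary numbers below are invented: a very small average change measured very consistently across many observations.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci(mean, sd, n, confidence=0.95):
    """Normal-approximation confidence interval for a mean.

    mean, sd: observed mean and standard deviation of the change
    n: number of independent observations
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sd / sqrt(n)
    return mean - half_width, mean + half_width


# Invented summary: mean log2 change of 0.02, sd 0.1, 1,000 observations.
low, high = mean_ci(mean=0.02, sd=0.1, n=1000)
print(f"95% CI: {low:.4f} to {high:.4f}")
```

The interval excludes zero, so the change is statistically significant, but its upper end is still under 0.03 log2 units (about a 2% change), which may be too small to matter biologically.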
As computational biologists, we are ready to preach about the importance of statistics as a tool to help make the most of a biomedical experiment. Choosing an optimal experimental design and statistical approach are important steps that allow us to best address our biological questions. We may not need a statistician to analyze our data, but many of us can use their input to help us clarify our design, methods, and what the statistics are really telling us. During experimental analysis and design, and also during our statistical interpretations, we need to be critical of the details and results, continuing to ask ourselves if they both make sense in the context of our biological system.
Fran Lewitter, PhD, is director of bioinformatics and research computing at Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a senior bioinformatics scientist in Fran's group.