When most biologists talk about statistics, we think about the analysis of quantitative data after an experiment is finished. Similarly, when we present an experiment in a lab meeting, it's usually already in progress or complete. At that point we often get good input from other lab members, but it may come too late: if they find a flaw in our experimental design or data collection, we may have to start all over again to get it right. This is consistent with the unfortunate observation that far more of the statistical literature concerns analysis than design. If our experimental design is less than optimal, however, we may not be able to answer all the questions we'd like or, in the worst case, may have to throw away a lot of hard work and money. What can we do so this doesn't happen to us?
In order to best help others with their statistical analyses, we ask lots of questions about how an experiment was performed, and it quickly becomes clear where a design has shortcomings. To prevent problems before they start, we encourage our scientists to talk to a friendly neighborhood statistician as early as possible, while an experiment is still in the planning stages. Simply making detailed experimental design a priority can resolve a lot of issues, as others, whether fellow biologists or statisticians, can often identify weaknesses before any bench work begins. Once upon a time, one of our lab members was involved in a complex study of rat muscle, in which each rat was injected with a drug in one leg and with saline in the other leg as a matched control. To reduce any chance of mix-ups, the drug was always injected into the right leg. When it was time to analyze the data, a statistician immediately identified a problem: what if most rats are left- or right-legged? We'd have no way of separating the effect of the drug from natural differences between legs. Any living "happily ever after" was tainted by this possibility.
Common causes of weaknesses in experimental design are confounding variables, including batch effects. We've known since high school science that we're supposed to control for everything we aren't varying on purpose and, similarly, to balance things we can't control. In the rat leg example, if the treatments had alternated between sides, we wouldn't need another experiment just to show that right and left legs are similar. If we're using two-color microarrays, for example, we'd want to do dye swaps in case some DNA behaves differently depending on whether it's tagged with red or green dye. Similarly, if we're using arrays that allow multiple hybridizations on the same slide, it may be tempting to run some slides with all treatment A samples and others with all treatment B. But if the slides behave differently, we won't know whether a difference is due to interesting treatment effects or boring slide effects. Often such subtleties of experimental design aren't even reported in the published methods, so possible confounding variables might not be apparent to people reading the results.
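The balancing idea from the rat example can be made concrete in a few lines: instead of always injecting the drug into the right leg, randomize which leg gets the drug while keeping the two sides balanced across animals. Here is a minimal sketch in Python; the rat count and labels are illustrative, not from the original study.

```python
import random

def assign_sides(n_rats, seed=0):
    """Assign the drug-injected leg for each rat, balanced and randomized.

    Half the rats get the drug in the left leg and half in the right,
    so a systematic left/right difference can't masquerade as a drug effect.
    """
    assert n_rats % 2 == 0, "use an even number of rats for perfect balance"
    sides = ["left"] * (n_rats // 2) + ["right"] * (n_rats // 2)
    random.Random(seed).shuffle(sides)  # fixed seed keeps the plan reproducible
    return {f"rat_{i + 1}": side for i, side in enumerate(sides)}

# Drug side per rat; the other leg of each rat gets the saline control.
plan = assign_sides(8)
```

Recording the assignment plan up front, before any bench work, also gives reviewers exactly the design detail that is so often missing from published methods.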
The type of sample replication, if not chosen carefully, can also limit the generality of our conclusions. Most statistical tests require each measurement to be independent, a word with a whole range of meanings. In most biology experiments, we're most interested in biological variability between individuals, which is generally greater than the technical variability due to processing and measurement. As a result, we want to start our experiment with separate samples that are then processed independently through the rest of the project. If we're studying animals, do independent animals need to be in separate tanks or cages? If we're studying cell lines, can they all come from the same stock, or should we get them from different suppliers? However we address these details, they should be reported, in case they end up influencing the outcome. If we're using microarrays with replicate spots, how do we process these during our analysis? If the spotted oligonucleotides are identical, we can summarize them before doing any inferential statistics (as with technical replicates in general).
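The rule of summarizing technical replicates before inference can be sketched as follows: average the replicate spots within each array first, then run the test across biological replicates only, so the sample size reflects individuals rather than spots. The numbers below are made up for illustration.

```python
from math import sqrt
from statistics import mean

# Replicate spots for one gene: each inner list is one array (one biological
# sample); the values are the technical replicate spots on that array.
treated = [[5.1, 5.3, 5.2], [6.0, 5.8, 5.9], [5.5, 5.6, 5.4]]
control = [[4.0, 4.2, 4.1], [3.9, 4.1, 4.0], [4.3, 4.4, 4.2]]

# Step 1: collapse technical replicates to one value per biological sample.
treated_means = [mean(spots) for spots in treated]  # one value per array
control_means = [mean(spots) for spots in control]

# Step 2: inferential statistics on the biological replicates only
# (a plain Welch two-sample t statistic; n = 3 per group, not 9).
def t_statistic(a, b):
    ma, mb = mean(a), mean(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / sqrt(va / len(a) + vb / len(b))

t = t_statistic(treated_means, control_means)
```

Treating all nine spots per group as independent measurements would inflate the apparent sample size and make the p-value look far better than the biology justifies.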
If we have only technical replicates, we can still do statistics, but the p-values will reflect only technical variability. In some cases, such as early development, we can't get enough material from a single individual, so samples have to be pooled. Measurements of pooled embryos, for example, can be analyzed in much the same way as those from individual embryos, except that any effects we observe need to be interpreted in terms of these pools, and we'll never know how much the measurement varies from embryo to embryo.
Once we've figured out exactly how to process our biological replicates, an important design consideration is how much replication to use. More is better, of course, but we don't want to waste resources that would be better used elsewhere. If we have some information about variability and know how small a difference we want to detect, we can do a power calculation to estimate the sample size we need. For large-scale experiments, we'll probably need a multiple-hypothesis correction (such as controlling the false discovery rate), so we'll need a larger sample size to get any adjusted p-values below our threshold. In vivo experiments on human or animal subjects usually require larger sample sizes than similar in vitro studies of cell populations, which as a group are more homogeneous. On the other hand, experiments with cell-based assays, such as assaying FACS-sorted cells, can involve huge numbers of replicates and produce extremely small p-values. For example, what if knocking out our gene leads to smaller cells (p = 1e-50), but they're only 1 percent smaller? We'd like statistical significance to reflect biological significance, but does it in this case? If another experiment is all about comparing mutant to control mice, is it best to raise more mutant than control mice? Despite what our intuition might say, statisticians tell us the answer is no: we should try to have the same number of individuals in each group. Studying more mutants would help us better characterize mutant mice, but that wouldn't be of much help if we didn't know normal mice equally well.
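The power calculation mentioned above can be sketched with the standard normal-approximation formula for a two-sample, two-sided comparison: n per group is roughly 2(z_{1-alpha/2} + z_{power})^2 (sigma/delta)^2, where delta is the smallest difference worth detecting and sigma the estimated standard deviation. A minimal, stdlib-only Python version follows; the effect sizes and thresholds are illustrative.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sample, two-sided test.

    Uses the normal-approximation formula
        n = 2 * (z_{1 - alpha/2} + z_{power})^2 * (sigma / delta)^2
    where delta is the smallest difference worth detecting and sigma
    the estimated standard deviation of the measurement.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * (sigma / delta) ** 2)

# Detecting a difference of one standard deviation (effect size 1) at
# alpha = 0.05 with 80% power needs about 16 individuals per group:
print(n_per_group(delta=1.0, sigma=1.0))               # -> 16

# A stricter, multiple-testing-adjusted threshold raises the requirement:
print(n_per_group(delta=1.0, sigma=1.0, alpha=0.001))  # -> 35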
Finally, let's say we're interested in studying a biological process, such as the differentiation of stem cells, on a genome scale. We collect cells at several time points and gather information about RNA abundance, transcription factor binding, or DNA methylation for lots of genes. How are we going to analyze all of this data? It seems very useful to have this detailed information, but a complex design generally requires a correspondingly complex analysis. With so many subsets of data to compare, the analysis can quickly become so complicated that it's hard to turn into a biological story. If we start with a vague goal like "identify the genes that do something interesting," we'd better settle on an operational definition of "interesting," and do it sooner rather than later. The last thing we want is to have to ignore a large piece of our hard-earned data just because we have too much to make sense of. Living things are complex, but the simplest design that can answer our questions may be the best.
As we've seen, a lot of experimental design issues can fit into one of several categories: confounding variables, batch effects, types of replication, sample size, and complexity of design. Designing an experiment that answers our question may be easy, but what if our goal is to design the best possible experiment? Forming a good understanding of biological variability, while reducing technical noise and confounding variables, is often not so easy to do.
Talking through the details with a friendly statistician, together with our labmates, can help us design a much better, more informative experiment, rather than running to our friends afterward for help making sense of a big mess. With luck, we can optimize the details of our experiment so that when we get to the analysis, we'll learn as much as we can from all our work.
Fran Lewitter, PhD, is director of bioinformatics and research computing at Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a senior bioinformatics scientist in Fran's group.