Robert Nadon says gene expression analysis leaves too much to chance
Robert Nadon is director of informatics at Imaging Research in Ontario. He heads the team of scientists and programmers who developed ArrayStat, a quality control and statistical inference software package for gene expression arrays. [email protected]
Gene expression research is changing from an observational to a statistically oriented science. This shift is made necessary by the massive amounts of data generated by array technologies. Corresponding requirements imposed by journal editors and granting agencies can be daunting, however, especially for scientists who may have had only minimal training in statistics. To make matters worse, limited budgets conflict with statisticians’ calls for large numbers of replicated measurements.
The shift toward statistics stems from discomfort with current heuristics for making inferences about gene expression. These heuristics can be straightforward (twofold changes in expression of cDNAs are trustworthy) or more complex (for a sufficient number of “perfect match” minus “mismatch” oligonucleotide pairs from the same gene, the difference exceeds a pre-defined threshold). In either case, lack of a formal statistical framework threatens the validity of both differential expression judgments and downstream datamining, ultimately obstructing efforts to transform array data into biological knowledge.
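The twofold-change rule of thumb can be sketched as a simple filter. This is a hypothetical illustration only (the gene names and expression values are invented, and no particular platform's method is implied):

```python
# Hypothetical log2 expression values, (control, treatment), one
# measurement each -- as in early, unreplicated array experiments.
expression = {
    "geneA": (8.0, 9.2),
    "geneB": (5.0, 5.4),
    "geneC": (10.0, 8.7),
}

def twofold_changed(control_log2, treatment_log2):
    """Heuristic: flag a gene when expression changes at least twofold,
    i.e. the absolute log2 ratio is >= 1. Note that no measurement
    error enters this decision -- the rule is purely a threshold."""
    return abs(treatment_log2 - control_log2) >= 1.0

flagged = [g for g, (c, t) in expression.items() if twofold_changed(c, t)]
print(flagged)  # geneA (up 1.2 log2 units) and geneC (down 1.3) pass
```

The point of the sketch is what is missing: nothing in the rule asks whether a 1.2-unit change could plausibly arise from measurement noise alone.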
Research practice is changing as a consequence in at least two ways. (1) Exploratory datamining output is no longer accepted at face value. Instead, cross-validation procedures are increasingly being used to confirm initial results. (2) Bread-and-butter quality control issues are being addressed. These include formal estimation of random and systematic measurement errors, data reproducibility, and statistical tests of differential expression. Statistically based inferences are replacing the old rules of thumb.
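A statistically based test of differential expression can be sketched with a two-sample t statistic on replicated measurements. This minimal example uses invented replicate values and Welch's t statistic; it is one of many possible test choices, not a prescribed method:

```python
import math
from statistics import mean, variance

def welch_t(sample1, sample2):
    """Welch's two-sample t statistic: the difference in group means
    scaled by the combined standard error of the two groups."""
    n1, n2 = len(sample1), len(sample2)
    se = math.sqrt(variance(sample1) / n1 + variance(sample2) / n2)
    return (mean(sample1) - mean(sample2)) / se

# Hypothetical log2 expression values from replicated arrays.
control = [8.1, 7.9, 8.0, 8.2]
treated = [9.0, 9.3, 9.1, 8.9]

t = welch_t(treated, control)
print(round(t, 2))  # a large |t| means the mean difference dwarfs the noise
```

Unlike the twofold rule, this inference uses the replicate-to-replicate variability: the same mean difference would be judged untrustworthy if the replicates were noisy, which is precisely why statisticians press for replicated measurements.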
Some think these developments divide scientists. Because they are informed by biological knowledge, those who set their sights on biologically significant findings are said to “know” when results are meaningful. By contrast, those who are concerned with measurement precision and accuracy are said to rely on mere statistical tests. This is surely a false dichotomy. Although statistical significance does not necessarily imply substantive (clinical, biological, etc.) significance, it is normally a minimal requirement for an inference of scientific meaningfulness. If “chance” cannot be ruled out as an explanation for a putative effect, then biological meaningfulness is moot.
Contrasting analytical strategies (qualitative vs. quantitative; exploratory datamining vs. inferential hypothesis testing) reflect historical debates within science as a whole. Which approach is most appropriate depends on the data, the questions, and crucially on the relative importance of false-positive and false-negative errors. The issues are similar to those faced by medical diagnosticians. Initial screening tests for presence of a life-threatening disease minimize false negatives at all costs. Minimizing the risk of false positives looms larger in subsequent tests as invasive interventions are contemplated.
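The screening tradeoff above can be illustrated with a decision threshold on a hypothetical diagnostic score: lowering the threshold catches more true cases (fewer false negatives) at the cost of more false alarms (more false positives). All scores and disease labels here are invented:

```python
# Hypothetical (score, truly_diseased) pairs from a diagnostic test.
cases = [(0.2, False), (0.4, False), (0.5, True),
         (0.6, False), (0.7, True), (0.9, True)]

def error_counts(threshold):
    """Count false positives and false negatives when calling a case
    'diseased' whenever its score is >= threshold."""
    fp = sum(1 for score, diseased in cases if score >= threshold and not diseased)
    fn = sum(1 for score, diseased in cases if score < threshold and diseased)
    return fp, fn

# A permissive screening threshold misses no true case but raises a
# false alarm; a strict confirmatory threshold does the reverse.
print(error_counts(0.45))  # (1, 0): fewer false negatives
print(error_counts(0.85))  # (0, 2): fewer false positives
```

No threshold eliminates both error types at once; which error matters more is a property of the context, not of the data, which is the article's point about array analysis as well.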
Similarly, how array data should be analyzed depends on context. As with medical diagnosis, however, decisions are most informed when based on probability models rather than on intuition.
Conclusions based on gene expression array data are now appropriately facing the same scrutiny as conclusions in other sciences. Moreover, lessons learned in array genomics will benefit other fields with similar data structures (large numbers of experiments with few replicates) such as high-throughput screening and proteomics. This is all to the good and will accelerate the pace of discovery in these promising but nascent fields of inquiry.