Tom Downey is president of Partek, a provider of statistical analysis and interactive visualization software. In addition to his research, he consults for biotech and pharma customers on design and analysis of genomic and proteomic experiments.
I still read publications resulting from microarray studies with a great deal of skepticism. This is partly due to the tricky nature of the data itself, and partly due to the widespread lack of proper statistical methodology used to design experiments, quantify spots, and analyze and interpret the resulting data.
Because of the large number of genes in the human genome and the relatively small number of samples in a typical microarray experiment, it is all too easy to find seemingly significant patterns that aren’t really there. It’s like staring at stars in the night sky: with enough desire and imagination, one can find just about any pattern one wants to see. This is what makes analysis tricky: once we find a pattern (a cluster, say, or a differentially expressed gene, or a set of genes that can classify tissue types), we must take the further step of showing that the pattern is not due to chance alone.
The majority of researchers in the field are only partially aware of this “false discovery” problem. Most researchers, for example, now know to apply a multiple-test correction to the p-values from statistical tests performed on thousands of genes. This adjustment effectively “raises the bar” on the quality and quantity of evidence required to conclude that a discovered pattern is not due to chance. But statistical testing is not the only place false discoveries arise in microarray analysis. They also occur when we think we’ve found a cluster, or when we think we’ve discovered a set of genes that can form the basis of a diagnostic or prognostic model.
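To see why the correction matters, here is a small sketch in Python (the gene count, seed, and cutoff are my own choices, not from any particular study). On 10,000 genes with no real effects, an uncorrected 0.05 cutoff “discovers” hundreds of genes by chance, while the Benjamini-Hochberg false-discovery-rate procedure typically reports none:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate p-values for 10,000 genes with NO real effects:
# under the null hypothesis, p-values are uniform on [0, 1].
p = rng.uniform(size=10_000)

# Without correction, a 0.05 cutoff flags roughly 500 genes by chance alone.
naive_hits = int(np.sum(p < 0.05))

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: controls the false discovery rate."""
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    # Find the largest k with p_(k) <= (k/m) * alpha; reject hypotheses 1..k.
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        significant[order[: k + 1]] = True
    return significant

corrected_hits = int(np.sum(benjamini_hochberg(p)))

print(f"uncorrected discoveries: {naive_hits}")    # hundreds of false positives
print(f"BH-corrected discoveries: {corrected_hits}")  # typically none
```

The correction does not make the data any better; it simply demands evidence strong enough to survive 10,000 simultaneous looks.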
One common mistake is incorrect gene selection followed by an estimate of prediction accuracy using cross-validation. Many researchers don’t realize that they have used their test data during gene selection, which invalidates the subsequent cross-validation. The result is dangerously over-optimistic claims about the ability to automate diagnosis or predict prognosis from gene expression. A similar example of flawed analysis goes something like this: “We removed all genes that were not significantly differentially regulated and then clustered the remaining genes. The result is that the clusters are separated by tissue type.”
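The leak described above is easy to demonstrate. Below is a minimal Python sketch (pure NumPy; the nearest-centroid classifier, the t-like gene score, and all sizes are invented here for illustration). On a matrix of pure noise, selecting genes once on the full data set before leave-one-out cross-validation yields accuracy far above chance, while repeating the selection inside each fold brings it back toward 50%:

```python
import numpy as np

rng = np.random.default_rng(1)

n_samples, n_genes = 40, 5000
X = rng.normal(size=(n_samples, n_genes))   # pure noise "expression" matrix
y = np.array([0] * 20 + [1] * 20)           # two arbitrary "tissue types"

def select_top_genes(X, y, k=20):
    # Rank genes by absolute difference in class means (a crude t-like score).
    diff = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(diff)[-k:]

def loocv_accuracy(X, y, preselected=None):
    correct = 0
    for i in range(len(y)):
        train = np.ones(len(y), dtype=bool)
        train[i] = False
        # Correct protocol: select genes inside the loop, on training data only.
        if preselected is None:
            genes = select_top_genes(X[train], y[train])
        else:
            genes = preselected
        Xtr, xte = X[train][:, genes], X[i, genes]
        # Nearest-centroid classifier on the selected genes.
        c0 = Xtr[y[train] == 0].mean(axis=0)
        c1 = Xtr[y[train] == 1].mean(axis=0)
        pred = 0 if np.linalg.norm(xte - c0) < np.linalg.norm(xte - c1) else 1
        correct += (pred == y[i])
    return correct / len(y)

# Biased: genes chosen once, using ALL samples -- every test sample leaked in.
biased = loocv_accuracy(X, y, preselected=select_top_genes(X, y))
# Unbiased: gene selection repeated inside each cross-validation fold.
unbiased = loocv_accuracy(X, y)

print(f"biased accuracy:   {biased:.2f}")    # far above chance on pure noise
print(f"unbiased accuracy: {unbiased:.2f}")  # near 0.5, as it should be
```

The only difference between the two numbers is whether gene selection happened inside or outside the cross-validation loop; the data are identical, and identically random.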
The problem with methods such as these is that they work very well on random data. That’s right: many of the methods used to make discoveries in today’s microarray data will lead to discoveries in random data as well. I’ve provided a variety of random data sets that mimic the distributions of oligo and two-color cDNA arrays at the following URL: http://www.partek.com/public_data/random-genes. If you are able to find differentially expressed genes or interesting clusters in these data sets, or are able to classify the groups, there is something wrong with your approach.
The problem of false discoveries is only one of many problems with current practices that hurt the credibility of our industry. Other commonly accepted practices need to be questioned as well, including failure to protect against confounding during experimental design, analyzing two-color arrays as ratios, declaring genes “undetected” if their signals are too weak, and setting negative values to zero, to name just a few.
Microarray technology holds great promise. But there are too many false discoveries and exaggerated claims arising from microarray experiments today. Unfortunately, ignoring the problem does not make it go away. My advice: seek the help of a statistician when designing and analyzing microarray experiments, and question everything you read — including this.
Opposite Strand is a forum for readers to express opinions and ideas. Submissions may be e-mailed to [email protected]