**AT A GLANCE**

Chief of Biometric Research, National Cancer Institute

Current research interests: Bayesian methods in clinical trial design and analysis, developing methods for analyzing genome sequence and expression data to identify cancer-related genes, define their functions, determine the steps of tumor development, identify molecular targets, and develop genome-based approaches to the prevention, detection, diagnosis and treatment of cancer.

*Myths about microarray data analysis abound in the research community, according to Richard Simon, Chief of the National Cancer Institute’s Biometric Research division.*

*At the “Macro Results through Microarrays 3” Conference in Boston last week, Simon attempted to disabuse the audience of these misconceptions, and suggested more effective ways to analyze data. Following is a summary of his talk.*

**At least** in the field of cancer, there is a real gap between the state of practice and the state of the knowledge base. My greatest interest is to help people narrow that. And the first step in narrowing that gap is to unlearn what we have learned about microarray analysis, the microarray myths. One of these myths is that the only challenge is managing the mass of data. It certainly is a challenge for a lab or a larger organization, but it is not the only challenge. People also originally had the misconception that microarray data analysis was just collecting a bunch of expression profiles, putting them into a black box, and looking for interesting patterns. Although we are not using microarrays to study gene-specific mechanistic hypotheses, nevertheless most of the ways we are using microarrays do have a biological objective, and that objective really needs to drive the design and the analysis.

Given this biological objective, it is a myth that cluster analysis is the generally appropriate method of data analysis. On the other hand, people are enamored with complex classification algorithms and expect that they will perform better than simple algorithms. But there is empirical data showing that complex methods do not work better than simple methods. That said, a simple solution in the form of a prepackaged analysis tool is not a substitute for collaboration with statistical scientists on complex problems. Many of us who are involved in large-scale experimentation have found that it is very valuable to have a statistician as part of the team.

Now some of this misinformation about how to analyze data is perpetrated by statisticians who do not understand the biology, and who really answer the wrong questions.

Many data analysis methods, for example, address the statistical significance of a ratio of intensities on one array. Other methods address the significance of differences in intensity between two arrays (the Affymetrix software does this, for example), and many methods address the significance of the difference in expression levels between two RNA samples that may be hybridized to multiple arrays. But in general, these are not the biologically meaningful questions, and it is important to distinguish among the levels of replication. Dividing an RNA sample into multiple aliquots hybridized to multiple arrays is not the same thing as sampling cells from multiple individuals in the population and looking at their expression profiles.

In many situations, where there are many biological questions being asked, replication of samples should generally be at the highest level, the subject level. You want to make inferences about expression profiles for one type of tissue compared to expression profiles for another type of tissue, and about how expression profiles of tissues in individuals with a disease differ from those of the same type of tissue in individuals without the disease. These are the kinds of problems we call class comparison and class prediction. We want to make biological inferences about the disease, not about two specific RNA samples whose expression may be influenced by all kinds of specific effects.

With microarrays, class comparison and class prediction types of problems are very common, more common than what we call class discovery. These types of problems are not really clustering problems. Clustering uses all of the genes or all of the highly variable genes, and is not necessarily sensitive to the genes that distinguish these classes. For class comparison or class prediction problems, supervised methods of analysis are better than unsupervised methods such as clustering algorithms, because supervised methods utilize class identifiers as part of developing the predictor.

For class comparison, the state of the art today is comparing classes on a gene-by-gene basis using statistical tests and controlling at the margin. These comparisons involve many different kinds of tests. Permutation-based tests are better than t-tests, because they don’t depend on the assumption of Gaussian distribution. You are also not pooling different genes that have different within-class variances. Instead you are utilizing information from different genes and giving a standard error. In addition to these hierarchical models, there is also a body of analysis of variance methods using the log intensities.
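As a rough illustration of the gene-by-gene permutation test mentioned above, the following Python sketch computes a two-sided p-value for a single gene by repeatedly shuffling the class labels. The function name, the choice of the absolute difference in class means as the test statistic, and the number of permutations are all assumptions made for this example, not details from the talk.

```python
import numpy as np

def permutation_p_value(values, labels, n_perm=10000, seed=0):
    """Two-sided permutation p-value for one gene (illustrative sketch).

    values: log expression measurements for one gene, one per subject.
    labels: boolean array, True for class 1 and False for class 2.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    # Observed statistic: absolute difference between the two class means.
    observed = abs(values[labels].mean() - values[~labels].mean())

    count = 0
    for _ in range(n_perm):
        # Shuffle class labels; class sizes stay the same.
        shuffled = rng.permutation(labels)
        stat = abs(values[shuffled].mean() - values[~shuffled].mean())
        # Small tolerance guards against floating-point ties.
        if stat >= observed - 1e-12:
            count += 1

    # Add one to numerator and denominator so the p-value is never zero.
    return (count + 1) / (n_perm + 1)
```

Because the reference distribution is built from the data themselves, no Gaussian assumption is needed; in practice the function would be applied to each gene in turn.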

Also, there are global tests, first asking ‘are these classes different with regard to expression profiles overall?’ then looking at what genes are responsible for the difference in expression. There are a number of ways of doing this type of comparison. In other kinds of experiments, where a relatively small number of comparisons are being made, the tradition is to control the comparison so you have a small probability of making any false claims as part of the experiment. With microarray data you generally don’t want to be that conservative. You do want to control false claims. You don’t want to have 50 false positives in that gene list of interesting differentially expressed genes. But if we have a gene list with five or ten false positives, that’s probably OK. We want to control the number of false discoveries or the proportion of false positives.

Some relatively simple procedures have been developed for this. Say you are comparing two classes with regard to ten thousand genes, one at a time, and you want to have no more than ten false discoveries. If you do a t-test for each gene at the .001 level, which is ten divided by ten thousand, it gives you an expected number of false discoveries of ten. That’s true even though the genes may be very highly correlated. If you want the total estimated proportion of false discoveries to be less than some constant gamma, there’s a simple procedure to do that. The more conservative Bonferroni control, having less than a five percent chance of any false discoveries, is generally too conservative.
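The arithmetic behind these thresholds can be written out as a short worked equation. Here $n$ (the number of genes) and $\alpha$ (the per-gene significance level) are symbols introduced for this illustration. The bound follows from adding up the per-gene error probabilities, which is why correlation among genes does not affect it:

$$
E[\text{false discoveries}] \le n\,\alpha = 10{,}000 \times 0.001 = 10,
\qquad
\alpha_{\text{Bonferroni}} = \frac{0.05}{10{,}000} = 5 \times 10^{-6}.
$$

The Bonferroni per-gene level is two hundred times stricter than .001, which gives a sense of how much sensitivity it gives up.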

Suppose you calculate a p-value for each of n genes (say n is 10,000) from a two-sample t-test or permutation test, and rank the genes from the smallest p-value to the largest. If you call the top i genes discoveries, then n times the i-th smallest p-value, n·p(i), is the expected number of false discoveries. Divide this by i, the number of discoveries, and you get the estimated false discovery rate. You then find the largest index i for which this ratio is less than gamma.
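One reading of this procedure can be sketched in Python as follows. This is a minimal sketch, not the speaker's implementation: the function name `select_genes_by_fdr`, the default `gamma`, and the use of "less than or equal" at the cutoff are assumptions made for illustration.

```python
import numpy as np

def select_genes_by_fdr(p_values, gamma=0.1):
    """Pick genes so the estimated false discovery proportion is at most gamma.

    Returns the indices of the selected genes (in order of increasing
    p-value) and the estimated proportion at every rank.
    """
    p = np.asarray(p_values, dtype=float)
    n = p.size

    order = np.argsort(p)          # gene indices, smallest p-value first
    ranked = p[order]              # p(1) <= p(2) <= ... <= p(n)
    ranks = np.arange(1, n + 1)    # i = 1..n, the number of discoveries

    # n * p(i) estimates the number of false discoveries among the top i
    # genes; dividing by i gives the estimated false discovery proportion.
    est_fdr = n * ranked / ranks

    passing = np.nonzero(est_fdr <= gamma)[0]
    if passing.size == 0:
        return np.array([], dtype=int), est_fdr

    cutoff = passing[-1]           # largest rank that still meets gamma
    return order[: cutoff + 1], est_fdr
```

For example, with ten p-values of which three are small, `select_genes_by_fdr(p, gamma=0.15)` would return only the indices of the genes whose rank keeps the estimated proportion at or below 0.15.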

There are other ways to do this. For one, we used a very simple class predictor, the compound covariate predictor, or CCP. This is a linear combination of the log ratios. We found that if you don’t do cross-validation, it is a class predictor in over 90 percent of the classes. Then there is diagonal linear discriminant analysis. In this method, you assume that the log ratios in each of the two classes have a multivariate Gaussian distribution, and that the two classes have different mean vectors. This is similar to neural network classification, but there are no hidden nodes and there is a linear transfer function at each node. Taking the log intensities of the genes, the two classes have two different mean vectors. The method uses the data to estimate those mean vectors, and makes predictions based on which of these multivariate distributions gives the highest probability to the sample you are trying to predict for. Doing diagonal linear discriminant analysis is similar to Todd Golub’s weighted voting method. It is also very similar to the compound covariate predictor.
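To make the diagonal linear discriminant idea concrete, here is a minimal Python sketch. It assumes equal prior probabilities for the classes and a pooled within-class variance for each gene, with genes treated as uncorrelated (the "diagonal" part). The function names `fit_dlda` and `predict_dlda` are hypothetical and chosen only for this example.

```python
import numpy as np

def fit_dlda(X, y):
    """Estimate class mean vectors and pooled per-gene variances.

    X: array of shape (n_samples, n_genes) holding log ratios.
    y: class label for each sample.
    """
    X = np.asarray(X, dtype=float)
    classes, idx = np.unique(np.asarray(y), return_inverse=True)

    # One mean vector per class.
    means = np.array([X[idx == k].mean(axis=0) for k in range(classes.size)])

    # Pooled within-class variance for each gene (diagonal covariance).
    resid = X - means[idx]
    var = (resid ** 2).sum(axis=0) / (X.shape[0] - classes.size)
    var = np.maximum(var, 1e-12)   # guard against zero variance

    return classes, means, var

def predict_dlda(model, X_new):
    """Assign each new sample to the class with the highest Gaussian likelihood."""
    classes, means, var = model
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float))

    # With a shared diagonal covariance and equal priors, the most likely
    # class is the one with the smallest variance-scaled squared distance.
    dist = (((X_new[:, None, :] - means[None, :, :]) ** 2) / var).sum(axis=2)
    return classes[dist.argmin(axis=1)]
```

The resulting rule is linear in the log ratios, which is one way to see why it behaves so much like other simple weighted-sum predictors.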

In summary, Terry Speed’s group at UC Berkeley compared different methods of microarray data analysis using certain large datasets, and their conclusion was that the simplest methods worked the best.

The other part of state-of-the-art microarray informatics is to use software that incorporates good statistical methods and good statistical design. We have a package, BRB tools, that we make available to researchers on our website, *http://linus.nci.nih.gov/~brb*.