At a Glance
Co-directs the multiplexed gene analysis (MGA) core at Washington University’s Siteman Cancer Center, a National Cancer Institute cancer center. Assistant professor of biostatistics in medicine, Washington University School of Medicine.
Received his PhD in biostatistics from University of Pittsburgh Graduate School of Public Health.
Currently running for president of the Classification Society of North America, a society for professionals developing and applying statistical tools (www.pitt.edu/~csna).
Research interests include developing new statistical methods for clustering and classification analysis.
Q: How does the Multiplexed Gene Analysis Core fit into the activities of the Siteman Cancer Center?
A: The MGA core is designed to provide cancer center members with access to emerging technologies that allow for high-throughput molecular and genetic analysis. Currently, the core supports Affymetrix GeneChips for gene expression profiling in human, mouse, rat, and yeast. A centralized bioinformatics resource that will allow investigators to collect, organize, and analyze large gene expression data sets is currently in development. In the future, other technologies for gene expression analysis (e.g., spotted microarrays), mutation scanning, and proteomics will be developed and supported.
Mark Watson [the co-director of the core] runs the biology end of the core, handling samples, isolating RNA, hybridizing the chips, etc., and I run the informatics end, managing the data, providing researchers with statistical advice, and running analyses. We talk on a daily basis, and our combined areas of expertise have formed what I think is a very successful and long-term collaboration. To date, the MGA core has run well over 1,500 chips for hundreds of different studies.
Q: Why did the facility choose to go with the Affymetrix platform initially?
A: Mark established the MGA Core as an Affymetrix facility to allow for quick, plug-and-play GeneChip experiments. Basically, we buy Affymetrix chips, which come in a box, and can run an experiment that afternoon. This has allowed cancer center investigators to quickly get involved with GeneChip studies with very little fuss.
At the same time Mark was setting up the Affymetrix core, a spotted microarray facility was being developed jointly in the departments of genetics and molecular microbiology. The spotted microarray is being used more for non-cancer basic science research.
Q: So how will you combine the two platforms?
A: Our plan is to use Affymetrix for first-round analyses. The genes identified by Affymetrix as differentially expressed across a few samples will then be spotted onto microarrays for larger-scale studies where perhaps hundreds of samples can be tested. Our two-system approach alleviates the problems of cost and sequence selection.
I collaborated with Anne Bowcock, who is co-director of the division of human genetics at Washington University, on a study comparing psoriatic with normal skin samples. In this study, our statistical analysis of 32 Affymetrix chips was used to rank the 12,000 gene sequences on the Affymetrix HU95A chips in order of their differences in expression across the two tissue types. Anne has used this ranking to select about 1,000 genes to spot onto microarrays. She plans on running hundreds of samples to learn more about the genetics of psoriasis. This study is an excellent example of how our two-system approach works.
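The interview does not say which statistic was used to rank the genes. As a minimal sketch of the general idea only, the Python below ranks genes by the absolute value of Welch's two-sample t statistic, which is an assumption chosen for illustration. The data are synthetic, and the 16/16 split of the 32 chips, the gene names, and the function names are all hypothetical.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's two-sample t statistic for two lists of expression values."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def rank_genes(expression, labels, group_a="psoriatic", group_b="normal"):
    """Return gene names sorted by |t|, largest difference first.

    expression: dict mapping gene name -> list of values, one per chip
    labels:     list of group labels, one per chip, in the same order
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, g in zip(values, labels) if g == group_a]
        b = [v for v, g in zip(values, labels) if g == group_b]
        scores[gene] = abs(welch_t(a, b))
    return sorted(scores, key=scores.get, reverse=True)


# Synthetic example: 32 chips, split 16/16 (the real split is not given).
rng = random.Random(0)
labels = ["psoriatic"] * 16 + ["normal"] * 16

# 50 genes with no real difference between the two tissue types.
expression = {
    f"gene_{i}": [rng.gauss(8.0, 1.0) for _ in labels] for i in range(50)
}
# One planted gene whose level is clearly higher in psoriatic samples.
expression["gene_shifted"] = [
    rng.gauss(12.0 if g == "psoriatic" else 8.0, 1.0) for g in labels
]

ranking = rank_genes(expression, labels)
top_genes = ranking[:10]  # candidates to carry forward to a spotted array
print(top_genes)
```

In the two-system approach described above, the top of such a ranking is what would be carried forward to the spotted microarray; the cutoff (about 1,000 genes in the psoriasis study) is a study-specific choice.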
Q: What statistical tools do you use, and why do you use them?
A: SAS statistical software is the primary tool we use for managing and analyzing microarray data. As a professional statistician, it is important that I be able to control the fine tuning of a statistical algorithm. SAS gives me that kind of control. First, it has full relational database capabilities and SQL commands. Second, it has a comprehensive suite of statistical tools such as cluster analysis, principal components analysis, graphics, neural networks, and self-organizing maps. Also, its programming language can be used for transforming, normalizing, and scaling data in any way. Third, it allows web access to the database.
However, SAS is very difficult to learn, and it is unreasonable to expect researchers to learn how to program in order to analyze their data. We have therefore also invested in Spotfire, and have an academic license that allows us to distribute Spotfire to any Washington University investigator. Spotfire provides many of the basic statistical methods needed, such as hierarchical clustering, k-means analysis, and graphics. In a weekly seminar series sponsored by the two microarray facilities, we often teach investigators how to use Spotfire and what the basic clustering algorithms are doing.
We are currently implementing more advanced statistical methods, such as gene shaving and adjustment of P-values for multiple testing, in a form where the investigator can access a web page, upload their data to our Sun server, and perform one of these advanced analyses.
Q: At the recent microarray analysis conference, people discussed the inapplicability of P-values to high-density microarrays, and proposed alternative approaches such as false discovery rate (FDR). Others have proposed using significance analysis of microarrays (SAM). What do you think of these approaches?
A: This is the multiple testing problem in statistics. Consider a GeneChip experiment where you have measured the expression level of one gene across many samples, with some samples being tumors and the other samples normal tissue. A simple analysis to decide if the expression levels differ across the two types of sample is the t-test. From mathematical statistics we know that the probability of saying the expression level does differ, when the difference is only due to chance, is the P-value. This is basic statistics.
Multiple testing happens when you are doing the t-test for more than one gene. The P-value no longer means what it does for the one gene example above. Imagine we have done 12,000 t-tests on the genes on an Affymetrix chip. If a gene has a P-value of 0.01, this does not mean that the difference in the gene is occurring by chance with probability 0.01. The ‘true’ P-value for this gene will be much higher, say P = 0.40. (This can be shown mathematically, but for now please take my word.)
When we run 12,000 t-tests and decide every gene with a P-value less than 0.05 is important, we will be wrong in many cases; most of those genes’ true P-values will be much higher than 0.05. The FDR, SAM, and multiple-testing P-value adjustment methods attempt to take this into account and provide a better metric for deciding which genes are important.
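To make the scale of the problem concrete, here is a small Python sketch. It computes how many genes would pass a 0.05 cutoff by chance alone if none of the 12,000 genes truly differed, and then applies two standard adjustments: Bonferroni, and the Benjamini-Hochberg procedure, which is one common way of controlling the false discovery rate. The example P-values and function names are made up for illustration, the interview does not say which adjustments the core implements, and SAM is not shown.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each P-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values (false discovery rate control)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that the adjusted
    # values never increase as the raw P-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# If no gene truly differs, each test still has a 5% chance of passing.
n_genes = 12000
alpha = 0.05
expected_false_positives = n_genes * alpha  # about 600 genes

# Hypothetical raw P-values for a handful of genes.
p_values = [0.0001, 0.001, 0.01, 0.02, 0.04, 0.30, 0.75]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)

for p, b, q in zip(p_values, bonf, bh):
    print(f"raw={p:<7} bonferroni={b:.4f} bh={q:.4f}")
```

Even in this seven-gene toy case, a raw P-value of 0.04 no longer clears 0.05 after either adjustment; with 12,000 tests the effect is far larger.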
Q: So which is the preferred method for winnowing out the important genes?
A: I have a confession to make about statistics: there is often not one best method but many methods that do the same thing. My recommendation is to run all these methods and compare the results. Imagine a study where FDR, SAM, and multiple testing adjustment all rank the genes in the same order of importance. This is a good analysis that an investigator can trust.
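One simple way to compare the results of several methods is to measure how closely their gene rankings agree. The Python sketch below uses two illustrative measures, Spearman rank correlation and the overlap of the top-k genes. Both are assumptions about how such a comparison might be done, not a procedure described in the interview, and the gene names and rankings are hypothetical.

```python
def spearman(rank_a, rank_b):
    """Spearman rank correlation between two orderings of the same genes.

    Assumes both lists contain the same genes with no ties.
    """
    position_b = {gene: i for i, gene in enumerate(rank_b)}
    n = len(rank_a)
    d_squared = sum((i - position_b[gene]) ** 2 for i, gene in enumerate(rank_a))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


def top_k_overlap(rankings, k):
    """Genes that appear in the top k of every ranking."""
    common = set(rankings[0][:k])
    for ranking in rankings[1:]:
        common &= set(ranking[:k])
    return common


# Hypothetical rankings of six genes from three different methods.
method_a = ["g1", "g2", "g3", "g4", "g5", "g6"]
method_b = ["g1", "g3", "g2", "g4", "g6", "g5"]
method_c = ["g2", "g1", "g3", "g5", "g4", "g6"]

agreement_ab = spearman(method_a, method_b)
shared_top3 = top_k_overlap([method_a, method_b, method_c], k=3)

print(round(agreement_ab, 3))
print(sorted(shared_top3))
```

A correlation near 1 and a large shared top-k set correspond to the situation described above, where the methods put the genes in essentially the same order.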
I can’t emphasize enough that microarray data analysis is essentially a clustering problem. The basic question is which genes have similar expression level patterns. After this is answered, it is a biological problem to determine what these similarly expressed genes are doing.
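As a rough illustration of what "similar expression level patterns" can mean in practice, the Python sketch below scores similarity with the Pearson correlation between gene profiles and groups genes with a simple greedy threshold rule. This is a deliberately minimal stand-in, not hierarchical clustering, k-means, or any method the core is said to use. The profiles, the 0.9 threshold, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def group_by_correlation(profiles, threshold=0.9):
    """Greedy grouping: a gene joins the first group whose seed gene
    (the group's first member) it correlates with at or above the
    threshold; otherwise it starts a new group."""
    groups = []
    for gene, profile in profiles.items():
        for group in groups:
            if pearson(profiles[group[0]], profile) >= threshold:
                group.append(gene)
                break
        else:
            groups.append([gene])
    return groups


# Hypothetical expression profiles across five samples.
profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [2.1, 3.9, 6.2, 8.0, 9.9],  # rises along with gene_a
    "gene_c": [5.0, 4.1, 2.9, 2.0, 1.1],  # falls as gene_a rises
    "gene_d": [9.8, 8.1, 6.0, 4.2, 1.9],  # falls along with gene_c
}

groups = group_by_correlation(profiles)
print(groups)
```

Note that correlation groups genes by the shape of their profiles rather than their absolute levels, which is why gene_a and gene_b land together despite differing in scale.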
John Hartigan at Yale is a leader in classification and clustering theory. He has argued that cluster analysis is a data reduction technique, and the concepts from classical statistics, like P-values, are inappropriate. I tend to agree with this view. P-values for microarray data provide a screening tool, but investigators need to avoid interpreting them the way they learned to interpret them in basic statistics class. I think the analysis of microarray data must be viewed as a data reduction problem based on some type of clustering analyses and visualization.
I believe the interest in microarrays will result in increased research opportunities for statistical methodologists to find ways of validating the results of cluster analysis, perhaps even with P-values and confidence intervals. I currently have two NIH grants under review to develop methods to do exactly this.
Core website: www.siteman.wustl.edu/physician/research/shared_multiplexed.shtm. Shannon’s web page: http://ilya.wustl.edu/~shannon.