AT A GLANCE
Professor of statistics in the department of mathematics at the University of Queensland, Brisbane, Australia
PhD, Mathematics, University of Queensland
DSc, Mathematics, University of Queensland
Recently published an article, “Selection bias in gene extraction on the basis of microarray gene-expression data,” in Proceedings of the National Academy of Sciences, co-authored with Christophe Ambroise of the Unité Mixte de Recherche/Centre National de la Recherche Scientifique in France.
As a mathematician, how did you get into microarray analysis work?
I am in the statistics section of the maths department at the University of Queensland, and I attended the International Biometric Conference in Berkeley, [Calif.], about two years ago, where people from Stanford and Berkeley presented papers using classification techniques, specifically cluster analysis and discriminant analysis. I thought that microarray [analysis] was a very interesting, though not straightforward, application of these techniques.
So have you found microarray analysis to be needed back home at your university?
My university, the University of Queensland, is supposed to become the Australian focus of molecular biology. We’re getting an eight-story building which will house people from this university and also from CSIRO, the Commonwealth Scientific and Industrial Research Organisation. It is being funded by the state government, by the Australian government, and from other sources. Next year we host the ISMB (Intelligent Systems for Molecular Biology) meeting here.
So with the PNAS paper on selection bias, how did you and Dr. Ambroise come to attack this problem, and attack it together?
Christophe has been interested in problems in my area, mixture models. Finite mixture models, in particular, are a very flexible method of modeling for statistical inference in general, and for cluster analysis in particular: they give cluster analysis a model-based approach and a sound mathematical framework. In 1988, I wrote a book jointly with Kay Basford, of [the University of Queensland], on mixture models. Christophe was interested in mixture models, and he said he would like to spend six months working at this university in the summer period. When he came here, I got him interested in microarrays.
In your paper, you talk about the problem with leave-one-out error estimates. It seems that you are saying that leaving out a particular sample or data point, and then seeing if you get a comparable result, is circular because the rule is tested on the very tissues used to select the genes that make up the rule. Can you explain this further?
There’s nothing wrong with the leave-one-out method as such; the only thing is, in this particular case it is highly variable. Instead of leaving one out, you can do a 10-fold cross-validation, where you divide the sample into 10 subsets of roughly equal size. The real problem is that people have not done an external cross-validation. They have done an initial gene selection, kept those genes, and performed cross-validation on them. Each time a tissue or subset of tissues is left out, the gene selection should be redone on the remaining tissues, giving a new set of genes. Otherwise, you are testing the rule on tissues used to select the genes in the first instance; you have this sort of incestuous situation. You expect the rule to do well on the tissues used to select those genes, but when you apply it to new tissues that come along it has a higher error rate. In the paper, when we did take the selection bias into account, the error was 20 percent when we used 16 genes.
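The circularity McLachlan describes can be demonstrated in a few lines of numpy. This is a toy sketch, not the authors’ code: the nearest-centroid classifier, the mean-difference gene ranking, and all the sizes are illustrative. On pure noise (where the true error rate is 50 percent), “internal” cross-validation that selects genes once on all tissues looks far better than the honest “external” version that redoes the selection inside each leave-one-out fold.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 40, 1000, 10            # tissues, genes, genes kept after selection
X = rng.standard_normal((n, p))   # pure noise: no real class signal
y = np.repeat([0, 1], n // 2)     # arbitrary two-class labels

def top_genes(X, y, k):
    # rank genes by absolute difference in class means, keep the top k
    d = np.abs(X[y == 0].mean(0) - X[y == 1].mean(0))
    return np.argsort(d)[-k:]

def nearest_centroid(Xtr, ytr, Xte):
    # classify each test tissue to the nearer class centroid
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    return (np.linalg.norm(Xte - c1, axis=1) <
            np.linalg.norm(Xte - c0, axis=1)).astype(int)

def loo_error(X, y, external):
    genes_fixed = top_genes(X, y, k)        # selection done once, on ALL tissues
    errs = []
    for i in range(n):
        tr = np.arange(n) != i
        # external CV: redo the gene selection without the held-out tissue
        g = top_genes(X[tr], y[tr], k) if external else genes_fixed
        pred = nearest_centroid(X[tr][:, g], y[tr], X[i:i + 1, g])
        errs.append(pred[0] != y[i])
    return float(np.mean(errs))

err_internal = loo_error(X, y, external=False)  # optimistically biased
err_external = loo_error(X, y, external=True)   # honest, near 50% on noise
print(err_internal, err_external)
```

On data with no signal at all, the internal estimate is driven toward zero by the selection step, while the external estimate stays near chance, which is exactly the selection bias the PNAS paper quantifies.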
Now you suggest an alternative, the 10-fold cross-validation. How does that work and what if you have few tissues?
Using 10-fold cross-validation is quite common in statistics; some people do it to save computation. You divide the tissues into ten subsets of roughly equal size, form the rule on nine of them, test it on the tenth, and rotate through all ten. If you had a smaller number of tissues, it is true that you cannot do much in forming a prediction rule. But [for this situation] we describe the bootstrap procedure. It’s an alternative where one draws a sample with replacement from the original tissues and forms a rule from it. Then you apply the rule to the tissues not selected in the bootstrap sample. But the plain bootstrap procedure is not always so effective because the rule can be overfitted. When you’ve got an overfitted rule, professors Brad Efron and Rob Tibshirani from Stanford suggest a modification, the .632+ estimate.
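The bootstrap procedure can be sketched as follows. This is a minimal numpy illustration, not the paper’s implementation: the data, the nearest-centroid rule, and the number of bootstrap replicates are all made up, and it shows the plain .632 estimate (the .632+ version Efron and Tibshirani propose adds a further correction for overfitting that is omitted here). Each replicate draws tissues with replacement, forms the rule on the draw, and tests it on the out-of-bag tissues left out of that draw.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
X = rng.standard_normal((n, 5))
X[:15] += 1.0                      # shift class 0 so there is some real signal
y = np.array([0] * 15 + [1] * 15)

def nearest_centroid(Xtr, ytr, Xte):
    # classify each test tissue to the nearer class centroid
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    return (np.linalg.norm(Xte - c1, axis=1) <
            np.linalg.norm(Xte - c0, axis=1)).astype(int)

B = 200                            # number of bootstrap replicates (illustrative)
oob_errs = []
for _ in range(B):
    idx = rng.integers(0, n, n)            # draw n tissues WITH replacement
    oob = np.setdiff1d(np.arange(n), idx)  # tissues not in the bootstrap sample
    if len(oob) == 0 or len(set(y[idx])) < 2:
        continue                           # skip degenerate draws
    pred = nearest_centroid(X[idx], y[idx], X[oob])
    oob_errs.append(np.mean(pred != y[oob]))

err_boot = float(np.mean(oob_errs))                       # out-of-bag error
err_app = float(np.mean(nearest_centroid(X, y, X) != y))  # apparent (resubstitution) error
err_632 = 0.368 * err_app + 0.632 * err_boot              # the .632 estimate
print(err_632)
```

The weighting reflects that a bootstrap sample contains about 63.2 percent of the distinct tissues on average, so the out-of-bag error is slightly pessimistic and the apparent error optimistic; the .632 estimate blends the two.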
Other than the selection bias you describe in the leave-one-out model, what is the biggest problem with microarray analysis?
The one major issue is actually getting clean data to work with in the first instance. A lot of people are working on these problems, and I think they should basically be resolved in the short term. Assuming you’ve got data you’re happy with, I have also been looking at ways to cluster the tissues. As far as clustering of the tissues goes, the literature has focused mainly on hierarchical or K-means methods, which don’t take into account that genes can have different [expression] variances in the different tissue classes. I’ve been interested in a model-based approach to clustering and have tried to apply it to the clustering of the tissues, except that it’s a non-standard problem: usually the number of objects to be clustered is very large relative to the number of features on each observation, whereas here the number of genes far exceeds the number of tissues. I’ve been developing a program, EMMIX-GENE, which I published in the March issue of Bioinformatics. It is a new algorithm for clustering tissues, and it takes into account that the genes have different variances. With K-means, you can only produce spherical clusters; this [algorithm] allows you to produce elliptical-shaped clusters of tissues. It has three stages: it tries to select genes that are useful in discriminating among the tissues, then tries to [divide] those genes into groups that are highly similar (i.e., highly correlated), then tries to cluster the tissues using the sample mean of each group of genes. You may have to start with 7,000 genes, and may screen them and wind up with 2,000 genes. Those are clustered into 40 groups, and each group is represented by its mean. The point is to cluster the tissues on the basis of the 40 group means, whereas you were originally trying to cluster 120 tissues on the basis of 7,000 genes [in each].
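The three-stage pipeline can be sketched in numpy. This is a toy illustration of the idea only, not EMMIX-GENE itself: the real program fits normal mixture models at each stage, whereas this sketch substitutes a hand-rolled k-means for both the gene grouping and the final tissue clustering, and every size and name is made up.

```python
import numpy as np

rng = np.random.default_rng(2)
n_tissues, n_genes, n_groups = 20, 200, 5

# toy expression matrix: tissues x genes, with two tissue classes
X = rng.standard_normal((n_tissues, n_genes))
X[:10, :40] += 2.0                 # the first 40 genes separate the classes

def kmeans(X, k, iters=50, seed=0):
    # minimal k-means; a stand-in for EMMIX-GENE's mixture-model fitting
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

# Stage 2 (stage 1, gene screening, omitted): group similar genes,
# treating each gene as a vector of its values over the tissues
gene_groups = kmeans(X.T, n_groups)

# Stage 3: represent each tissue by the mean of each nonempty gene group,
# reducing 200 gene features to at most 5 "group mean" features,
# then cluster the tissues in that reduced space
reduced = np.column_stack([X[:, gene_groups == j].mean(1)
                           for j in range(n_groups)
                           if np.any(gene_groups == j)])
tissue_labels = kmeans(reduced, 2, seed=1)
print(reduced.shape)
```

The payoff is the dimension reduction McLachlan describes: the tissues are clustered on a handful of group means rather than on thousands of raw genes, which makes the mixture-model fitting tractable.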
So it sounds a little like Principal Component Analysis, in how it winnows down the complex dataset to fewer components.
This method differs from PCA in that PCA looks at the variation across the whole sample of tissues, whereas [this method] models the variation within each cluster of tissues. Because PCA is done across the whole sample, internal variation within clusters may dominate the between-cluster variation, so PCA is not guaranteed to succeed. People have been getting useful results using PCA, and I don’t want to knock it, but it does have its limitations when it comes to clustering. This is a more sophisticated approach.
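The failure mode McLachlan points to is easy to reproduce. In this toy numpy example (illustrative data only, not from the interview), two clusters are separated along one axis but have much larger spread along another; the first principal component then follows the big within-cluster spread and discards the direction that actually separates the clusters.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
# two tight clusters separated along x, but with much larger spread along y
x = np.concatenate([rng.normal(-1, 0.2, n), rng.normal(1, 0.2, n)])
y = rng.normal(0, 5.0, 2 * n)
X = np.column_stack([x, y])
Xc = X - X.mean(0)

# first principal component = leading eigenvector of the sample covariance
cov = Xc.T @ Xc / (len(X) - 1)
vals, vecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
pc1 = vecs[:, -1]

# PC1 aligns with the large within-cluster spread (y),
# not with the axis that separates the two clusters (x)
print(abs(pc1[1]) > abs(pc1[0]))
```

A model-based clustering that fits a separate covariance within each cluster is not misled in this way, because the large spread is absorbed into the within-cluster models rather than dominating a single global decomposition.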
So where are you taking this research?
Well, we are developing a Windows version of EMMIX-GENE with a view toward commercialization. Also, people are talking about having, instead of 100 tissues, up to 1,000 tissues. It will mean some of these methods will be able to be applied in a less approximate form because we will have more tissues available. It seems to me that everyone is getting funds to produce microarray data, and [as a result] I think we’re going to be inundated with microarray data.