AT A GLANCE
Assistant professor of pediatrics, Harvard Medical School.
Researcher, pediatric oncology department at the Dana-Farber Cancer Institute and the Whitehead Institute/MIT Center for Biomedical Research.
Received his MD in 1989 from the University of Chicago, where he was given the McLean Award for most meritorious research. Completed clinical and subspecialty training at Harvard Medical School.
A PLETHORA of papers has been published lately supporting the use of microarrays to identify unique gene expression profiles for specific cancers — leading to speculation that these profiles can eventually be used as powerful diagnostic tools. Todd Golub, a researcher at the Whitehead Institute/MIT Center for Biomedical Research and the Dana-Farber Cancer Institute’s department of pediatric oncology, has spearheaded this trend, co-authoring 12 scientific papers on the development of gene expression profiles in cancers since 1999, and receiving the Discover Magazine Award for Technological Innovation for his use of microarrays as a cancer diagnostic tool.
Most recently, Golub served as a principal author on the January 24, 2002 Nature paper, “Prediction of central nervous system embryonal tumor outcome based on gene expression.” In this paper, Golub and his co-authors found a way to distinguish gene expression profiles for medulloblastomas, the most common type of malignant childhood brain tumor, from other brain tumors such as primitive neuroectodermal tumors. Golub spoke to BioArray News recently about this paper and his research:
Turning to the Nature paper: how did you come to apply microarrays to the experimental questions involved in medulloblastoma molecular profiling? How do you see your work adding to the research in this area?
The brain tumor study grew out of a general interest in taking a bird's-eye view of human cancers using genome-wide expression profiling. As a pediatric oncologist, I found medulloblastoma of particular interest because it is an example of the clinical heterogeneity of tumors that are morphologically the same. It was emblematic of the general problem of wanting to understand who is going to respond to therapy and who isn't, and of the fact that we have very crude and unsophisticated approaches to treatment for most of these cancers.
In general, the studies are not going to immediately impact patient treatment: the results reported to date should be considered preliminary. But given the encouraging findings in a number of areas, they suggest that it's worth everyone's time and effort to do the clinical studies to validate these results at clinical centers on multiple patient populations.
Of course, for relatively uncommon childhood cancers in particular, doing large-scale follow-up studies is not trivial. For medulloblastomas, the number of patients is, thankfully, relatively small.
You chose to use 6,800-gene Affymetrix arrays for your research in this paper. Why did you choose these arrays and were there any particular challenges? Are you using the Affymetrix U133 set now?
We used the 6,800-gene microarrays because they were the highest density arrays available at the time. We are beginning to use the newer U133 arrays, but the arrays themselves are not the issue in this type of experiment. The challenge is the complexity of the question, and in particular the high dimensionality of the data and the relatively low number of samples being analyzed.
In the study, you used these arrays on 99 patient samples. This must mean you used at least 99 arrays. Given the cost of Affymetrix chips and other similar prefabricated chips, even with academic discounts, this would seem to mean the cost of the arrays alone for this study would be at least $35,000 if not more. Is cost a limiting factor in these experiments? Are higher-throughput arrays that can handle more than one sample at a time needed?
Cost is one of the factors in these studies, but the most limiting factor is, without question, the availability of sufficient numbers of high-quality, clinically annotated samples.
I have heard from numerous researchers that oligonucleotide arrays are generally preferable to cDNA arrays because clone validation and other sample-prep stages are much simpler. What is your view?
My personal feeling is that there may be a role for cDNA-based arrays in customized approaches, particularly for unique organisms that don't have commercial products available, where you can do subtraction screens, spot the subtracted clones on an array, confirm them, and subsequently sequence them. I don't see a great future for large clone sets, though.
In analyzing your data from the embryonal tumor experiments, your group used both unsupervised analysis tools, such as principal component analysis, hierarchical clustering, and self-organizing maps, and supervised learning methods. Supervised learning algorithms seem to present a clear advantage, being able to cleanly separate the tumors into two subtypes by gene expression profile. Yet this bifurcation of tumors into survival vs. no survival also seems limited, in that it could miss more cogent distinctions that might separate the tumors into three or more groups. What do you think of this criticism, and is there a way to evolve the supervised learning method to account for these multiple possibilities?
The supervised learning algorithm does force the question into a somewhat oversimplified good-vs.-bad framing, and there are some alternative statistical approaches that we're exploring to address that. We consider this a first general look at the problem. Ultimately, it isn't clear what the best way is to look at this question analytically. Again, it depends on what you are trying to predict: predicting who will relapse and who won't is a different question from predicting who will survive and who won't, or what the period of survival will be. We might approach each of these difficult questions slightly differently.
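The supervised-vs.-unsupervised distinction, and the "high dimensionality, few samples" problem Golub mentions, can be illustrated with a toy sketch. The data below are synthetic, and the nearest-centroid classifier with leave-one-out cross-validation is a deliberately simple stand-in, not the method used in the Nature paper: the point is only that a supervised method consumes outcome labels to build its predictor, and that with far more genes than patients, held-out evaluation is essential.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "expression matrix": 60 patients x 500 genes (many more genes
# than samples, the high-dimensionality problem described above). The first
# 20 genes differ in mean between the two outcome groups; the rest are noise.
n_per_group, n_genes, n_informative = 30, 500, 20
good = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
poor = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
poor[:, :n_informative] += 2.0  # shift the informative genes in one group
X = np.vstack([good, poor])
y = np.array([0] * n_per_group + [1] * n_per_group)

def nearest_centroid_loo(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Supervised: the outcome labels y are used to build the class centroids.
    Each sample is held out, centroids are fit on the rest, and the held-out
    sample is assigned to the nearer centroid.
    """
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        Xtr, ytr = X[mask], y[mask]
        centroids = np.array([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
        pred = np.argmin(np.linalg.norm(centroids - X[i], axis=1))
        correct += int(pred == y[i])
    return correct / len(y)

acc = nearest_centroid_loo(X, y)
print(f"leave-one-out accuracy: {acc:.2f}")
```

An unsupervised method (clustering, PCA) would instead be given X without y and asked what structure emerges, which is why the two families of methods answer different questions, as Golub notes below.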
Data analysis has been an area of much discussion lately. What is the biggest challenge with analysis of microarray data?
I don't think there is a single data analysis issue. Part of the problem has been people seeking one analytical answer to all queries related to microarrays without giving enough thought to the nature of the question. Different questions might require completely different analytical methods. It's important to choose the right analytical approach for the right application. This is not a matter of a contest between supervised and unsupervised learning methods; they have very different roles.
How long do you think it will be before microarrays are used as a prognostic and diagnostic tool? What do you think such an array might look like?
I don’t know how long it is going to take, but I think it’s likely to be at least several years before [gene expression profiling] makes it into the clinic for routine use. I think there are a number of things that need to happen. For one, the results of our and other initial studies need to be confirmed in a research setting. Number two, the pace will differ for the different indications. The way in which these diagnostics will reach clinical implementation hasn’t been worked out. Whether it’s going to be an abstraction of these profiles or whether they will remain in a microarray-type format is just uncharted territory. There is likely to be an initial set of studies that set out to validate the results of the initial microarray observations. A second round of validation studies may then be required if the microarray results are translated into some other platform such as a PCR assay or a customized DNA microarray.
On the question of “do you use microarrays simply for discovery and then abandon them in favor of some other detection method, or do you stay with arrays?” my personal opinion is that our default position should be to stay with arrays and move them into the clinical setting, unless that proves impractical.