AT A GLANCE
Principal Investigator at the Institute for Genomic Research
National Cancer Institute Director’s Challenge — analysis of gene expression in colon tumor metastasis
National Heart, Lung, and Blood Institute, microarray analysis of rodent models of human disease
National Science Foundation, Arabidopsis Chromosome II microarray analysis project
Arabidopsis thaliana functional genomics.
At last week’s MGED IV, the annual meeting of the grassroots Microarray Gene Expression Data Analysis working group in Boston, John Quackenbush of the Institute for Genomic Research spoke about pitfalls and strategies for data normalization. Quackenbush, along with Cathy Ball and Gavin Sherlock of Stanford, heads up an MGED working group on this issue.
This is an abridged transcript of Quackenbush’s talk during a tutorial the working group presented on microarray data normalization last Wednesday.
The starting point for [normalization] is a DNA microarray. Our arrays [each] have 32,448 elements. The array [shown on a screen] is part of a study looking at different tissues in mouse; in this case it’s a comparison of kidney vs. heart. It is part of a project funded by the NHLBI. We’re generating a large body of data on mouse models of human disease. This is part of a tissue encyclopedia we’re trying to build, looking at expression. Looking at it, it’s a pretty good assay. It’s one we are proud of showing.
One of the things you should try and strive for is having every assay be one you are proud of showing. If you ask me, ‘what’s the most important thing for being able to normalize data well?’, I would say [it’s] having good data. If you get really [terrible] data you are not going to be able to fix it. You can’t normalize away garbage. You can only correct the data a little bit. My approach is always that normalization should be like camping. You want to disturb the natural environment as little as possible.
When we think about array experiments, the other thing we often forget is they’re not done in a vacuum. You have to consider tissue samples and RNA samples. You have to know how they were prepared; you need to know what the laboratory conditions are for doing the labeling and the hybridization. You really need to know every step of the process. We don’t do Affymetrix experiments, but I think it’s going to be true of any platform, that all the things that happen in getting this process up and running are things you have to [watch]. You have to ask yourself, ‘what data should I collect?’ The group that organized this meeting, MGED, in fact worked towards putting out a proposition of what that is. And last year, at MGED 3 at Stanford, what we called the Minimum Information about a Microarray Experiment (MIAME) was in fact named by people there the maximum information about a microarray experiment. Because if you look at what you have to collect, it’s basically everything. You want to know your body temperature and your shoe size, and the last time you got a haircut. All of this affects the experiment. There’s a guy at NIH, Lance Dickson, who discovered on hot dry summer days, when he was wearing short-sleeved T-shirts, there was a section of arrays on his printing path that was always bad: When he wore a short-sleeved shirt and reached in to change printing plates on his arrayer, his deodorant particles fell down on the slide and caused a problem. Sorting these things out is a nightmare. But you really have to be aware of them.
Lots of people have been building MIAME-compliant databases. We have also been building tools to make sure people can enter the data into the database. We are working now on making a MAGE-ML [Microarray Gene Expression Markup Language] version of the database available so we can export the data. We have been working to put up SOPs about everything we do. SOPs are standard operating procedures, basically well-defined protocols so that each step in the lab is done the same way as much as possible. And instituting simple things like quality control can really go a long way towards making sure the data is appropriately generated and you can minimize these problems as much as possible. We are also working on a data QC SOP.
Before talking about normalization, I am going to talk about experimental design. This is work we have been doing in collaboration with Gary Churchill at the Jackson Lab, who is a collaborator on the NHLBI project. You really have to think about the experimental design ahead of time because the experimental design is going to drive some of the analysis you want to do. Every normalization you use relies on some assumptions, so you have to know what went into the experimental design in order to be able to do the process appropriately. You have to know the relationship between samples. The design also facilitates comparisons, so if you do enough comparisons, you can gain some statistical power for identifying differentially expressed genes and assigning confidence measures.
The design also has to integrate experimental reality. The typical design people use is to compare biological samples to a common reference, and then do a dye swap. But [in an experiment] where you compare a reference to five experimental samples, the reference sample gets sampled 10 times, while each experimental sample gets sampled twice. Is that a good or bad thing? You’re not getting as much analysis as you can. In [an interconnected loop experiment], for the same 10 hybridizations, you sample each one four times, and then measure the expression relative to the average expression across each array. But if one of the [samples] drops out, the experiment becomes difficult to do.
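The counting argument above can be checked with a short sketch. The layouts here are assumptions for illustration, not the layouts used in the talk: the reference design pairs each of five samples with the reference in both dye orientations, and the loop design pairs each sample with its neighbor around a closed loop, also in both orientations. The function names and sample labels are hypothetical.

```python
from collections import Counter

def reference_design(samples, reference="REF"):
    """Each sample against a common reference, once per dye orientation."""
    hybs = []
    for s in samples:
        hybs.append((s, reference))   # (Cy3 sample, Cy5 sample)
        hybs.append((reference, s))   # dye swap
    return hybs

def loop_design(samples):
    """Each sample against its neighbor around a closed loop, both orientations."""
    hybs = []
    n = len(samples)
    for i, s in enumerate(samples):
        neighbor = samples[(i + 1) % n]
        hybs.append((s, neighbor))
        hybs.append((neighbor, s))    # dye swap
    return hybs

def times_sampled(hybs):
    """Count how many hybridizations each sample takes part in."""
    return Counter(sample for pair in hybs for sample in pair)

samples = ["S1", "S2", "S3", "S4", "S5"]
ref_hybs = reference_design(samples)
loop_hybs = loop_design(samples)
ref_counts = times_sampled(ref_hybs)
loop_counts = times_sampled(loop_hybs)

print(len(ref_hybs), dict(ref_counts))
print(len(loop_hybs), dict(loop_counts))
```

Both layouts use 10 hybridizations. Under these assumptions the reference design spends half of its measurements on the reference, while the loop measures every experimental sample four times.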
The assumption [with this loop experiment] is that the cost is in the array. If I want more consistency, I can do more hybridizations with a loop experiment. But in fact the real cost in experiments is not in the array — although they’re expensive — it is in the sample. In a cancer study, cancer patients aren’t usually willing to grow another tumor so you can get more RNA.
[In a loop design] you need to prepare more hybridizations, so you need more RNA. The good news is that if you use the loop designs and compare them to the reference designs, you can get essentially the same answers out. The loop design gives you more data on each sample so the statistics are better, but it requires more RNA and you have to worry about what you are going to do with a bad sample. But with a reference design, the design is easily expandable. If we’re doing a study in mouse, you know how much RNA you have. But if you are dealing with human patients, a lot of times you have to deal with whatever samples come in. And [with a reference design], if you get more samples, all you have to do is compare them to a reference, and you’ve got a bigger data set you can analyze. The interpretation of the results is easy. In a loop design you have to do an ANOVA calculation to get the result, measuring the ratio of each gene relative to the average which you infer. [In the reference design], everything is measured relative to the same thing. Hopefully in the spring we’ll have some ANOVA tools available. Gary [Churchill] has them on his web site. We are taking them and importing them into Java to make them easier to work with.
From doing these comparisons, basic design principles emerge. Biological replicas are more informative than correlated replicas, so you would like to have independent RNA on independent slides. Even with the same RNA samples, running replicas on slides is good, but having independent replicas across independent slides gives you a slightly better estimate of what the variance is and gives you better control over the final ratios you measure. More replicas are better. For loops, each sample should have as many Cy3 as Cy5 labels. Self vs. self hybridizations are actually good because they add data on reproducibility. But at the end of the day we try to do, with every experiment, at least one replica with the dyes flipped.
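The dye-balance rule for loops (as many Cy3 as Cy5 labels per sample) is easy to check before any RNA is labeled. The following is a minimal sketch, assuming a design is written as a list of (Cy3 sample, Cy5 sample) pairs; the function names and the two example designs are hypothetical.

```python
from collections import Counter

def dye_balance(hybs):
    """Return {sample: (n_cy3, n_cy5)} for a list of (cy3_sample, cy5_sample) pairs."""
    cy3 = Counter(c3 for c3, _ in hybs)
    cy5 = Counter(c5 for _, c5 in hybs)
    return {s: (cy3[s], cy5[s]) for s in sorted(set(cy3) | set(cy5))}

def unbalanced_samples(hybs):
    """List the samples labeled with one dye more often than the other."""
    return [s for s, (n3, n5) in dye_balance(hybs).items() if n3 != n5]

# A closed three-sample loop: every sample is Cy3 once and Cy5 once.
balanced_loop = [("A", "B"), ("B", "C"), ("C", "A")]

# Same three samples, but A is always Cy3 and C is always Cy5.
lopsided = [("A", "B"), ("A", "C"), ("B", "C")]

print(unbalanced_samples(balanced_loop))  # []
print(unbalanced_samples(lopsided))       # ['A', 'C']
```

A check like this only covers the labeling pattern; it says nothing about whether the replicas are biologically independent.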
To analyze our data, we built a tool called MEV. One of the things I started realizing was, we were building a lot of normalization tools into MEV, and we were doing large experiments which had 100 hybridizations, and each hybridization had 32,000 spots. When you loaded all that into a single tool and tried to do anything, [you’d get] hung up pretty quickly. So we created something called MIDAS, the Microarray Data Analysis System. MIDAS runs a series of different steps to filter the data before you do any kind of higher-order analysis. This is something that I would really recommend people think about doing if they are starting to do large-scale experiments. We have been building MIDAS for a couple of months and have a good working prototype. That prototype is designed to be customizable in the future. You define a method for how you analyze your data and then run it on a large number of samples. Right now we have [several] steps we use to process data. One is a signal-to-background filter. Another is a Lowess correction, which you can do globally or grid-by-grid. We also trim the data using the flip dye ratios. That actually does quite a bit of good. And finally we have a replicate filter that is part of that flip dye trim. That turns out to be one of the most useful things we can do for filtering out bad data.
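The processing steps named here can be sketched in a few lines. This is an illustration only, not MIDAS code: every function name, threshold, and default below is an assumption. The local fit is a simplified Lowess-style regression (tricube weights, no robustness iterations), and the flip dye trim assumes both log ratios are computed as log2(Cy5/Cy3) on their own slide, so a real expression change should flip sign between the two slides and the pair should sum to roughly zero.

```python
import math
import statistics

def passes_signal_filter(signal, background, min_ratio=2.0):
    """Keep a spot only if its signal is at least min_ratio times background."""
    return signal >= min_ratio * background

def local_linear_fit(x, y, frac=0.5):
    """Lowess-style fit of y on x: tricube-weighted local lines, no robustness passes."""
    k = max(2, math.ceil(frac * len(x)))       # points in each local window
    fitted = []
    for x0 in x:
        dists = [abs(xi - x0) for xi in x]
        h = sorted(dists)[k - 1] or 1e-12      # window half-width
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dists]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def normalize_log_ratios(cy3, cy5, frac=0.5):
    """Return log2(Cy5/Cy3) per spot with the intensity-dependent trend removed."""
    a = [0.5 * math.log2(r * g) for g, r in zip(cy3, cy5)]   # mean log intensity
    m = [math.log2(r / g) for g, r in zip(cy3, cy5)]         # log ratio
    trend = local_linear_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]

def flip_dye_keep(m_forward, m_flipped, n_sd=2.0):
    """True for spots whose forward and flipped log ratios cancel within n_sd SDs."""
    sums = [f + r for f, r in zip(m_forward, m_flipped)]
    center = statistics.mean(sums)
    sd = statistics.pstdev(sums)
    return [abs(s - center) <= n_sd * sd for s in sums]

# Toy data: Cy5 reads twice as bright as Cy3 on every spot (pure dye bias).
cy3 = [200.0, 400.0, 800.0, 1600.0, 3200.0, 6400.0]
cy5 = [2 * g for g in cy3]
normalized = normalize_log_ratios(cy3, cy5)

# Toy flip dye pair: the fourth spot does not reverse when the dyes are swapped.
forward = [1.0, -0.5, 0.2, 2.0, 0.1, -1.0]
flipped = [-1.0, 0.5, -0.2, 1.5, -0.1, 1.0]
keep = flip_dye_keep(forward, flipped)
```

In this toy data the constant two-fold dye bias is removed entirely, and only the fourth spot fails the flip dye check. A grid-by-grid correction would apply the same fit separately to the spots from each print grid rather than to the whole array at once.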