Public microarray data has been a gold mine for many computational biologists, providing a source for a host of novel discoveries. Many experimental biologists have also been able to uncover data that gives them ideas for further study or that adds support to a novel hypothesis.
We've found a lot of value in microarray repositories, but have also found that what we discover is usually more like ore than pure nuggets. If you're considering doing some chip hybridizations yourself, it's worth first taking a look at some others' data — even if their experimental design isn't ideal. We'll share some of our experiences, as well as the pros and cons of using others' data to help answer your questions. The state of public repositories is always getting better, so don't discount their use even if earlier experiences weren't so good. You don't need to be a computer or statistics whiz to do any of this, although it can be a help with larger datasets.
If you're investigating a very specific system, you've probably already read all the literature and know what large-scale experimental datasets are available, so you're off to a good start. If you're trying to investigate a more general phenomenon (like how some favorite genes are influenced by cancer), there's so much potential array data that it can be hard to know where to go or what to check out. Microarray repositories aren't as monolithic as sequence databases; there's no one place to go that has virtually everything. Instead, array data is scattered among the main repositories (NCBI GEO and EBI ArrayExpress), institutional repositories (like the Stanford MicroArray Database), journal and lab websites, and specialized databases (like Oncomine, specializing in cancer). Each of these is different, and finding what you want, or even finding out whether it exists, can take some time. The better interfaces permit searching or browsing by experiment or by gene. To limit our discussion, we'll assume you're searching for an experiment's worth of data at once, and that you can find what you're looking for: a comparison between your favorite cells in your favorite conditions. What follows are key steps to consider as you approach this work.
What sort of data do you need to download? Hopefully you can get a matrix (tab-delimited spreadsheet) of processed expression values or ratios. You may be able to do a lot with just figures (like heatmaps) generated on the fly, but we usually like to get our hands on the real data. The raw data may be even more helpful if, for example, you'd like to process multiple datasets in the same manner, but that'll require a lot more work, so a quick look at the processed data may tell you if it's worth it. What the numbers actually represent is not always obvious, as sometimes there is no mention that expression values have been log2-transformed during normalization (such as by the RMA and GCRMA algorithms). Just as important as the numbers are the detailed descriptions of the samples used for hybridizations and of all the features on the arrays. These and many more experimental details make up the current MIAME (Minimum Information About a Microarray Experiment) standards; submitting this information is often required, but not always enforced, at the time of publication. Some datasets, like the Expression Project for Oncology, contain a wealth of sample information that can really enhance your analysis.
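When a downloaded matrix doesn't say whether its values are log2-transformed, a quick look at the range of the numbers usually settles it. The Python sketch below is one way to do that, not a standard method: the function names, the layout of the file (feature IDs in the first column, one sample per remaining column), and the cutoff of 100 are all our own assumptions. It applies to intensity-style values, not ratios.

```python
import csv
import io


def read_matrix(handle):
    """Read a tab-delimited matrix: a header row, then one feature per row."""
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]  # assumed layout: first column holds feature IDs
    rows = {}
    for fields in reader:
        if fields:  # skip blank lines
            rows[fields[0]] = [float(x) for x in fields[1:]]
    return samples, rows


def looks_log2(rows, cutoff=100.0):
    """Guess whether intensity values are already on a log2 scale.

    Assumption: raw intensities routinely run into the hundreds or thousands,
    while log2 intensities rarely exceed the high teens. The cutoff is an
    arbitrary choice, so treat the answer as a hint, not a verdict.
    """
    values = sorted(v for row in rows.values() for v in row)
    p99 = values[int(0.99 * (len(values) - 1))]  # approximate 99th percentile
    return p99 <= cutoff


# Tiny made-up example standing in for a downloaded file.
example = (
    "ID\tctrl_1\tctrl_2\ttreated_1\n"
    "probe_a\t7.2\t7.4\t9.1\n"
    "probe_b\t11.0\t10.8\t10.9\n"
)
samples, rows = read_matrix(io.StringIO(example))
print(samples, "log2-scale?", looks_log2(rows))
```

If the guess disagrees with what the dataset description says, trust neither until you've checked the original publication or the processing notes.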
Features on arrays in public databases are not always easy to interpret. The good news is that as our understanding of gene catalogs improves, array annotation does, too. Custom microarrays, especially older ones, can contain out-of-date annotations, and even current oligonucleotide arrays do not connect all sequences to known genes. Many array manufacturers try to include features representing all genes of a species and, as a result, they include many expressed sequence tags that can be very tricky to link to an established gene annotation. Additional array feature details are sometimes also available from the manufacturer, such as Affymetrix's annotation files, but even these can be open to debate. Fortunately, the actual array sequences are usually available, so researchers can re-annotate spots of interest. Time-consuming though that may be, confirming especially important probes in a genome browser can reduce some wrong turns.
Crunching the numbers
With the expression matrix and annotations, it's time to do your actual quantitative analysis. If you downloaded processed data, everything should be normalized and ready to go. A spreadsheet is fine for basic descriptive statistics, but if you want a more rigorous analysis, you can choose from a lot of different array analysis and statistics applications. In the interest of space, we're being very vague about this critical step of array mining, but most microarray analysis resources can provide more information. Public data can give you an idea of the biological variability in your system, which is a big help in figuring out an appropriate sample size when designing your own experiment. A small experiment may show a hint of an interesting observation, in which case you may need to keep looking for a larger study with more statistical power.
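To make the sample-size point concrete, here is a rough Python sketch that turns a variability estimate from public data into a number of arrays per group. It uses the textbook normal-approximation formula for comparing two group means for a single gene, n = 2((z_alpha + z_power) * sd / delta)^2, which is a simplification we chose for illustration and not a replacement for a proper power analysis. The function names and example numbers are made up; values are assumed to be on a log2 scale.

```python
from math import ceil
from statistics import NormalDist, median, stdev


def typical_sd(replicates_by_gene):
    """Median per-gene standard deviation across biological replicates."""
    return median(
        stdev(values)
        for values in replicates_by_gene.values()
        if len(values) >= 2  # stdev needs at least two replicates
    )


def samples_per_group(sd, delta, alpha=0.05, power=0.8):
    """Arrays per group needed to detect a log2 difference of `delta`.

    Normal approximation for a two-sided, two-group comparison of one gene.
    A stricter `alpha` is a crude stand-in for multiple-testing correction.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# Made-up replicate values (log2) for two genes in one condition.
pilot = {"gene_a": [8.0, 9.0], "gene_b": [5.0, 6.0]}
print("typical SD:", round(typical_sd(pilot), 3))
print("arrays per group for a twofold change:", samples_per_group(0.5, 1.0))
```

With a standard deviation of 0.5 and a twofold change (a log2 difference of 1), the approximation asks for about four arrays per group at alpha = 0.05; tightening alpha to 0.001 raises that to about nine.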
Ideally, a single, well-controlled study will lead you to the data you need, but you may find that you would benefit from looking at multiple studies in some comprehensive manner. Combining data from different experiments may better answer your research questions, but it may just end up being a bunch of technical noise that hides any interesting stories. In any case, the selection of datasets and how to combine them is just as important as the design of any laboratory experiment. A lot of uninteresting variation can be introduced by different sample preparations, different sets of microarray probes, and different array normalization. We're often stuck with different RNA purification protocols, but we may be able to choose studies that used the same array design.
The MicroArray Quality Control project recently showed that differentially expressed genes assayed with various array platforms are highly correlated, but we still want to control as much as possible. Array normalization differences can be a big issue, but if they are, we should have access to the raw data that we can then process in a similar manner (thanks again, MIAME). To address this last issue and save others some time and effort, projects like Celsius at the University of California, Los Angeles, were created to collect Affymetrix data from many sources and normalize the raw CEL files using the same series of protocols.
If we do want to combine datasets, how do we do it? Fundamentally, we have to figure out which comparisons give us the most discriminatory power, despite the limitations of a meta-experiment, which is not all within our control. Another way of thinking is, "How can we maximize biological signals in a noisy set of data?" Several state-of-the-art methods have been published, but we'll assume for now that we don't have the expertise to take advantage of these. In an ideal case, perhaps we could simply combine expression values or ratios across all genes and samples. Another choice is to analyze differential expression in each experiment separately and then compare gene lists. A third choice could be to make separate gene lists, determine the biological themes, and then compare those themes. From the first to third choice, we're putting aside more detail in an attempt to make a more reasonable comparison.
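The second choice, comparing gene lists from separately analyzed experiments, can be sketched in a few lines of Python. The example below finds the shared genes and asks how surprising the overlap is with a hypergeometric tail probability. That test is our own addition for illustration, and it assumes both lists were drawn from the same set of measurable genes (for example, the same array design); the function name and gene names are hypothetical.

```python
from math import comb


def overlap_pvalue(list_a, list_b, universe_size):
    """Return the shared genes and P(overlap >= observed) by chance.

    Assumption: both lists come from the same universe of `universe_size`
    genes, and each gene is equally likely to land on a list.
    """
    a, b = set(list_a), set(list_b)
    shared = a & b
    k = len(shared)
    total = comb(universe_size, len(b))  # ways to pick a list the size of b
    tail = sum(
        comb(len(a), i) * comb(universe_size - len(a), len(b) - i)
        for i in range(k, min(len(a), len(b)) + 1)
    )
    return sorted(shared), tail / total


# Made-up differential-expression lists from two experiments.
study_1 = ["g1", "g2", "g3"]
study_2 = ["g2", "g3", "g4"]
shared, p = overlap_pvalue(study_1, study_2, universe_size=10)
print(shared, round(p, 4))
```

A small p-value only says the lists overlap more than chance would predict; it says nothing about whether the shared genes change in the same direction, which is worth checking separately.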
The perceived current state of public microarray data is very much a function of one's research. Whereas investigators working on very specific problems may complain, "There's no good data out there," others may find quite the opposite: there's so much data it's hard to know where to start. If you can find a study that may address your questions, dig right in and start looking for those hidden gems that others may have overlooked.
Fran Lewitter, PhD, is director of bioinformatics and research computing at Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a bioinformatics scientist in Fran's group.