You’ve just generated or downloaded a big matrix of microarray data. How are you going to begin to make sense of it? Although plenty of statistical methods can help address this general question, at some point you might want to just take a look at the data. To get a broad picture of any potential patterns in the data, nothing beats a big heat map of the matrix, after clustering by genes and/or experiments. Fortunately, this is quite easy to do with any of several free, open source microarray applications. We review a few of the best, along with some ways to make the most of them. We’ll use the example of expression data, but we’ve found that these tools are good at ordering and viewing many very different types of data. After exploratory clustering, and keeping in mind the rationale for doing the experiment, we’ll hopefully have a better idea about the directions our subsequent analysis should take.
Before getting to the really fun part, we often need to filter and transform our data to reduce its size and make the measurements more comparable. These steps can be performed by our clustering applications, but they’re also straightforward to do in any spreadsheet or matrix-friendly environment. Since a bunch of genes may not be expressed in our selected cells or conditions, we can safely get rid of those first. We can keep going with the rest of the genes, but if we can figure out a way to further filter out not-so-interesting genes (like those that aren’t differentially expressed), it might be worth dropping those too. After some normalization, if not previously applied, and calculation of log2-transformed ratios, we’re ready for clustering. Our general aim in clustering is simply to reorder or group genes and/or experimental conditions so that similar patterns of expression appear near each other, letting us more intuitively see biologically interesting variation.
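For readers who prefer a script to a spreadsheet, the filtering and log2 steps above might look something like the sketch below. It is only an illustration: the function names, the two thresholds, and the layout of the input (a gene label mapped to sample/reference intensity pairs) are our own assumptions, and normalization is assumed to have been done already.

```python
import math

def log2_ratios(pairs):
    """Convert (sample, reference) intensity pairs to log2(sample / reference).

    Assumes every intensity is positive; zeros would need special handling.
    """
    return [math.log2(sample / reference) for sample, reference in pairs]

def filter_and_transform(data, min_intensity=50.0, min_abs_log2=1.0):
    """Return {gene: [log2 ratios]} for genes that pass both filters.

    data maps a gene label to a list of (sample, reference) intensity pairs,
    one pair per experiment. Both thresholds are illustrative assumptions,
    not recommended values.
    """
    kept = {}
    for gene, pairs in data.items():
        # Filter 1: drop genes that never rise above the intensity floor
        # (a crude stand-in for "not expressed").
        if all(max(s, r) < min_intensity for s, r in pairs):
            continue
        ratios = log2_ratios(pairs)
        # Filter 2: drop genes whose ratio never reaches the fold-change
        # threshold (a crude stand-in for "not differentially expressed").
        if max(abs(x) for x in ratios) < min_abs_log2:
            continue
        kept[gene] = ratios
    return kept

# Hypothetical toy data: three genes, two experiments each.
example = {
    "geneA": [(400.0, 100.0), (100.0, 100.0)],  # 4-fold up in experiment 1
    "geneB": [(10.0, 12.0), (11.0, 9.0)],       # barely detected
    "geneC": [(210.0, 200.0), (190.0, 200.0)],  # expressed but flat
}
result = filter_and_transform(example)
print(result)
```

With this toy input only geneA survives; geneB fails the intensity floor and geneC fails the fold-change filter. A real analysis would more likely use a statistical test than a fixed fold-change cutoff.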
A Few Good Options
The original clustering/visualization package is Michael Eisen’s Cluster and TreeView, both created more than 15 years ago. The sensibly named Cluster application can do hierarchical clustering, self-organizing maps, and even principal component analysis. The input file format is a simple matrix with a header line describing samples and a first column of gene labels. Starting with data like a SOFT file from the NCBI Gene Expression Omnibus repository, you just need to delete the top annotation lines, and you’re ready to go. The program writes text files containing the reordered matrix and, for hierarchical clustering, files describing the tree (dendrogram) structures, which can then be visualized with TreeView. Be sure to check out the user manual if you can’t decide which distance measure or clustering variation to use. The pair of programs works fine and is still popular, but the main drawback is that they are available only for the Windows operating system.
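Deleting the annotation lines from a SOFT file can be done by hand, but a few lines of script do the same job. The sketch below assumes the usual SOFT convention that annotation lines begin with `^`, `!`, or `#`; the function name and the sample lines are hypothetical.

```python
def soft_to_matrix_lines(lines):
    """Keep only the data-table rows from SOFT-style text lines.

    Assumption: every annotation line starts with '^', '!' or '#',
    and everything else is the tab-delimited table (header included).
    """
    kept = []
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped.strip():
            continue  # skip blank lines
        if stripped[0] in "^!#":
            continue  # skip annotation and table-boundary lines
        kept.append(stripped)
    return kept

# Hypothetical SOFT-like content, for illustration only.
sample_soft = [
    "^DATASET = GDS_EXAMPLE\n",
    "!dataset_title = hypothetical example\n",
    "#ID_REF = probe identifier\n",
    "!dataset_table_begin\n",
    "ID_REF\tIDENTIFIER\tGSM_A\tGSM_B\n",
    "1001_at\tGENE1\t7.1\t8.3\n",
    "1002_at\tGENE2\t5.0\t4.8\n",
    "!dataset_table_end\n",
]
matrix_lines = soft_to_matrix_lines(sample_soft)
print("\n".join(matrix_lines))
```

What remains is a header line followed by one row per gene. Depending on the program, an extra label column such as IDENTIFIER may need to be removed or renamed before loading.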
With the good points and major drawback of Cluster/TreeView in mind, other developers re-implemented and enhanced these programs to create Cluster 3.0 and Java TreeView. Both of these run on Windows, Macintosh, and Linux, so hardly anyone is left out. Cluster 3.0, created by Michiel de Hoon, has the same choices for filtering, transforming, and clustering that are present in Cluster; in addition, it permits k-means clustering to split your data into k clusters, where k is a number chosen by you. If you want to process datasets that are big in size or number (or you’re a traditionalist about Unix interfaces), Cluster 3.0 can also be installed as a command-line version. Java TreeView, written by Alok Saldanha, does a great job at visualizing the clustered data. As with the other cluster visualization programs, you can change the usual red-green color scheme (unless you want to make it difficult for any colorblind competitors). Browsing the data, you can zoom in based on branches of the dendrogram, see underlying data by mouseover, and link clicks to Web addresses for more information. One of our favorite features is a slider that lets you optimize the contrast between your two-color extremes. Sometimes we choose to organize the data using our own methods (bypassing Cluster 3.0), and it’s quite easy to create a CDT (“clustered data table”) file that Java TreeView can read. Images can be created in bitmap and PostScript formats, and dendrograms can be included if desired.
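To make the idea of k-means concrete, here is a bare-bones sketch of the general algorithm applied to log-ratio profiles. It is not Cluster 3.0’s implementation: the function names are ours, the distance is plain squared Euclidean, and the centers are seeded with the first k profiles purely so the toy result is reproducible (real tools typically use random starts and repeat the run several times).

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_profile(profiles):
    """Column-wise mean of a non-empty list of profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def kmeans(profiles, k, max_iter=100):
    """Assign each profile to one of k clusters; return the cluster indices."""
    # Assumption: seed the centers with the first k profiles (deterministic).
    centers = [list(p) for p in profiles[:k]]
    assignment = None
    for _ in range(max_iter):
        # Step 1: give each profile to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in profiles
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Step 2: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = mean_profile(members)
    return assignment

# Hypothetical log2-ratio profiles for four genes across three experiments.
profiles = [
    [2.0, 1.8, -0.1],
    [-1.9, -2.1, 0.2],
    [2.2, 2.1, 0.0],
    [-2.0, -1.7, 0.1],
]
labels = kmeans(profiles, k=2)
print(labels)
```

Here the two induced genes end up in one cluster and the two repressed genes in the other. Because the answer depends on k and on the starting centers, it is worth trying several values of k before reading much into the groups.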
Besides the above packages exclusively for clustering and visualization, another open source option is the MultiExperiment Viewer, or MeV, created at The Institute for Genomic Research (TIGR). MeV is one of the four standalone programs in TIGR’s TM4 suite and can do lots of microarray analysis, including clustering. Even with all of its functionality, MeV has an intuitive interface, and it’s easy to get started, even without reading the very informative manual. Data can be loaded in a variety of formats, including GPR files from a two-color scanner. Designed primarily for two-color data, MeV can also load intensity data including Affymetrix probe set summaries, which are converted to ratios using a method of your choice. The usual types of clustering, distance metrics, and algorithms are available. The heat map visualization is missing some of the flexibility found in Java TreeView. On the other hand, clustered data can also be displayed in a clever type of bar chart, and clustering methods that produce a predefined number of clusters can be displayed as overlapping line graphs. Also, we like the opportunity to input an annotation file and then get the choice of labeling our genes with any of those annotation fields. You can run your data through a series of analyses, and all results are saved as projects for retrieval even at a later date. And while you’re clustering, you may be tempted to try out some statistics like t-tests, ANOVA, or SAM, or even draw a network, volcano plot, or a cool “expression terrain map.”
If we want to cluster a data matrix and display the results as a heat map, all of these programs (Cluster/TreeView, Cluster 3.0/Java TreeView, and MeV) do a great job, and you don’t need to be computer savvy to get started. We prefer Cluster 3.0/Java TreeView for these specific tasks; they’re powerful, flexible, and can generate publication-quality figures. If we want to display our data in a figure other than a heat map, or perhaps do some statistics in the same environment, we head to MeV. And because these programs are open source, it’s good to know that we can try adding our own special options or analyses.
In summary, these applications are all very good at organizing and displaying a large matrix of data. And hopefully somewhere in that heat map are gene expression patterns that will lead us to a better understanding of the mechanisms behind our favorite cellular process — the reason we did that big experiment in the first place.
Fran Lewitter, PhD, is director of bioinformatics and research computing at Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a bioinformatics scientist in Fran’s group.