In his first gene expression analysis experiment, NAT GOODMAN finds that gene chip informatics has moved from bleeding edge to leading edge.
Gene chip informatics tools have matured to the point that you can now do useful work without having to become a guru. Laboratory technology is solid enough that you’re not constantly fighting data quality battles, and there are workable solutions to most major informatics problems.
While conducting my own first gene-chip project recently, I kept notes from experiments I conducted using two datasets and two software packages.
One set of data I used came from the spectacular study of gene expression in breast cancer conducted by the Brown and Botstein laboratories at Stanford University (published in Nature August 17 and reviewed in Science September 8).
The study used spotted cDNA microarrays to measure expression levels for about 8,100 genes in 65 surgical specimens (mostly tumors) from 42 patients — about 525,000 data points in all. The authors have generously made their dataset publicly available on the Web. It is a wonderful resource for getting started.
I also used as-yet-unpublished data from an ongoing study of a mouse model of Huntington’s Disease. Jim Olson of the Fred Hutchinson Cancer Research Center coordinated the study among his and several other laboratories. They use the Affymetrix platform to measure expression levels for about 6,600 genes for wild type vs. HD-affected animals in cell types that are resistant to the disease vs. ones that are susceptible, across a range of time points in disease progression, and for several therapeutic interventions; each test case will be run twice, and preferably thrice if the money holds out.
Dr. Olson kindly gave me access to the dataset from the ground stage of the project, which contains data for wild type vs. affected animals in two cell types at one time point with no therapeutic intervention. With each test case run twice, this adds up to about 50,000 data points.
Two other great sites for getting started are the Stanford and EBI microarray sites (see table).
Chip Mining Software
I asked gene chip users around the industry what software packages they like and got a pretty consistent set of answers (see list). I arbitrarily chose two.
One is Cluster and its companion TreeView. Michael Eisen of Stanford University developed this leading academic package, which has been used in several groundbreaking projects including the Stanford breast cancer study.
I also checked out Spotfire’s package: Spotfire.net and Array Explorer, a widely used data mining and visualization product augmented with a gene chip plug-in.
Both tools are Windows programs without Unix versions. I have not yet managed to get either program working really well on the Huntington’s dataset: Cluster processed the data fine, but TreeView choked when asked to visualize the results; Spotfire did a fine job at visualization, but Array Explorer wouldn’t do the analysis.
These programs come so highly recommended that I am willing to bet that the problem is due to my lack of proficiency rather than to bugs in the software, but this serves as a cautionary note that these packages come with a steep learning curve.
Cluster analysis is probably the best-known method for analyzing gene chip datasets. The idea is to group together subsets of the data, either genes or samples, on the basis of “similarity.” The breast cancer study, for example, used hierarchical clustering to show that samples taken from the same patient were generally more similar than samples from different patients, and then to classify the tumors into distinct sub-groups, some of which were correlated with clinical outcomes. In the Huntington’s study, the obvious use for clustering is to identify genes whose expression levels vary in a coordinated way as the disease progresses.
Cluster analysis is a standard problem in mathematical data analysis and the methods used for gene chips are variants of well-known techniques. The user manual for Cluster has a nice overview of the major techniques including hierarchical, k-means, and self-organizing maps.
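To make the idea of hierarchical clustering concrete, here is a minimal sketch in Python. It is not how Cluster itself is implemented; the choice of average linkage, the use of 1 minus the Pearson correlation as the distance, and the toy expression values are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def hierarchical_cluster(profiles):
    """Average-linkage agglomerative clustering.

    profiles maps a name to a list of expression values.
    Returns the merges in order, each as (members_a, members_b, distance).
    """
    names = list(profiles)
    # Assumed distance: 1 - Pearson correlation between two profiles.
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names if a != b}
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                total = sum(dist[(a, b)] for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Invented toy profiles: two rising genes and two falling genes.
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.5, 4.0, 2.0],
}
merges = hierarchical_cluster(toy)
for left, right, d in merges:
    print(left, "+", right, "at distance", round(d, 4))
```

The sequence of merges is what a tree viewer draws as a dendrogram: the two rising genes join first, then the two falling genes, and the two groups only meet at a large distance. The all-pairs search is fine for a toy example but would be far too slow for thousands of genes.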
I also found myself doing a lot of routine data analysis. For example, with the Huntington’s dataset, I wanted to explore the consistency of the two runs of each test case. A natural way to do this is to draw a scatter plot of the replicas, and see if the points fall nicely along a straight line (which they do).
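The "do the points fall along a straight line" check can also be put into numbers. The sketch below, in Python, fits an ordinary least-squares line to the two runs and reports the squared correlation; the function name and the replicate values are invented for illustration and are not taken from the Huntington's dataset.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys on xs.

    Returns (slope, intercept, r_squared). Assumes xs and ys each vary.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Invented expression levels for five genes in two runs of one test case.
run1 = [120.0, 35.0, 980.0, 15.0, 410.0]
run2 = [131.0, 30.0, 1010.0, 18.0, 395.0]

slope, intercept, r_squared = fit_line(run1, run2)
print("slope", round(slope, 3), "intercept", round(intercept, 1),
      "r^2", round(r_squared, 4))
```

A slope near 1, an intercept near 0, and an r-squared near 1 would be the numerical counterpart of replicate points hugging the diagonal of the scatter plot.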
I then analyzed how the standard deviation varies as a function of the mean (it’s proportional to the mean, as one would expect). Next step is to fit the data to a regression line and use this to flag replicas whose variation is unreasonably large. The range of such analyses is virtually limitless, and it makes sense to use a general mathematical package, such as Matlab.
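As a rough illustration of the flagging step, the Python sketch below computes a mean and standard deviation for each pair of runs, fits a line through the origin (an assumption that follows from the standard deviation being proportional to the mean), and flags genes whose spread is well above the fitted value. The cutoff factor of 3 and the data are hypothetical.

```python
import math

def mean_and_sd(a, b):
    """Mean and sample standard deviation of two replicate values."""
    mean = (a + b) / 2.0
    sd = abs(a - b) / math.sqrt(2.0)  # sample SD reduces to this for n = 2
    return mean, sd

def flag_noisy(pairs, factor=3.0):
    """Fit sd = slope * mean by least squares through the origin, then
    flag genes whose sd exceeds factor times the fitted sd.

    pairs maps a gene name to its two replicate values.
    factor is a hypothetical cutoff, not a value from the article.
    """
    stats = {gene: mean_and_sd(a, b) for gene, (a, b) in pairs.items()}
    num = sum(m * s for m, s in stats.values())
    den = sum(m * m for m, s in stats.values())
    slope = num / den
    flagged = [gene for gene, (m, s) in stats.items() if s > factor * slope * m]
    return slope, flagged

# Invented replicate pairs; g4 is deliberately inconsistent.
pairs = {
    "g1": (100.0, 110.0),
    "g2": (200.0, 190.0),
    "g3": (400.0, 420.0),
    "g4": (300.0, 150.0),
    "g5": (50.0, 52.0),
}
slope, flagged = flag_noisy(pairs)
print("fitted slope:", round(slope, 3), "flagged:", flagged)
```

One caveat with this simple fit is that the outliers themselves pull the slope upward, so a more careful analysis might refit after removing flagged genes or use a robust estimator.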
I was amazed to learn that statistical issues have not been a central theme in this field to date. People don’t generally put error bars on their results, for example. The rationale seems to be that the data quality has traditionally been too low to permit robust statistical modeling (which makes me wonder how it was good enough to get published, but that’s another story).
I admit I’m a statistical airhead, but the data I’m working with looks fine — easily good enough to support statistical rigor.
Plotting the Pie
Visualization is another piece of the informatics pie. This is where Spotfire shines. The program provides a palette of graph types and allows you to select which columns of data to graph on each axis. With a few mouse clicks, I was able to display scatter plots of the replicas for each of the four experimental cases in the Huntington’s dataset.
The program has a nice graph type, the profile chart, for visualizing entire expression profiles. You choose which columns you want to plot in what order and the program draws a line chart connecting the dots in the given order. For example, I chose to plot wild type/susceptible first, then HD-affected/susceptible, then wild type/resistant, then HD-affected/resistant.
When displaying the entire dataset, I saw a huge smear at the bottom of the chart reflecting the large number of lowly expressed genes. But there was also a reasonable number of lines poking up from the smear, representing genes whose expression level rose above the mass in at least one case.
A unique strength of Spotfire is that you can interactively control what data are displayed by adjusting controls on the screen. By adjusting the sliders, I could suppress the display of genes that were low throughout, and focus on genes that were high in susceptible cells and low in resistant ones (I found 64 of these), and genes that were low in susceptible cells and high in resistant ones (43 of these).
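The slider-based selection amounts to a pair of threshold filters over the four columns of each profile. A minimal Python sketch of that logic follows; the cutoff values, the column order, and the toy profiles are assumptions for illustration, and nothing here reflects how Spotfire works internally.

```python
# Assumed column order, matching the profile chart described above:
# (wild type/susceptible, HD-affected/susceptible,
#  wild type/resistant,   HD-affected/resistant)

def classify(profile, low_cutoff, high_cutoff):
    """Label a four-column profile, or return None if it fits neither pattern."""
    susceptible, resistant = profile[:2], profile[2:]
    if min(susceptible) >= high_cutoff and max(resistant) <= low_cutoff:
        return "high in susceptible, low in resistant"
    if max(susceptible) <= low_cutoff and min(resistant) >= high_cutoff:
        return "low in susceptible, high in resistant"
    return None

# Invented expression levels.
toy = {
    "gene1": (900.0, 850.0, 40.0, 55.0),
    "gene2": (30.0, 45.0, 700.0, 820.0),
    "gene3": (20.0, 25.0, 30.0, 15.0),    # low throughout: suppressed
    "gene4": (500.0, 60.0, 480.0, 70.0),  # varies by genotype, not cell type
}

groups = {}
for gene, profile in toy.items():
    label = classify(profile, low_cutoff=100.0, high_cutoff=400.0)  # hypothetical cutoffs
    if label is not None:
        groups.setdefault(label, []).append(gene)

for label, genes in groups.items():
    print(label, "->", genes)
```

Moving a slider corresponds to changing one of the two cutoffs and rerunning the filter, which is why the gene counts shift as the controls are adjusted.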
People rave about Spotfire, and while I agree it’s a keeper, I had some minor frustrations. For instance, I couldn’t figure out how to precisely select the ranges of data to be displayed, or how to control the bin sizes in histograms to display distributions accurately. My biggest frustration was that you can’t do serious computation within the program: if you need to compute something complicated, you have to step outside the program to compute and re-import the dataset.
I’d like to see a marriage of Spotfire’s visualization prowess with Matlab’s mathematical sophistication — now that would be a great datamining product! These difficulties may reflect my inexperience with Spotfire, but I urge you to try it on a serious example before deciding whether it’s right for you.
You won’t be surprised to hear that data slinging is a big headache when working with so many datasets and programs. I wrote the usual pile of Perl scripts to import data from the two studies, convert them to a common format, combine and extract data, and export datasets to the various programs in the formats they demand. Not a big deal, but it consumes a lot of time, especially since bugs in these mundane scripts wreak havoc on everything you do downstream.
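For a flavor of what such a conversion script does, here is a small sketch. It is written in Python rather than Perl, and the tab-delimited layout, the column names, and the (gene, sample, value) target format are all hypothetical; the real files from the two studies may look quite different.

```python
import csv
import io

def wide_to_long(text, id_column):
    """Convert a tab-delimited table with one row per gene and one column
    per sample into (gene, sample, value) records, skipping blank cells.

    A non-numeric cell raises ValueError, so a malformed file fails loudly
    instead of silently corrupting downstream analysis.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    records = []
    for row in reader:
        gene = row[id_column]
        for sample, value in row.items():
            if sample is None or sample == id_column:
                continue  # extra unnamed cells, or the identifier itself
            if value is None or not value.strip():
                continue  # missing measurement
            records.append((gene, sample, float(value)))
    return records

# Invented input with one missing measurement.
example = "GENE\tsample1\tsample2\nabc\t1.5\t\nxyz\t2\t3\n"
records = wide_to_long(example, id_column="GENE")
for record in records:
    print(record)
```

Once every source is reduced to the same record shape, combining datasets and exporting them to each program's preferred layout become separate, smaller steps that are easier to check.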
I see a real need for a simple data management tool that can import and export data in the common formats, and let you slice and dice the dataset at will. There are several gene chip databases out there, but I haven’t looked at these yet.
At the end of all this work, you end up with a list of tens to hundreds of candidate genes. Presumably the next step is to figure out what the genes do and decide which ones look interesting from a biological standpoint. For characterized genes, this is just an exercise in database lookup, but for ESTs you’re going to have to do some sequence analysis. Boy, it sure would be nice if someone would build a gene-centric database that pulls this information together once and for all; the “Field of Genes” I talked about last month would really hit the spot!
Gene chip informatics has turned the corner and is ripe for new people to get involved. The edges are still rough, but that’s the way it goes in a field that moves as fast as bioinformatics. A complete solution will include capabilities for cluster analysis (e.g., Eisen’s Cluster program), general mathematical analysis (e.g., Matlab), visualization (e.g., Spotfire), and data management.
Some products claim to be complete solutions, but I suspect most people will cobble together systems from piece parts using Perl as the glue. This is standard fare in bioinformatics and should pose no special threat.
Though the technology is still changing rapidly, it’s safe to get started with the components that are available today and go with the flow.
Users’ Favorites: Six Recommended Tools for Microarray Mining
Cluster & TreeView Michael Eisen, Stanford University http://www.microarrays.org
Spotfire.net & Array Explorer — Spotfire http://www.spotfire.com
Gene Spring — Silicon Genetics http://www.sigenetics.com
GeneMaths — Applied Maths http://www.applied-maths.com
Expressionist — GeneData http://www.genedata.com
Resolver — Rosetta http://www.rii.com
Cyberspots for Gene Expression Analysis
Stanford microarray site
EBI microarray site
Stanford breast cancer data