Sampling the Menu of Microarray Software
Nat Goodman recommends a four-course approach to analysis
Hungry for gene-expression analysis solutions? The microarray software BILL OF FARE now offers a delectable assortment of products for your data analysis pleasure.
There’s a product for every pocketbook, ranging from academic freeware to hundred-thousand-dollar enterprise solutions. You can choose from among standalone desktop applications, Web servers, and client-server systems. Or pick the raw ingredients and cook your solution from scratch, like the typical bioinformaticist.
I sampled as many products as I could. Web servers were easiest to savor with my browser. Standalone programs were simple too: I only had to download them.
Client-server programs were nearly impossible to assess because vendors wouldn’t provide such complex software systems to a non-customer. I had to settle for live presentations and demonstrations at vendors’ sites — a less than satisfying way to taste-test.
I was able to graze freely on eight programs — five academic and three commercial — and to see live demos of three other commercial products. The table on page 42 lists these 11 products and includes basic information on three other client-server products that I didn’t try. I hope to repeat these tests periodically and to include additional programs I might have missed: Be sure to tell me about those.
All programs I tested are intended for interactive use by end-users. They place heavy emphasis on data visualization and generally provide no application programming interfaces, making it hard for bioinformaticists to integrate these products into concoctions of their own.
I adopted a standard test procedure to assess whether each program worked, and to get a subjective sense of how well I liked its look and feel. Exceptions are that Spotfire treated me to an in-house demo and training session, and I was already familiar with the GeneSpring product because my company is working with the vendor in support of a mutual client.
The first two programs in the table, Cluster/Treeview by Mike Eisen and GeneCluster from Whitehead, are seminal academic packages that helped launch the entire field. Unfortunately, but not surprisingly, these packages have fallen behind as their authors have moved on to new challenges.
The next three programs, EPCLUST, Expression Browser, and J-Express, are the first of a new breed. They’re all still a little raw, but keep an eye on them: it won’t be long before one or more is ready for market.
EPCLUST is a Web server. A key problem is that its user interface is HTML-based and too clumsy; the program will need a Java client to be competitive.
I couldn’t get Expression Browser to do much even on the example dataset provided by its author. From the little I saw, it seemed to include nice visualizations for large dendrograms (the tree diagrams produced by hierarchical clustering); this is important because typical dendrograms are too big to view on a single computer screen.
J-Express is the clear pick of the bunch. It provides a comprehensive set of clustering methods and informative visualizations that make good use of linked windows. I found the program intuitive and have high hopes for it.
The next three programs — GeneSpring, GeneMaths (formerly GenExplore), and Spotfire Array Explorer — are excellent commercial products with complementary strengths.
GeneSpring is, by many accounts, the market leader. The product is complete and complex. It includes a large number of special-purpose analyses useful for specific kinds of experiments, and the company seems committed to adding more such methods as customers demand. The flip side is that I found it hard to use the product for analyses that didn’t fit the GeneSpring recipe. Also, I found the user interface to be counterintuitive, with important capabilities buried in hard-to-find or poorly named menus.
GeneMaths offers a good balance to GeneSpring. It, too, is reasonably complete but far less complex. I found the user interfaces and visualizations to be a lot more user-friendly.
Spotfire’s clear strength is in visualization. The program makes good use of linked windows and provides intuitive ways to use the product’s powerful, general-purpose visualization tools to look at gene expression. The drawback of the program is its limited assortment of clustering tools compared to GeneSpring and GeneMaths.
Client is served
As for client-server products, the version of Lion’s ArrayScout that I saw was still in beta and didn’t demo very well. It has since been released with improvements.
Both Resolver and Expressionist demoed well. The products appear to be on a shelf above the desktop products, offering more sophisticated analyses and more polished user interfaces. I was especially impressed with Resolver’s use of statistical error models to establish confidence levels for all analytical results. I was also pleased to see a recent press release announcing a partial opening of its API. Expressionist is notable for its wide assortment of intuitive, statistical methods for deciding whether genes are differentially expressed across experiments and its technology-specific quality controls. I can’t judge whether my favorable impressions would survive the scrutiny of hands-on testing, or whether the benefits of these products justify their hefty price tags.
For the products I actually tested, I challenged the programs with a small dataset containing expression levels for about 1,900 genes across eight experiments, and a medium dataset of 5,600 genes across eight experiments. I also tried some of the programs on a large dataset of 13,000 genes across 17 experiments. The table on this page summarizes my findings for those aspects of the programs that could be readily tabulated.
Typical data analysis is undertaken in four courses. Normalization and filtering are your soup and salad. Clustering is the main course, followed by biological interpretation for dessert.
Normalization copes with experimental variability and the large dynamic range of expression levels for different genes. Filtering eliminates data that are clearly uninteresting, typically because the gene does not vary across experiments. Clustering seeks to find patterns in filtered data, often by grouping genes or experiments into classes based on the similarity of their expression profiles. The final step, biological interpretation, aims to relate patterns to biological hypotheses. This step is mostly manual and requires considerable scientific creativity. Programs help by providing convenient connections to external biological databases.
Setting the table
The first block in the table on page 44 tells whether the program worked on the sample datasets. All but one worked on the small dataset; the one that didn’t was Expression Browser. All but two (Expression Browser and J-Express) handled the medium dataset, but analyses were slower and visualizations less effective. None of the programs worked the first time on the large dataset, but I got several to work by adjusting the settings.
The second block of the table looks at normalization options. Programs vary considerably here and the table just gives a flavor of what each product offers. My sense is that it’s wiser to do normalization in separate code outside the packages. The provided normalizations are pretty simple anyway and easy to program. If you’re planning to use multiple packages, you’ll need to normalize on the outside so that your data are the same for all packages.
Normalization is a hot research area now, and you may want to incorporate new methods without waiting for the vendors. An excellent resource for this and other aspects of microarray statistics is Terry Speed’s site, http://www.stat.berkeley.edu/users/terry/zarray/Html/index.html.
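To give a sense of how little code the simple normalizations take, here is a minimal Python sketch of two of the options listed in the table: a log transform followed by centering each experiment to its median. The function names, the floor value used to avoid taking the log of zero, and the sample numbers are all assumptions made for illustration; none of them come from the packages reviewed here.

```python
import math
from statistics import median

def log2_transform(matrix, floor=1.0):
    """Log2-transform every value; values below `floor` are clamped first.

    The clamp is an illustrative assumption to avoid log of zero or negatives.
    """
    return [[math.log2(max(value, floor)) for value in row] for row in matrix]

def center_columns_to_median(matrix):
    """Subtract each experiment's (column's) median so every column has median 0."""
    column_medians = [median(column) for column in zip(*matrix)]
    return [[value - m for value, m in zip(row, column_medians)] for row in matrix]

# Rows are genes, columns are experiments (made-up expression levels).
raw = [
    [8.0, 16.0],
    [32.0, 64.0],
    [128.0, 256.0],
]
normalized = center_columns_to_median(log2_transform(raw))
```

Because the normalized matrix is produced outside any one package, the same values can then be exported and loaded into each program being compared.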
I didn’t include filtering options because programs vary too much. Again, it makes sense to do it outside the programs.
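A filter of the kind described earlier, one that drops genes that barely vary across experiments, might look like the following sketch. The standard-deviation cutoff of 0.5, the function name, and the gene names are hypothetical; a real cutoff would depend on how the data were normalized.

```python
from statistics import pstdev

def filter_flat_genes(profiles, min_sd=0.5):
    """Keep only genes whose standard deviation across experiments is at least `min_sd`.

    `profiles` maps a gene name to its list of (normalized) expression values.
    The default threshold is an arbitrary illustrative choice.
    """
    return {gene: values for gene, values in profiles.items()
            if pstdev(values) >= min_sd}

# Made-up profiles: geneA is nearly flat, geneB clearly varies.
profiles = {
    "geneA": [0.0, 0.1, -0.1, 0.0],
    "geneB": [-2.0, 0.0, 2.0, 0.0],
}
kept = filter_flat_genes(profiles)
```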
The next block in the table looks at clustering methods. J-Express offers the most complete set, although two unique methods are still under development.
The following two blocks look at technical aspects of clustering methods. The first is a list of schemes provided for measuring the similarity of expression profiles. (One caveat is that numerous options exist for each similarity measure and different programs may offer different ones.)
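For readers who want to see what three of these measures compute, here is a plain Python sketch of correlation (taken here to be the Pearson coefficient, which is an assumption, since the packages may use other variants), Euclidean distance, and city block distance. The two sample profiles are invented.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    """Sum of absolute differences (also called Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation; undefined (division by zero) if either profile is flat."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / math.sqrt(var_a * var_b)

# Two made-up profiles with the same shape but different magnitude.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [2.0, 4.0, 6.0, 8.0]
r = pearson(gene_x, gene_y)          # shape-based: treats the two as identical
d_euclid = euclidean(gene_x, gene_y)  # magnitude-based: treats them as far apart
d_block = city_block(gene_x, gene_y)
```

The example shows why the choice matters: correlation rates these two genes as perfectly similar, while both distance measures put them well apart.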
The block to the right of that one lists the methods used to aggregate the multiple profiles in a cluster for purposes of similarity computations. The information in these two blocks applies to hierarchical clustering only; the programs generally support fewer options for other clustering methods.
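The three simplest aggregation methods in the table can be sketched as follows, assuming that similarity is measured as Euclidean distance and that a cluster is just a list of profiles. The function names and sample clusters are illustrative only and do not reflect how any reviewed program implements them.

```python
import math
from itertools import product

def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(cluster_1, cluster_2, method="average"):
    """Distance between two clusters, aggregated over all cross-cluster pairs.

    single   -> minimum pairwise distance
    complete -> maximum pairwise distance
    average  -> mean pairwise distance
    """
    distances = [euclidean(a, b) for a, b in product(cluster_1, cluster_2)]
    if method == "single":
        return min(distances)
    if method == "complete":
        return max(distances)
    if method == "average":
        return sum(distances) / len(distances)
    raise ValueError(f"unknown method: {method}")

# Two made-up clusters of two profiles each.
cluster_a = [[0.0, 0.0], [0.0, 1.0]]
cluster_b = [[3.0, 0.0], [4.0, 0.0]]
single = cluster_distance(cluster_a, cluster_b, "single")
complete = cluster_distance(cluster_a, cluster_b, "complete")
average = cluster_distance(cluster_a, cluster_b, "average")
```

Hierarchical clustering repeatedly merges the two closest clusters, so the aggregation method decides which pair counts as closest at each step.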
I did not test the ability of the products to support biological interpretation. Most are able to connect to external biological databases and import information about genes of interest. Each product does it differently though, and there was not time to get them all configured to test this feature. This is a critically important aspect of the programs, which I’ll address in later test rounds.
Looking at the table as a whole, the take-home message is clear: the vendors are selecting their ingredients from the same grocery list, and no program is clearly dominant.
The three commercial programs that I tested are all winners with complementary strengths. If you can afford it, you should put all three in your shopping bag.
The academic J-Express is promising, and it too belongs in your bag.
Resolver and Expressionist look impressive as well, but I can’t be sure, as the vendors let me look but not taste.
Perhaps it’s a mistake, though, to focus too much on the current products. Microarray informatics seems destined to repeat the history we experienced with sequence informatics: Lots of new methods are coming from academia, and to exploit these advances quickly, you have to grab the ingredients and prepare the meal yourself. If this happens, the importance of commercial products will diminish as we’ve seen so many times before.
|Product||Developer or vendor||Academic (A) or commercial (C)||Type||Version||URL|
|Cluster / Treeview||Mike Eisen, Lawrence Berkeley National Laboratory||A||Desktop||22.214.171.124 / 126.96.36.199||http://rana.lbl.gov|
|EPCLUST||Jaak Vilo, European Bioinformatics Institute||A||Web-server||0.9.06||http://ep.ebi.ac.uk|
|J-Express||Bjarte Dysvik and Inge Jonassen, University of Bergen. Part of GeneX suite from US National Center for Genome Resources.||A||Desktop||current on Jan 21, 2001||http://www.ii.uib.no/~bjarted/jexpress. See also http://www.ncgr .|
|Expression Browser||Matthew Pocock, Sanger Centre||A||Desktop||current on Jan 21, 2001||http://www.sanger.ac.uk/|
|GeneMaths (formerly GenExplore)||Applied Maths||C||Desktop||1.0||http://www.applied-maths.com|
|Array Explorer / Spotfire.net||Spotfire||C||Desktop||3.0 / 5.1||http://www.spotfire.com|
|GeneLinker||Molecular Mining||C||Server||Web visit||http://www.molecularmining.com|
|GenoMax Expression Analysis||InforMax||C||Server||Web visit||http://www.informaxinc.com|
Info from tables on page 44
SPOTFIRE, GENEMATHS, GENESPRING, J-EXPRESS, EXPRESSION BROWSER, EPCLUST, GENECLUSTER, CLUSTER/TREEVIEW
academic vs. commercial
Did program work on dataset? small, medium, large
Normalization options: log transform, center to mean, center to median, center to constant, scale based on variation, use controls, use per-gene references
Clustering methods: Hierarchical - genes, Hierarchical - experiments, K-means, Self-organizing map (SOM), Principal components analysis (PCA), Profile search, Sammon's map, Karhunen-Loeve transform
Similarity measures: Correlation, Euclidean distance, City block distance, Spearman, Special for continuous, Special for binary, Chord
Aggregation methods: Average, Min (single), Max (complete), Weighted average, Ward, Neighbor joining