New microarray software is pouring out of the kitchen like dim sum carts at a Chinese brunch. So many exotic looking morsels — which ones to try? If you’re a novice like me, you can’t even tell fish from fowl, beef from pork, main course from dessert.
That’s OK at brunch, where the worst that could happen is you bite into something you hate. But it’s not so good when you’re picking microarray software, every piece of which takes a lot of time to find, install, and try out.
A year ago in this column I conducted taste-tests of eight microarray products and reviewed six more as a mere observer. In this, my second annual review of microarray software, I’ll describe and list the main ingredients of 41 packages.
Dumplings and chicken’s feet
Some of the products are complete meals with dishes for each of the four courses I defined last year — normalization, filtering, clustering (which I now prefer to call numerical analysis), and biological interpretation — while others are just nibbles that satisfy specific analytical needs. There are desktop products intended for individual diners, and enterprise products that are banquets for the whole company. For the daring gastronome, the spiciest software tools are academic packages that implement the latest and greatest ideas of leading researchers — these can turn an ordinary meal into a special feast.
Most full-course products offer a bland, standard bill of fare. For normalization, they offer various scaling options to correct for hot and cold chips, accompanied by a platter of simple mathematical operations (such as taking logarithms, calculating ratios or fold change, and centering) for changing how the data look.
For filtering, they let you discard data whose expression levels or changes are too low for your appetite, or filter on properties like the functional classification of a gene.
For numerical analysis (main course), most products serve profile search (which lets you find expression profiles similar to a given one), clustering (hierarchical and otherwise), and principal components analysis (PCA). Several products also entice you with classification methods such as neural nets or support vector machines. For dessert (biological interpretation), most products provide links to external gene sequence or function databases, or permit users to establish such links.
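To make the standard fare concrete, here is a minimal Python sketch of three of the staples named above: log ratios, centering, and profile search. It assumes Pearson correlation as the similarity measure, which is a common choice but only one of several. The function names and the toy profiles are invented for illustration and are not taken from any product listed here.

```python
import math

def log_ratios(sample, reference):
    """Log2 ratio of each sample intensity to its reference intensity."""
    return [math.log2(s / r) for s, r in zip(sample, reference)]

def center(profile):
    """Subtract the mean so the profile is centered on zero."""
    mean = sum(profile) / len(profile)
    return [x - mean for x in profile]

def pearson(a, b):
    """Pearson correlation between two equal-length profiles."""
    ca, cb = center(a), center(b)
    num = sum(x * y for x, y in zip(ca, cb))
    den = math.sqrt(sum(x * x for x in ca) * sum(y * y for y in cb))
    return num / den if den else 0.0

def profile_search(query, profiles, top=3):
    """Rank genes by how closely their profile correlates with the query."""
    scored = [(pearson(query, p), name) for name, p in profiles.items()]
    return [name for _, name in sorted(scored, reverse=True)[:top]]

# Hypothetical log-ratio profiles across four conditions (made-up numbers).
profiles = {
    "geneA": [0.1, 1.0, 2.1, 2.9],
    "geneB": [3.0, 2.0, 1.1, 0.2],
    "geneC": [0.0, 0.9, 2.0, 3.2],
    "geneD": [1.0, 1.1, 0.9, 1.0],
}
query = [0.0, 1.0, 2.0, 3.0]  # a steadily rising profile
print(profile_search(query, profiles, top=2))
```

The two rising genes come back first and the falling gene ranks last. Swapping in a different distance measure only means replacing the hypothetical `pearson` function.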
Statistics is the latest taste sensation. A year ago, only the most avant-garde products cooked with statistics, but today most products at least dabble with this spice. Statistical methods are becoming more sophisticated as well, reflecting the tremendous work being done by academic microarray statisticians.
What follows is a lot to digest, but I hope my quick tour of microarray dim sum will help you separate the shu mai from the wontons. You may not get exactly what you want, but at least it will be off the right cart.
Full-course commercial dinners for one
Here is the list of full-course, commercial, desktop products. I only mention unusual features of each one, leaving unstated the common features described above.
BioMine, Gene Network Sciences
www.gnsbiotech.com
Unique clustering methods; statistical methods for validating clusters; experiment design tool
BioMiner, MicroDiscovery www.microdiscovery.de
Novel normalization methods; support vector machines
GeneLinker, Molecular Mining
www.molecularmining.com
Novel mutual nearest neighbors clustering
GeneMaths, Applied Maths
www.applied-maths.com
Statistical methods for validating clusters; discriminant analysis
GenePlus, Enodar www.enodar.com
Regression techniques to assess significance of expression changes
GeneSight, BioDiscovery www.biodiscovery.com
Novel normalization methods to correct dye bias; discriminant analysis
GeneSpring, Silicon Genetics
www.sigenetics.com
A market leader that represents the gold standard for this class of product. Discriminant analysis; scripting tools
J-Express, MolMine www.molmine.com
Academic version still available from GeneX. Sammon maps; multidimensional scaling
Partek, Partek
www.partek.com
A rising statistical heavyweight. More than 20 distance measures for clustering; neural nets, discriminant analysis; multidimensional scaling; scripting tools
Pathways 4, ResGen
pathways.resgen.com
Modular architecture for plug-in extensions
Spotfire DecisionSite for Functional Genomics, Spotfire
www.spotfire.com
A market leader that integrates microarray tools with Spotfire’s visualization engine. Formerly called Array Explorer. Offers few novel microarray features per se, but the integration with Spotfire’s visualization tools is delectable. Can cluster on text, ontology classifications, etc., in addition to expression values
Xpression NTI, InforMax www.informaxinc.com
Can filter data by variability, e.g., to eliminate data deemed to be unreliable; novel QT-clustering; Sammon maps
Full-course academic dinners for one
Now for the academic meals:
BRB ArrayTools, Richard Simon, NCI
linus.nci.nih.gov/BRB-ArrayTools.html
Excel plug-in; statistical methods for validating clusters; novel classification method; multidimensional scaling
Cluster/Treeview, Michael Eisen, LBNL
rana.lbl.gov
A market leader that pioneered clustering and other aspects of microarray analysis; its data format is a de facto standard. It has no unique features, because everyone has copied it.
MAExplorer, Laboratory of Experimental and Computational Biology, NCI www.lecb.ncifcrf.gov/mae
Java program that can run as standalone application or applet
TIGR MultipleExperiment Viewer (TMEV), The Institute for Genomic Research
www.tigr.org/softlab
Java application
XCluster, Gavin Sherlock, Stanford
genome-www.stanford.edu/~sherlock/cluster.html
Another pioneering academic program, similar to Cluster; runs on Unix and Linux
Commercial nibbles
The next group of snacks offers unique features for specific problems. Except as noted, all are desktop products.
ArrayStat, Imaging Research imaging.brocku.ca/products
Robust statistical methods to estimate measurement error
BioinformatiXEngine, Xpogen www.xpogen.co
Web-based product intended for use on intranet. Novel clustering method based on relevance networks; modular architecture for plug-in extensions
OmniViz Pro, OmniViz www.omniviz.com
Impressive collection of novel visualization and dimensional reduction methods
Visual Gene, Visipoint
www.visipoint.fi
Uses self-organizing maps (SOMs) for analysis and visualization, in contrast to most products, which use SOMs only for clustering
Commercial banquets
Enterprise products provide an integrated multi-course banquet — consisting of tools and a central database — to feed an entire research organization. Several of the vendors go further and offer a complete meal plan of software tools for other areas of bioinformatics. These products are great if you like the cuisine.
The market leaders in this category are GeneData’s Expressionist (www.genedata.com) and Rosetta’s Resolver (www.rosettabio.com). Resolver was one of the first commercial products to cook with statistics. The product calculates error estimates and propagates these through the analysis. The two largest bioinformatics software vendors, Lion Bioscience and InforMax, also have products in this category: ArrayScout (www.lionbioscience.com) and GenoMax Gene Expression Module (www.informaxinc.com), respectively. A fascinating new product is GeneTraffic from Iobion Informatics (www.iobion.com). GeneTraffic is a network appliance that runs on dedicated, inexpensive Linux computers.
Bring on the spice
The real hot things are academic dishes that push the frontiers of microarray analysis. This software is not for the faint of heart. Some programs are command-line utilities, and many others are code libraries or subroutines. A few have Web versions, but usually these are just demos that offer a quick taste. Much of this software is open source, and some of it is available from the GeneX project (genex.ncgr.org/) at the National Center for Genome Resources; GeneX also operates a website where these tools can be tried out.
Several of the programs implement versions of a technique called borrowing power, described in the box on page 64.
BCLUST, Hongyu Zhao, Yale University School of Medicine
bioinformatics.med.yale.edu
Statistical method for validating clusters using bootstrapping
CLEAVER (Classification of Expression Arrays) Russ Altman, Stanford University classify.stanford.edu
Web server that provides k-means clustering, discriminant analysis, and PCA
CLICK, Ron Shamir and Roded Sharan, Tel Aviv University www.math.tau.ac.il/~roded/click.html
Novel clustering algorithm that uses graph-theoretic and statistical techniques
CLUSFAVOR, Leif Peterson, Baylor mbcr.bcm.tmc.edu/genepi
Bayesian methods for normalization; factor analysis (similar to PCA)
CyberT, Tony Long, University of California, Irvine genebox.ncgr.org/genex/cybert
Part of GeneX. Borrows power and then uses a Bayesian model to assess the significance of expression changes
GEDA: Gene Expression Data Analysis, Christina Kendziorski, University of Wisconsin www.biostat.wisc.edu/geda/eba.html
A highly referenced program that also has a Web version and can be accessed via email. Borrows power and then uses a Bayesian model to assess the significance of expression changes.
K-means Integrated Models for Oligonucleotide Arrays (Kimono) Ian Holmes, Berkeley Drosophila Genome Project
whitefly.lbl.gov/~ihh/kimono
Jointly clusters promoter sequences and expression profiles to find promoters that regulate various genes
MA-ANOVA programs for microarray data, Gary Churchill, Jackson Laboratory www.jax.org/research/churchill
Implements pioneering ANOVA error model that handles many kinds of measurement errors
microarray.zip, Brian Yandell, University of Wisconsin www.stat.wisc.edu/~yandell/statgen/tr1031.html
Borrows power and uses results to improve measurements of low-abundance transcripts
PaGE, Christian J. Stoeckert, Penn Center for Bioinformatics, University of Pennsylvania
www.cbil.upenn.edu/PaGE
Borrows power and then computes confidence levels for direction, but not magnitude, of expression change.
Plaid, Laura Lazzeroni and Art Owen, Stanford University www-stat.stanford.edu/~owen/plaid
Implements new “fuzzy” clustering method that clusters genes and samples simultaneously. Not open source.
RCluster, Karen Schlauch, National Center for Genome Resources genex.ncgr.org/genex/rcluster/help.html
Part of GeneX. Implements several standard clustering methods, and statistical method for validating clusters using bootstrapping.
SAM: Significance Analysis of Microarrays, Rob Tibshirani, Stanford University
www-stat.stanford.edu/~tibs
Excel plug-in that correlates gene expression data with clinical parameters
SMA: Statistics for Microarray Analysis, Terry Speed, University of California, Berkeley
www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
Influential suite of programs, providing basic microarray statistical routines. Also provides normalization functions that correct dye bias and print tip effects.
SVDMAN: Singular Value Decomposition Microarray Analysis, Michael Wall, Los Alamos National Laboratory
public.lanl.gov/mewall/svdman
Uses singular value decomposition (similar to PCA) to partially cluster genes; also calculates confidence measures for clusters.
VERA: Variability and Error Assessment & SAM: Significance of Array Measurement, Trey Ideker, Institute for Systems Biology
www.systemsbiology.org/VERAandSAM/?id=yvfw4
A pair of programs for assessing significance of expression changes using statistical error models
Borrowing Power
A pressing issue in microarray statistics is finding ways to increase power without increasing the number of replicates. This reflects the cold reality that microarray experiments are too expensive for statisticians to do as many as they would like.
One approach is to combine data from multiple genes to better estimate the variance of each one. An (overly) simple idea is to assume that all genes are subject to the same amount of uncontrolled variation. Given this assumption, we can combine the measurements for all genes into one large pool, increasing the effective sample size from the number of replicates (a small number like two or three) to the number of replicates multiplied by the number of genes — a large number like 20,000-30,000, which is certainly large enough to make accurate statistical estimates.
The rub, of course, is that the basic assumption is false. The idea can be resurrected by adopting the weaker assumption that all genes of a given expression level show the same variation or, better, that expression level is a major component of the variation. This is a promising idea that is being pursued in different forms by many microarray statisticians.
— NG
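The pooling idea in the Borrowing Power box can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any program listed above: the function names, the made-up replicate values, and the simple equal-sized binning by mean expression level are all hypothetical. It contrasts three estimates of variance: each gene on its own, all genes pooled (the overly simple assumption), and genes pooled within groups of similar expression level (the weaker assumption).

```python
import statistics

def per_gene_variance(replicates):
    """Sample variance for each gene from its own few replicates."""
    return {g: statistics.variance(v) for g, v in replicates.items()}

def pooled_variance(replicates):
    """One variance for all genes: within-gene sums of squares pooled
    over every gene, divided by the pooled degrees of freedom."""
    ss = 0.0
    df = 0
    for values in replicates.values():
        m = statistics.mean(values)
        ss += sum((x - m) ** 2 for x in values)
        df += len(values) - 1
    return ss / df

def binned_variance(replicates, n_bins=2):
    """Weaker assumption: genes with similar mean expression share a variance.
    Genes are sorted by mean level, split into groups, and pooled per group."""
    genes = sorted(replicates, key=lambda g: statistics.mean(replicates[g]))
    size = -(-len(genes) // n_bins)  # ceiling division
    result = {}
    for start in range(0, len(genes), size):
        group = genes[start:start + size]
        shared = pooled_variance({g: replicates[g] for g in group})
        for g in group:
            result[g] = shared
    return result

# Hypothetical log expression values, two replicates per gene (made-up data).
replicates = {
    "g1": [1.0, 1.4],
    "g2": [1.2, 0.8],
    "g3": [8.0, 8.1],
    "g4": [9.0, 9.1],
}
print(per_gene_variance(replicates))
print(pooled_variance(replicates))
print(binned_variance(replicates, n_bins=2))
```

In this toy data the low-expression genes are much noisier than the high-expression ones, so the single pooled value misrepresents both groups, while the binned estimate keeps them apart and still draws on more than one gene per estimate.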
Nat Goodman, PhD, helped found the Whitehead/MIT Center for Genome Research, directed a bioinformatics group at the Jackson Laboratory, led a bioinformatics marketing team for Compaq Computer, and has been consulting ever since. He is currently a free agent in Seattle. Send your comments to Nat at [email protected]