A Dim Summary of Microarray Software

New microarray software is pouring out of the kitchen like dim sum carts at a Chinese brunch. So many exotic looking morsels — which ones to try? If you’re a novice like me, you can’t even tell fish from fowl, beef from pork, main course from dessert.

That’s OK at brunch, where the worst that could happen is you bite into something you hate. But it’s not so good when you’re picking microarray software, every piece of which takes a lot of time to find, install, and try out.

A year ago in this column I conducted taste-tests of eight microarray products and reviewed six more as a mere observer. In this, my second annual review of microarray software, I’ll describe and list the main ingredients of 41 packages.

Dumplings and chicken’s feet

Some of the products are complete meals with dishes for each of the four courses I defined last year — normalization, filtering, clustering (which I now prefer to call numerical analysis), and biological interpretation — while others are just nibbles that satisfy specific analytical needs. There are desktop products intended for individual diners, and enterprise products that are banquets for the whole company. For the daring gastronome, the spiciest software tools are academic packages that implement the latest and greatest ideas of leading researchers — these can turn an ordinary meal into a special feast.

Most full-course products offer a bland, standard bill of fare. For normalization, they offer various scaling options to correct for hot and cold chips, accompanied by a platter of simple mathematical operations (such as taking logarithms, calculating ratios or fold change, and centering) for changing how the data look.
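For readers who want to see what this course looks like on the plate, here is a minimal sketch of the operations named above: scaling, log ratios, and centering. It is an illustration under stated assumptions, not any vendor's method. The function names, the median-based scaling rule, and the target value of 1000 are all hypothetical choices made for the example.

```python
import math
from statistics import mean, median

def scale_chip(intensities, target=1000.0):
    """Rescale one chip so its median intensity equals `target`.
    Median scaling is an assumed rule; products offer several options."""
    m = median(intensities)
    return [x * target / m for x in intensities]

def log_ratios(sample, reference):
    """Log2 ratio of sample over reference, gene by gene."""
    return [math.log2(s / r) for s, r in zip(sample, reference)]

def center(values):
    """Subtract the mean so the values average to zero."""
    mu = mean(values)
    return [v - mu for v in values]

# Made-up intensities: the same five genes on a "hot" and a "cold" chip.
hot_chip = [200.0, 400.0, 800.0, 1600.0, 3200.0]
cold_chip = [50.0, 100.0, 200.0, 400.0, 800.0]

hot_scaled = scale_chip(hot_chip)
cold_scaled = scale_chip(cold_chip)

# After scaling, the chips agree, so every log ratio is zero.
ratios = log_ratios(hot_scaled, cold_scaled)

# Log-transform, then center around zero.
centered = center([math.log2(x) for x in hot_scaled])
```

The two chips here differ only in overall brightness, which is exactly the case scaling is meant to correct. Real chips also differ in ways that a single scale factor cannot fix.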

For filtering, they let you discard data whose expression levels or changes are too low for your appetite, or discard data based on properties like the functional classification of a gene.
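A filter of the first kind can be sketched in a few lines. The thresholds below (a minimum intensity of 100 and a minimum two-fold change) and the gene names are hypothetical, chosen only to make the example concrete. The sketch also assumes all intensities are positive.

```python
def filter_genes(genes, min_level=100.0, min_fold=2.0):
    """Keep genes that are bright enough on at least one chip
    and change by at least `min_fold` between the two chips."""
    kept = []
    for name, control, treated in genes:
        high, low = max(control, treated), min(control, treated)
        bright_enough = high >= min_level
        fold = high / low  # fold change in either direction
        if bright_enough and fold >= min_fold:
            kept.append(name)
    return kept

# Made-up (name, control intensity, treated intensity) records.
genes = [
    ("geneA", 500.0, 1500.0),  # bright, 3-fold up: kept
    ("geneB", 20.0, 60.0),     # 3-fold, but too dim: dropped
    ("geneC", 800.0, 900.0),   # bright, barely changes: dropped
    ("geneD", 400.0, 100.0),   # bright, 4-fold down: kept
]

survivors = filter_genes(genes)
```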

For numerical analysis (main course), most products serve profile search (which lets you find expression profiles similar to a given one), clustering (hierarchical and otherwise), and principal components analysis (PCA). Several products also entice you with classification methods such as neural nets or support vector machines. For dessert (biological interpretation), most products provide links to external gene sequence or function databases, or permit users to establish such links.
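Of the main-course dishes, profile search is the easiest to sketch. The version below ranks genes by Pearson correlation with a query profile. Pearson correlation is one common similarity measure and is an assumption here; products differ in which measures they offer. The gene names and data are made up, and the sketch would fail on a perfectly flat profile, whose correlation is undefined.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def profile_search(query, profiles, top=2):
    """Return the names of the `top` profiles most correlated with `query`."""
    scored = [(pearson(query, p), name) for name, p in profiles.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top]]

# Made-up expression profiles across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],      # rises steadily
    "geneB": [4.0, 3.0, 2.0, 1.0],      # falls steadily
    "geneC": [10.0, 21.0, 29.0, 40.0],  # rises, on a different scale
    "geneD": [2.0, 2.0, 3.0, 2.0],      # mostly flat
}
query = [0.0, 1.0, 2.0, 3.0]

matches = profile_search(query, profiles)
```

Because correlation ignores scale, geneC matches the query almost as well as geneA even though its raw values are ten times larger.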

Statistics is the latest taste sensation. A year ago, only the most avant garde products cooked with statistics, but today most products at least dabble with this spice. Statistical methods are becoming more sophisticated as well, reflecting the tremendous work being done by academic microarray statisticians.

What follows is a lot to digest, but I hope my quick tour of microarray dim sum will help you separate the shu mai from the wontons. You may not get exactly what you want, but at least it will be off the right cart.

 

Full course commercial dinners for one

 

Here is the list of full-course, commercial, desktop products. I only mention unusual features of each one, leaving unstated the common features described above.

 

BioMine, Gene Network Sciences
www.gnsbiotech.com
Unique clustering methods; statistical methods for validating clusters; experiment design tool

BioMiner, MicroDiscovery www.microdiscovery.de
Novel normalization methods; support vector machines

GeneLinker, Molecular Mining
www.molecularmining.com
Novel mutual nearest neighbors clustering

GeneMaths, Applied Maths
www.applied-maths.com
Statistical methods for validating clusters; discriminant analysis

GenePlus, Enodar www.enodar.com
Regression techniques to assess significance of expression changes

GeneSight, BioDiscovery www.biodiscovery.com
Novel normalization methods to correct dye bias; discriminant analysis

GeneSpring, Silicon Genetics
www.sigenetics.com
A market leader that represents the gold standard for this class of product. Discriminant analysis; scripting tools

J-Express, MolMine www.molmine.com
Academic version still available from GeneX. Sammon maps; multidimensional scaling

Partek, Partek
www.partek.com
A rising statistical heavyweight. More than 20 distance measures for clustering; neural nets, discriminant analysis; multidimensional scaling; scripting tools

Pathways 4, ResGen
pathways.resgen.com
Modular architecture for plug-in extensions

Spotfire DecisionSite for Functional Genomics, Spotfire
www.spotfire.com
A market leader that integrates microarray tools with Spotfire’s visualization engine. Formerly called Array Explorer. Offers few novel microarray features per se, but the integration with Spotfire’s visualization tools is delectable. Can cluster on text, ontology classifications, etc., in addition to expression values

Xpression NTI, InforMax www.informaxinc.com
Can filter data by variability, e.g., to eliminate data deemed to be unreliable; novel QT-clustering; Sammon maps

 

Full course academic dinners for one

 

Now for the academic meals:

 

BRB ArrayTools, Richard Simon, NCI
linus.nci.nih.gov/BRB-ArrayTools.html
Excel plug-in; statistical methods for validating clusters; novel classification method; multidimensional scaling

Cluster/Treeview,
Michael Eisen, LBNL rana.lbl.gov
A market leader that pioneered clustering and other aspects of microarray analysis; its data format is a de facto standard. It has no unique features, because everyone has copied it.

MAExplorer, Laboratory of Experimental and Computational Biology, NCI www.lecb.ncifcrf.gov/mae
Java program that can run as standalone application or applet

TIGR MultipleExperiment Viewer (TMEV), The Institute for Genomic Research
www.tigr.org/softlab
Java application

XCluster, Gavin Sherlock, Stanford
genome-www.stanford.edu/~sherlock/cluster.html
Another pioneering academic program, similar to Cluster; runs on Unix and Linux

 

Commercial nibbles

 

The next group of snacks offers unique features for specific problems. Except as noted, all are desktop products.

 

ArrayStat, Imaging Research imaging.brocku.ca/products
Robust statistical methods to estimate measurement error

BioinformatiXEngine, Xpogen www.xpogen.co
Web-based product intended for use on intranet. Novel clustering method based on relevance networks; modular architecture for plug-in extensions

OmniViz Pro, OmniViz www.omniviz.com
Impressive collection of novel visualization and dimensional reduction methods

Visual Gene, Visipoint
www.visipoint.fi
Uses self-organizing maps (SOMs) for analysis and visualization, in contrast to most products, which use SOMs only for clustering

 

 

Commercial banquets

 

Enterprise products provide an integrated multi-course banquet — consisting of tools and a central database — to feed an entire research organization. Several of the vendors go further and offer a complete meal plan of software tools for other areas of bioinformatics. These products are great if you like the cuisine.

 

The market leaders in this category are GeneData’s Expressionist (www.genedata.com) and Rosetta’s Resolver (www.rosettabio.com). Resolver was one of the first commercial products to cook with statistics. The product calculates error estimates and propagates these through the analysis. The two largest bioinformatics software vendors, Lion Bioscience and InforMax, also have products in this category: ArrayScout (www.lionbioscience.com) and GenoMax Gene Expression Module (www.informaxinc.com), respectively. A fascinating new product is GeneTraffic from Iobion Informatics (www.iobion.com). GeneTraffic is a network appliance that runs on dedicated, inexpensive Linux computers.

 

Bring on the spice

 

The really hot items are academic dishes that push the frontiers of microarray analysis. This software is not for the faint of heart. Some programs are command-line utilities, and many others are code libraries or subroutines. A few have Web versions, but usually these are just demos that offer a quick taste. Much of this software is open source, and some of it is available from the GeneX project (genex.ncgr.org/) at the National Center for Genome Resources; GeneX also operates a website where these tools can be tried out.

Several of the programs implement versions of a technique called borrowing power, described in the “Borrowing Power” box below.

BCLUST, Hongyu Zhao, Yale University School of Medicine
bioinformatics.med.yale.edu
Statistical method for validating clusters using bootstrapping

CLEAVER (Classification of Expression Arrays) Russ Altman, Stanford University classify.stanford.edu
Web server that provides k-means clustering, discriminant analysis, and PCA

CLICK, Ron Shamir and Roded Sharan, Tel Aviv University www.math.tau.ac.il/~roded/click.html
Novel clustering algorithm that uses graph-theoretic and statistical techniques

CLUSFAVOR, Leif Peterson, Baylor mbcr.bcm.tmc.edu/genepi
Bayesian methods for normalization; factor analysis (similar to PCA)

CyberT, Tony Long, University of California, Irvine genebox.ncgr.org/genex/cybert
Part of GeneX. Borrows power and then uses a Bayesian model to assess the significance of expression changes

GEDA: Gene Expression Data Analysis, Christina Kendziorski, University of Wisconsin www.biostat.wisc.edu/geda/eba.html
A highly referenced program that also has a Web version and can be accessed via email. Borrows power and then uses a Bayesian model to assess the significance of expression changes.

K-means Integrated Models for Oligonucleotide Arrays (Kimono) Ian Holmes, Berkeley Drosophila Genome Project
whitefly.lbl.gov/~ihh/kimono
Jointly clusters promoter sequences and expression profiles to find promoters that regulate various genes

MA-ANOVA programs for microarray data, Gary Churchill, Jackson Laboratory www.jax.org/research/churchill
Implements pioneering ANOVA error model that handles many kinds of measurement errors

microarray.zip, Brian Yandell, University of Wisconsin www.stat.wisc.edu/~yandell/statgen/tr1031.html
Borrows power and uses results to improve measurements of low-abundance transcripts

PaGE, Christian J. Stoeckert, Penn Center for Bioinformatics, University of Pennsylvania www.cbil.upenn.edu/PaGE
Borrows power and then computes confidence levels for direction, but not magnitude, of expression change.

Plaid, Laura Lazzeroni and Art Owen, Stanford University www-stat.stanford.edu/~owen/plaid
Implements new “fuzzy” clustering method that clusters genes and samples simultaneously. Not open source.

RCluster, Karen Schlauch, National Center for Genome Resources genex.ncgr.org/genex/rcluster/help.html
Part of GeneX. Implements several standard clustering methods, and statistical method for validating clusters using bootstrapping.

SAM: Significance Analysis of Microarrays, Rob Tibshirani, Stanford University
www-stat.stanford.edu/~tibs
Excel plug-in that correlates gene expression data with clinical parameters

SMA: Statistics for Microarray Analysis, Terry Speed, University of California, Berkeley
www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
Influential suite of programs, providing basic microarray statistical routines. Also provides normalization functions that correct dye bias and print tip effects.

SVDMAN: Singular Value Decomposition Microarray Analysis, Michael Wall, Los Alamos National Laboratory
public.lanl.gov/mewall/svdman
Uses singular value decomposition (similar to PCA) to partially cluster genes; also calculates confidence measures for clusters.

VERA: Variability and Error Assessment & SAM: Significance of Array Measurement, Trey Ideker, Institute for Systems Biology
www.systemsbiology.org/VERAandSAM/?id=yvfw4
A pair of programs for assessing significance of expression changes using statistical error models

 

Borrowing Power

A pressing issue in microarray statistics is finding ways to increase power without increasing the number of replicates. This reflects the cold reality that microarray experiments are too expensive for statisticians to do as many as they would like.

One approach is to combine data from multiple genes to better estimate the variance of each one. An (overly) simple idea is to assume that all genes are subject to the same amount of uncontrolled variation. Given this assumption, we can combine the measurements for all genes into one large pool, increasing the effective sample size from the number of replicates (a small number like two or three) to the number of replicates multiplied by the number of genes — a large number like 20,000-30,000, which is certainly large enough to make accurate statistical estimates.

The rub, of course, is that the basic assumption is false. The idea can be resurrected by adopting the weaker assumption that all genes of a given expression level show the same variation or, better, that expression level is a major component of the variation. This is a promising idea that is being pursued in different forms by many microarray statisticians.

— NG
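The two assumptions in the box can be sketched with a textbook pooled-variance estimate. This is a minimal illustration, not the method of any program listed above. The data are made up, and the binning rule (sort genes by mean expression and split them into equal-sized groups) is a hypothetical stand-in for "genes of a given expression level."

```python
from statistics import mean

def pooled_variance(replicates_by_gene):
    """Pool within-gene sums of squares across genes.
    Assumes every gene in the pool shares one variance."""
    ss = 0.0
    df = 0
    for reps in replicates_by_gene:
        m = mean(reps)
        ss += sum((x - m) ** 2 for x in reps)
        df += len(reps) - 1  # each gene contributes (replicates - 1)
    return ss / df

def pooled_variance_by_level(replicates_by_gene, n_bins=2):
    """Weaker assumption: only genes of similar mean expression share
    a variance. Returns (mean level, pooled variance) for each bin."""
    ranked = sorted(replicates_by_gene, key=mean)
    size = -(-len(ranked) // n_bins)  # ceiling division
    bins = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    return [(mean([mean(g) for g in b]), pooled_variance(b)) for b in bins]

# Made-up data: four genes, two replicates each.
genes = [
    [1.0, 3.0],    # low expression, small spread
    [2.0, 4.0],    # low expression, small spread
    [10.0, 16.0],  # high expression, large spread
    [12.0, 18.0],  # high expression, large spread
]

one_pool = pooled_variance(genes)           # all genes share one variance
by_level = pooled_variance_by_level(genes)  # one variance per expression bin
```

With two replicates, each gene on its own has a single degree of freedom for estimating variance. Pooling all four genes gives four, but the single estimate overstates the spread of the low-expression genes and understates that of the high-expression ones. Pooling within expression bins gives up some degrees of freedom in exchange for estimates that fit each group.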

 

Nat Goodman, PhD, helped found the Whitehead/MIT Center for Genome Research, directed a bioinformatics group at the Jackson Laboratory, led a bioinformatics marketing team for Compaq Computer, and has been consulting ever since. He is currently a free agent in Seattle. Send your comments to Nat at [email protected]

 
