Marketeers the world over know that the S-word sells. The human drive for S is so strong that not even the most politically correct modern wo/man can resist its appeal. These forces affect even the staid world of bioinformatics where we are seeing an upsurge in S-related promotion. You know the S-word I’m talking about: Standards.
S-mania has struck the genomics world in the form of a major push to develop comprehensive standards for microarray data. Even sober Science has lauded the movement, suggesting that microarray data won’t reach its potential until standards are in place and enforced.
Different kinds of people need different standards. We’re not so far from standards that will help bioinformaticians who are developing microarray software. But we’re a long way from standards that will help scientists extract scientific knowledge from data. A big part of the problem is that the community has not yet learned how to report the scientific “facts” that come from microarray experiments without getting bogged down in experimental details.
S-appeal is visceral. Standards enable sharing and teamwork — virtues we’re taught at Momma’s knee. Share your toys, win for the team, give to charity, support our troops. These virtues seem necessary for our survival as a social species.
But powerful instincts push in the other direction, too. When you accept a standard, you subordinate your individual free will to the communal consensus. You have to jump on the bandwagon, accept the bad with the good, and surrender your right to creatively improve the standard. Some of us seem programmed to resist the herd.
Successful standards have to get the warm, fuzzy “sharing and teamwork” juices flowing without triggering a nasty “not invented here” rejection. They’re love at first sight. They should cost nothing until you’re ready to use them, and then they should deliver immediate payback.
Getting to know your data
Microarray data is born when a hybridized chip is scanned to create an image. The image is processed by image analysis software to find the spots or cells, and to determine the foreground and background intensities for each. From the intensity data, the software estimates the expression level, which can be stated in absolute terms, such as Affymetrix’s notion of average difference, or in relative terms, such as ratios or fold-changes, and may be augmented with quality indicators, such as presence/absence codes or error bars. Up to this point, everything is done on a chip-by-chip basis.
After accumulating expression level data from multiple chips, the real work of data analysis can begin. This is where the Menu of Microarray Software that I discussed here in March enters the scene. Typical tasks include normalization, filtering, pattern discovery, and biological interpretation. The results are scientific “facts” — statements like, “These genes show similar expression patterns over these experimental conditions.”
As always in science, data analysis is ongoing and iterative as scientists yearn to learn as much as they can from the data. They’ll run the data through multiple analysis programs and compare the results, feed the output of one analysis program into another, and so forth. Rarely is one analysis definitive.
How standards can help
From the standpoint of a software developer, standards are useful at all stages, because they make it easier to mix and match solutions for various pieces.
For scientists who are trying to use the data, the benefits are subtler. To understand where standards could help scientific users, let’s reason by analogy to the sequencing world. Microarray images are analogous to the gel images you get from sequencing instruments. Analyzed images are like chromatograms (a.k.a. chromats). Expression levels correspond to base calls, with the expression level data for one chip corresponding to the raw sequence text, or read, obtained from one lane or capillary.
Few scientists have much use for data at these fairly raw levels. Sure, specialists who assemble genomic sequences or hunt for SNPs use chromats to decide whether a sequence variation looks real. But this is the exception, not the rule. Indeed, people who work with such raw data often know a lot of experimental details — the sequencing chemistry, vectors, and so forth. For most scientists, interest begins with data that have been analyzed to a level where experimental details don’t matter so much.
Sequencing is mature enough that the expected outputs of data analysis are reasonably well defined. For example, in genomic sequencing, you assemble reads into contigs, and then conduct gene finding and such. In full-length cDNA sequencing, you publish long sequences that you believe come from real transcripts, and then run BLAST or other programs to find putative homologs for the implied proteins. In EST sequencing, you put your reads into a database, and someone clusters them.
The microarray world, meanwhile, is still so young that there is no consensus as to what analytical products should be placed in databases. Should you submit your clusters, or lists of genes showing significant changes in expression levels, or what?
Lacking this consensus, the tendency is to disseminate the raw expression levels. This is analogous to publishing raw sequence reads. While it’s better than nothing, it’s probably not terribly useful in the long run.
The problem is that the scientific user of such data has to know too many experimental details to analyze it in scientifically valid ways. Imagine if you couldn’t analyze sequences without knowing whether they came from double- or single-stranded vectors.
State of the art
Image formats are pretty well standardized already, thanks to decades of experience in the computer industry. You’ve probably heard the names — such as TIFF and GIF.
But there are no standards for analyzed images, which means that you have to cope with the unique formats of the various image-analysis packages. This is no big deal for those residing in the Affymetrix universe, because there is only one real choice at present — Affy’s own products.
Spotted arrays, for which there are about a dozen widely used commercial and academic packages, are more complicated. Mike Eisen’s ScanAlyze seems to be the leading academic package and, thanks to its wide use in the prominent Stanford-centered microarray universe, may emerge as a de facto standard. De facto standards may also come from major database efforts, such as NCBI’s Gene Expression Omnibus (GEO).
For expression levels, Mike Eisen’s Cluster format is close to a lingua franca. Many data analysis packages support this format in addition to their own proprietary ones, and it seems destined to become the FASTA format of microarrays.
There are no standards for the outputs of analysis programs. Both Eisen’s Cluster format and Spotfire’s program add columns to the dataset — e.g., cluster identifiers — that summarize the analysis. This seems a reasonable approach and may gain acceptance.
A complication is that many investigators invent personalized formats for their analyzed results. This will likely persist until the field reaches a consensus on the best ways to report analytical outputs.
Mainstream work on microarray standards
Several groups have been working hard to develop comprehensive standards for microarray data. The Microarray Gene Expression Group (MGED) has been working on an XML standard called MAML and a related proposal entitled Minimum Information About Microarray Experiments (MIAME). Other efforts include GeneXML, from the GeneX project at the National Center for Genome Resources in Santa Fe, and GEML, from a group led by Rosetta and NetGenics.
These efforts are now converging. The MAML and GEML people are working toward a common standard called MAGE, and the GeneXML folks have announced their intention to adopt this standard when it becomes available. MAGE is being developed as an official standard through the Life Sciences Research Task Force of the Object Management Group.
Despite the claimed convergence, the MAML group seems to be pursuing MIAME independently, and is in the process of publishing its proposal.
MIAME (also see sidebar) focuses on the scientific context of microarray experiments — where the samples come from, how they’re treated, what the experimental variables are, and so forth — and says little about the data or results. The basic goal is to standardize the way scientists describe their experiments so that other investigators can interpret the results. I question whether they’ll ever attain this goal; the science is varied, the technology is changing rapidly, and scientists are very clever at coming up with new kinds of experiments. Good thinking has gone into MIAME, and it captures a lot of what’s important about microarray experiments. The proposal might be better cast as a set of guidelines to help investigators write up their experiments, rather than as a standard.
MAGE is huge and addresses the entire microarray process from chip design to data analysis (see IT Guy, June 2001). It weighs in at more than 100 pages with many sections yet to be completed. It includes sections on (1) chip design, (2) chip manufacture, (3) sample source and treatment, (4) sequences, (5) hybridizations (referred to more generally as assays), (6) experiments (meaning a coordinated series of hybridizations), (7) protocols (descriptions of laboratory or analytical procedures), (8) units of measure, (9) events (steps in a workflow), (10) data, including images, analyzed images, expression levels, and analytical results, and (11) audit and security.
There’s a lot of good stuff here, but it’s so complicated that it seems unlikely to gain much acceptance. In fact, it’s so complex that it’s hard to tell if it’s right. It would be better, I think, to propose separate, simple standards for each stage of the microarray process. That way people could adopt the parts that were useful to them without having to accept the entire dogma.
Reality or fantasy?
Standardizing cutting-edge science or technology such as microarrays is a daunting challenge. Simple proposals with immediate payback have the best chance of success.
We’re close to having workable de facto standards for some types of microarray data. These standards are analogous to the simple, but very useful, FASTA format from the sequence world. Such standards will come in handy for bioinformaticians charged with developing microarray software, but are not make-or-break.
Lacking a standard, software developers just have to create format converters. This is not so hard, and who knows, maybe some nice person will develop a general converter analogous to Don Gilbert’s ReadSeq program.
Sadly, we’re further away from the standards needed by scientific users of microarray data. Scientists need a way to access analytical results — for example, clusters of genes with similar expression patterns — without having to understand all the experimental details. The big stumbling block is gaining consensus as to what kinds of results are reliable enough and useful enough that it makes sense to put them in a database. It will take time to reach this consensus.
In the meantime, all this S-talk is mere fantasy.
FASTA Format: Start with a KISS
Most successful bioinformatics standards arise after the fact. Developers of new software tend to adopt the formats used by existing popular programs or databases, making these formats even more popular. If the process continues, these formats become de facto standards.
FASTA format — one of the most successful standards in all of bioinformatics — is a prime example. Originally developed by Bill Pearson and colleagues for their FASTA suite of sequence analysis programs, it is now accepted by essentially all sequence analysis programs. Because of its widespread use, you can store your sequences in this one format, and easily analyze them using many different programs.
The format is simple. It defines a format for computer files containing sequence entries. Each sequence entry consists of a one-line description followed by the sequence itself. The description line begins with a greater-than sign (>). The sequence can span multiple lines and continues until the next description line or the end of the file.
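The definition above is complete enough to code against. Here is a minimal reader, sketched in Python; the function name and the list-of-pairs return value are my own choices, not part of any standard.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into (description, sequence) pairs.

    A description line starts with '>'; every non-blank line that
    follows, up to the next '>' or the end of input, is sequence.
    """
    entries, desc, seq = [], None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if desc is not None:
                entries.append((desc, "".join(seq)))
            desc, seq = line[1:], []
        elif line:
            seq.append(line)
    if desc is not None:
        entries.append((desc, "".join(seq)))
    return entries
```

Passing it the lines of any FASTA file yields one pair per entry, with multi-line sequences joined.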
This format is useful even though it is hopelessly naïve and ignores many issues that experts regard as critical. The format knows not even the most basic facts about a sequence, such as whether it is DNA or protein, or the difference between genomic DNA and cDNA and between finished sequence and raw. Nor does it know anything about sequence features — for example, where the coding regions are in a sequence — nor about the biological context of a sequence — for example, whether it’s human or bug.
Lacking this knowledge, the format cannot prevent you from doing completely stupid things like running a protein structure prediction program on a DNA sequence, or a gene finder on a cDNA sequence, or an intron finder on a bacterial sequence. Nonetheless, it’s very useful. What’s valuable is not the format per se, but rather its popularity.
This is an example of the well-known KISS principle of software: Keep It Simple, Stupid!
The flip side is that simple standards like this are not truly essential. Yes, FASTA is a big help, but it’s not terribly hard to translate among sequence formats. Every bioinformatics group has its suite of sequence converters, and Don Gilbert’s general converter, ReadSeq, the latest version of which handles 23 formats, has been available since 1989.
Of course, KISS is not all it takes for a format to win our hearts. It’s difficult to predict which formats will catch on.
One example of a great format that has not made the big time is the General Feature Format (GFF), originally described in 1997 by Richard Durbin and David Haussler, two bioinformatics superstars. GFF is a format for defining features on sequences, for example, to say that there’s a promoter at position 10-25 in the sequence, and a transcript at position 50-1,000.
GFF has all the right qualities. It’s simple — just a tab-delimited text file containing obviously necessary information. Several useful programs accept it. It comes from well-respected and well-liked people. Still, it hasn’t advanced beyond “just friends.”
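To make this concrete, here is a sketch of the two features mentioned above as tab-delimited GFF records, plus a few lines of Python to split them into named fields. The sequence and source names are hypothetical; the field order follows the GFF definition (seqname, source, feature, start, end, score, strand, frame).

```python
# Two hypothetical GFF records: a promoter at 10-25 and a
# transcript at 50-1000 on a sequence called "seq1".
GFF_LINES = [
    "seq1\texample\tpromoter\t10\t25\t.\t+\t.",
    "seq1\texample\ttranscript\t50\t1000\t.\t+\t.",
]

GFF_FIELDS = ("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame")

def parse_gff(lines):
    """Split tab-delimited GFF lines into dicts keyed by field name."""
    records = []
    for line in lines:
        values = line.rstrip("\n").split("\t")
        rec = dict(zip(GFF_FIELDS, values))
        rec["start"], rec["end"] = int(rec["start"]), int(rec["end"])
        records.append(rec)
    return records
```

The whole point of a format this plain is that such a parser fits on an index card.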
To my knowledge, no successful bioinformatics standards have ever come about through arranged marriages. It has never worked for people to consciously design a great standard, and then peddle it to the community. It’s not for want of trying.
One spectacular failure is the sequence standard developed by the Life Sciences Research Task Force of the Object Management Group. A distinguished group of bioinformaticians from places such as the European Bioinformatics Institute, Millennium Pharmaceuticals, and NetGenics labored for three years (from August 1997 until November 2000), producing a 178-page document defining this standard. As far as I can tell, no one is using it for anything real. — NG
Setting the Standards
Microarray Gene Expression Database Group (MGED) & MAML
Life Sciences Research (LSR) Task Force of Object Management Group (OMG): LSR sequence standard
GeneX & GeneXML
Bill Pearson: ftp.virginia.edu/pub/fasta
Gene Expression Omnibus (GEO)
General Feature Format (GFF): Sanger Centre, www.sanger.ac.uk/Software/formats/GFF
Don Gilbert: chipmunk.bio.indiana.edu/~gilbertd/about
ScanAlyze & Cluster: Mike Eisen, rana.lbl.gov/EisenSoftware.htm
Are We Speaking the Same Language?
Microarray terminology is in serious disarray and would benefit from some standardization. Here are a few candidate concepts in need of standard terms:
1. Chips, both as physical objects and as archetypes for the physical objects, and chip sets in cases where it takes multiple chips to cover the genes you want to measure. Chip, chip set, and array (not to be confused with the same word that computer and math folks use as a synonym for a table of data) are used interchangeably.
2. DNA affixed to the chip. Some say probe, but MAGE (and Rosetta before them) says reporter.
3. RNA in solution that is hybridized to the chip. The term target, although awkward, parallels probe. Some use the generic term sample instead, but from the user’s perspective, the sample was the cells selected, not the RNA solution applied to the chip. And from the statistician’s perspective the sample is yet something else.
4. Experiment performed by hybridizing a specific RNA solution to a specific chip or chip set. Hybridization seems natural, but some call this an experiment, which again is too generic. When chip sets are used, we need a way to distinguish the hybridization done on each chip from the several hybs done using the same RNA across the chip set. Hybridization set?
5. An organized collection of hybridizations or hybridization sets. Some call it an experiment. How about hybridization series?
6. The data collected from a single hybridization or hybridization set. Such data exists at several levels: raw images, analyzed images, expression levels, and results produced through downstream analysis of these data. Call it an image, a spot, a raw read (by analogy with sequence reads), or a processed read.
7. The ensemble of data produced through a hybridization series, and, by extension, datasets assembled from multiple such series. It’s natural to think of this as a matrix in which the rows represent probes, the columns represent targets, and each cell contains the data produced for that probe by hybridizing that target to the chip. The MIAME folks call this a gene expression data matrix. Data matrix would be less of a mouthful, but perhaps too generic.
8. The data contained in one cell of the data matrix. Data point or data cell?
A further constraint is that we’d like the same words to work for relatively raw data, as well as increasingly processed data. In the course of processing data, rows or columns are often combined, and the connection between a row or column and a specific target or probe may become tenuous.
When combining data from replicates, for example, the resulting column may contain the average (or some other function) of the data for several targets. Does it make sense to use target (or sample, for that matter) as the term for such a column?
Likewise, if the chip contains multiple probes from the same gene, you may choose to combine the data for these probes into a single row that better represents the gene. You might even get fancy and construct rows for different transcripts of the gene (if, for example, your chip contains oligos from different exons). Does it make sense to use probe for such a combined row?
As an IT guy, my instinct is to tack the word “virtual” onto the original terms, giving us virtual probe and virtual target. Too geeky?
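A tiny sketch of the data matrix from item 7, with a “virtual target” built by averaging replicate columns, may make these terminology questions concrete. All gene and sample names here are hypothetical.

```python
# Hypothetical expression data matrix: rows are probes (genes on the
# chip), columns are targets (RNA samples hybridized), and each cell
# holds the expression level measured for that probe/target pair.
probes = ["geneA", "geneB", "geneC"]
matrix = {
    ("geneA", "liver_rna"): 1.8, ("geneA", "kidney_rna"): 0.4,
    ("geneB", "liver_rna"): 0.9, ("geneB", "kidney_rna"): 1.1,
    ("geneC", "liver_rna"): 2.5, ("geneC", "kidney_rna"): 0.2,
}

def column(matrix, probes, target):
    """Return one target's expression data across all probes."""
    return [matrix[(p, target)] for p in probes]

def virtual_target(matrix, probes, replicate_targets):
    """Average replicate columns into a single 'virtual target' column."""
    cols = [column(matrix, probes, t) for t in replicate_targets]
    return [sum(vals) / len(vals) for vals in zip(*cols)]
```

Once a column is an average of several RNA samples, calling it a target (or a sample) strains both words — hence the naming problem.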
I don’t expect the community to standardize on terminology. It would be nice, though, for authors to get in the habit of stating the words they use for these concepts to avoid confusion. — NG
Codifying Chip Analysis: A Losing Battle?
MIAME has two goals: to define the minimum information needed to ensure that microarray data can be easily interpreted and independently verified; and to structure the information in a way that supports querying and data analysis.
Its authors see the MIAME standard as a step toward the establishment of standardized public databases. They encourage journals and funding agencies to require that investigators submit their data to these sites.
Notwithstanding the word “minimum” in its name, MIAME is rather expansive. It defines approximately 100 data elements grouped into six main areas: experimental design; chip layout; sample source and treatment; hybridizations; measurements; and normalization and controls. The authors promise a seventh area on quality control in the next version.
MIAME focuses on the scientific context of experiments, rather than data. Only one of the six subject areas is about data, and it merely requires that the data be there.
Instead, MIAME’s main point is to codify the way scientists describe their experiments. Bear in mind that every scientific paper that reports an experiment seeks to provide enough information so that other scientists can interpret and potentially reproduce the work. MIAME wants to structure the way scientists provide this information.
The presumed rationale for going down this path is an assumption that codified descriptions work better for querying and data analysis. The proposal never explores this assumption.
This is a technical issue that depends on the software used for accessing the data. We’re all familiar with Web search engines that do a fine job on unstructured documents. The better ones nowadays go far beyond simple text searching and include stemming, thesauri, and other sophisticated semantic capabilities. It’s plausible that one could do a better job on data access by training a sophisticated search engine to understand normal microarray papers than by forcing scientists to conform to an imposed structure.