Under the Hood
Nat Goodman examines the innards of microarray databases
A complete microarray project has five major components: chip design, chip construction, data acquisition, image analysis, and data analysis. To get your project zooming at top speed, you need some powerful systems under your database hood.
Grab a work lamp and let’s take a look at how these babies are put together. We’ll start with the drivetrain — the part that puts power to the wheels of the project.
Each stage of a microarray project needs to be tracked by software. The traditional laboratory information management system tracks wet phases. For dry phases you need an analysis information management system. (See sidebar for more about LIMS and AIMS.)
The job of chip design is to decide which genes to put on the chip and which probes to use for each gene. This amounts to selecting a set of clones for cDNA arrays or, for oligo arrays, designing oligos for the regions of each gene you want to cover.
From an informatics standpoint, chip design is a heavy-duty sequence-analysis problem. You need an AIMS to keep track of all the analyses. The output is a chip layout, or a computerized blueprint of what is spotted on the chips.
Chip construction starts from a chip layout and prints a stack of chips. Clone or oligo picking is followed by setting up plates for the spotting machine and then letting ’er rip. There’s also a pile of bookkeeping to keep track of the zillions of clones and oligos involved, many of which are bought from supply companies and have to be marshaled through your company’s bureaucracy. Informatics-wise, this is a basic LIMS problem plus inventory and order management. The output is a list of chip identifiers, typically barcodes.
Data acquisition is where the rubber meets the road. The inputs are a sample and a chip. Tasks include RNA extraction, labeling, hybridization, and finally image capture (i.e., scanning). Chip data comes from the chip construction step above. Sample data might come from a sample acquisition LIMS but, more typically, is entered manually. To an informatician, this is a mainstream LIMS problem: challenging, but nothing that’ll crack a cylinder. This stage is mostly wet, but its output, a scanned image, is dry. This is where we leap from the wet world of plates and robots to the dry world of computer files and software.
Next comes image analysis, whose job is to convert the scanned image into estimated expression levels for each probe on the chip. The process begins with several reasonably straightforward steps: gridding, which overlays a rectangular grid onto the image; segmentation, which figures out where the spots are within the grid; and intensity extraction, which calculates the brightness of each spot and its local background.
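To make the three steps concrete, here is a minimal Python sketch. It is a toy, not any vendor’s algorithm: the function names (`grid_cells`, `extract_spot`, `analyze`) are made up for illustration, the grid is assumed to divide the image evenly, and segmentation is reduced to a crude rule that treats pixels brighter than the cell’s median as spot and the rest as local background.

```python
from statistics import mean, median

def grid_cells(image, rows, cols):
    """Gridding: split the image into rows x cols equal rectangular cells."""
    cell_h = len(image) // rows
    cell_w = len(image[0]) // cols
    for r in range(rows):
        for c in range(cols):
            band = image[r * cell_h:(r + 1) * cell_h]
            yield (r, c), [row[c * cell_w:(c + 1) * cell_w] for row in band]

def extract_spot(cell):
    """Segmentation plus intensity extraction for one grid cell.

    Assumption: pixels above the cell median belong to the spot,
    everything else is local background.
    """
    pixels = [p for row in cell for p in row]
    cut = median(pixels)
    spot = [p for p in pixels if p > cut]
    background = [p for p in pixels if p <= cut]
    foreground = mean(spot) if spot else mean(background)
    return foreground, mean(background)

def analyze(image, rows, cols):
    """Return {(row, col): (foreground, background)} for every cell."""
    return {pos: extract_spot(cell) for pos, cell in grid_cells(image, rows, cols)}

# A single cell holding a bright 2x2 spot on a dim background.
image = [
    [10, 10, 10, 10],
    [10, 100, 100, 10],
    [10, 100, 100, 10],
    [10, 10, 10, 10],
]
result = analyze(image, rows=1, cols=1)
```

For this image, `result` reports a foreground of 100 and a local background of 10 for cell (0, 0). Real packages use far more careful segmentation, which is part of why this stage can need hand tuning.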
The remaining work — getting from intensities to expression levels — is more contentious and is an area of active research. The output is an estimate of the expression level for each probe on the chip, possibly augmented with intensity information.
Image analysis is largely automatic with robust chip technology, and the informatics aspects are trivial — just record the answers. With other technology, this stage can entail a lot of hand tuning, and it’s worthwhile to keep track of the work in an AIMS.
The final phase — and it’s a huge one — is to analyze expression-level data from multiple experiments and learn something useful about the biology. This is where the Menu of Microarray Software that I discussed in my March 2001 column comes onto the table. Tasks include normalization, filtering, pattern discovery (e.g., clustering), and biological interpretation. The output is biological knowledge. Some of this takes the form of structured analysis results, such as clusters of genes, but most lives in unstructured documents.
The Microarray Mechanic’s Maze
It goes without saying that the system has to do a good job with the expression data itself. This turns out to be a tricky technical problem.
Different chip technologies produce different kinds of measurements. Measurements using spotted arrays involve foreground versus background intensities, while Affymetrix chips use perfect-match probes versus mismatch probes. Affy produces one-color data, while spotted arrays produce two colors, and we’re starting to see instruments with dozens of colors.
To make matters worse, the data changes its nature as it travels down the analysis road. With spotteds, people quickly convert intensities into ratios of the two colors, which they often express as log ratios or fold changes. This practice is beginning to change because of a growing body of research indicating the importance of reporting absolute intensities, not just ratios.
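The arithmetic fits in a few lines of Python. The function names and the signed fold-change convention (a fourfold drop is reported as -4) are assumptions for illustration; labs differ on the details. The last lines show why ratios alone lose information: a dim spot and a bright spot can share exactly the same ratio.

```python
import math

def log_ratio(red, green):
    """Log base 2 of the red/green intensity ratio."""
    return math.log2(red / green)

def fold_change(red, green):
    """Signed fold change: 4.0 means fourfold up in red, -4.0 fourfold down."""
    ratio = red / green
    return ratio if ratio >= 1 else -1 / ratio

dim_spot = (80, 20)          # (red, green), close to the noise floor
bright_spot = (8000, 2000)   # same ratio, far stronger signal

# Both spots collapse to the same number once you keep only the ratio.
same_ratio = log_ratio(*dim_spot) == log_ratio(*bright_spot)
```

Here `same_ratio` is true, even though most people would trust the bright spot a good deal more than the dim one. Storing the absolute intensities alongside the ratio keeps that distinction available.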
On Affy chips, data start in absolute (single-chip) or relative (multi-chip) formats. Neither option is great for downstream data analysis because Affy provides qualitative, rather than statistical, quality indicators. Many users quickly massage their Affy data into personalized formats that are more suitable for data analysis purposes.
The problem will become even more complex as statistical methods come into greater use. It will soon become standard practice to include statistical estimates of uncertainty with the data — error bars, standard deviations, or whatever — but the specifics will vary considerably.
Given this complex maze, there are three ways the database can cope. One is to punt and simply store the data without understanding what it means — a mega-copout that solves nothing. Another is to hardwire the particular formats of interest to you today, recognizing that you’ll have to modify the database repeatedly as the field evolves — this is eminently pragmatic. A third is to develop a means for easily incorporating new data formats and converting among them — this can be accomplished with élan using the computer science method called object-oriented modeling.
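As a rough illustration of the third option, the Python sketch below defines one abstract measurement type and two concrete formats. Everything here is hypothetical: the class names, the `corrected_signals` method, and the choice to reduce a perfect-match/mismatch pair to a simple difference are assumptions made for the example, not a description of any shipping schema.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

class Measurement(ABC):
    """Interface the rest of the database codes against."""

    @abstractmethod
    def corrected_signals(self) -> dict:
        """Map each channel name to its corrected intensity."""

@dataclass
class TwoColorSpot(Measurement):
    """Spotted-array measurement: foreground and background per color."""
    red_fg: float
    red_bg: float
    green_fg: float
    green_bg: float

    def corrected_signals(self):
        return {"red": self.red_fg - self.red_bg,
                "green": self.green_fg - self.green_bg}

@dataclass
class ProbePair(Measurement):
    """One-color measurement from a perfect-match/mismatch probe pair."""
    perfect_match: float
    mismatch: float

    def corrected_signals(self):
        # Simplification: real analyses combine many probe pairs per gene.
        return {"signal": self.perfect_match - self.mismatch}

def total_signal(measurements):
    """Format-agnostic code: works for any Measurement subclass."""
    return sum(sum(m.corrected_signals().values()) for m in measurements)

spot = TwoColorSpot(red_fg=900, red_bg=100, green_fg=300, green_bg=100)
pair = ProbePair(perfect_match=500, mismatch=120)
```

Supporting a new chip technology then means adding one more subclass. Code such as `total_signal` never needs to know which format it was handed.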
Because microarray projects involve hundreds or thousands of experiments, the system has to organize the information so you don’t get lost. This is really not a microarray issue per se — it comes up in many large-scale projects.
A good way to organize information is to reflect the experimental design. You have a lot of experiments because you’re systematically varying a set of factors. For example, you might take cells from normal versus diseased tissue, apply various doses of a drug, and measure expression at several time points.
If you tell the database the values of the factors tested in each experiment, then users can find data by searching on those factors. For example, a user could retrieve all experiments involving doses less than 10 mg at time points between 30 minutes and four hours.
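That example query might look like the following sketch, which uses Python’s built-in `sqlite3` module and an in-memory database. The `experiment` table, its column names, and the four sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiment ("
    "id TEXT PRIMARY KEY, tissue TEXT, dose_mg REAL, time_min REAL)"
)
conn.executemany(
    "INSERT INTO experiment VALUES (?, ?, ?, ?)",
    [
        ("E1", "normal", 5, 60),
        ("E2", "diseased", 20, 60),
        ("E3", "diseased", 2, 15),
        ("E4", "normal", 8, 240),
    ],
)

def find_experiments(conn, max_dose_mg, min_time_min, max_time_min):
    """Return ids of experiments below a dose and within a time window."""
    rows = conn.execute(
        "SELECT id FROM experiment "
        "WHERE dose_mg < ? AND time_min BETWEEN ? AND ? "
        "ORDER BY id",
        (max_dose_mg, min_time_min, max_time_min),
    )
    return [row[0] for row in rows]

# Doses under 10 mg, time points from 30 minutes to four hours.
hits = find_experiments(conn, 10, 30, 240)
```

With these rows, `hits` contains E1 and E4. E2 is dropped for its 20 mg dose and E3 for its 15-minute time point.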
It’s also useful to organize experiments into a project/sub-project tree similar to the folder structure on Windows or the Mac, so that users can group experiments in ways that transcend the experimental plan.
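A project tree of this kind takes only a few lines to model. In the Python sketch below, the `Project` class and the folder names are hypothetical. The point is that the same experiment can sit in more than one folder and is still counted once when you walk the tree.

```python
from dataclasses import dataclass, field

@dataclass
class Project:
    name: str
    experiments: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def add(self, child):
        """Attach a sub-project and return it for convenience."""
        self.children.append(child)
        return child

    def all_experiments(self):
        """Experiments here and in every sub-project, without duplicates."""
        found = list(self.experiments)
        for child in self.children:
            for experiment in child.all_experiments():
                if experiment not in found:
                    found.append(experiment)
        return found

root = Project("drug study")
root.add(Project("toxicity", ["E1", "E2"]))
root.add(Project("interesting outliers", ["E2", "E4"]))
everything = root.all_experiments()
```

Here `everything` lists E1, E2, and E4, even though E2 was filed under both sub-projects.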
The system also needs a gene index database that connects the genes represented on the chip to biological knowledge about those genes in public or proprietary databases. This is the Field of Genes I wrote about here in October 2000.
A system this complex also needs a master controller to tie it all together. The master provides access to the outputs of the major stages (chip layouts, chips, etc.), as well as the experiment organizer and gene index.
A complete microarray database is a pretty complicated contraption. But it breaks down into logical parts, many of which are pretty generic.
In an ideal world, software developers would build the parts separately and connect them together — an approach called modular design. In the real world, they usually build the whole thing as a gestalt, focusing on the parts that are most pressing for the particular laboratory today.
Next time you’re test-driving a microarray system, don’t just kick the tires and play the audio. Get under the hood and ask the sales guy, “Hey, what kinda LIMS you got in here? Where’s the AIMS? Is your measurement schema object-oriented? Show me the gene index.” And send me a picture of the look on his face.
The ABCs of LIMS and AIMS
Laboratory information management systems vary considerably in sophistication, depending on the scale of the laboratory, variability in tasks to be done, and rate of change.
A LIMS worries about what, when, where, and who. It records what procedures were applied to each sample, any run-specific parameters, and what the results were. It also knows when each procedure was done, where it was done (e.g., on which instrument), and by whom (e.g., which technician).
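In code, the what, when, where, and who boil down to one record per procedure. The Python sketch below is an assumption about how such a record might look; the `LimsEvent` class, its field names, and the sample entries are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LimsEvent:
    sample_id: str
    procedure: str      # what was done
    parameters: dict    # run-specific settings
    result: str         # what came out
    when: datetime
    instrument: str     # where
    technician: str     # who

log = [
    LimsEvent("S1", "labeling", {"dye": "Cy5"}, "ok",
              datetime(2001, 5, 2, 9, 0), "bench 2", "pat"),
    LimsEvent("S1", "RNA extraction", {}, "ok",
              datetime(2001, 5, 1, 14, 0), "bench 1", "lee"),
    LimsEvent("S2", "RNA extraction", {}, "failed",
              datetime(2001, 5, 1, 15, 0), "bench 1", "lee"),
]

def history(log, sample_id):
    """Procedures applied to one sample, oldest first, with who did them."""
    events = [e for e in log if e.sample_id == sample_id]
    return [(e.procedure, e.technician) for e in sorted(events, key=lambda e: e.when)]

s1_history = history(log, "S1")
```

Asking for the history of sample S1 returns the extraction by lee followed by the labeling by pat, regardless of the order in which the events were logged.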
A more sophisticated LIMS can be actively involved in accomplishing laboratory procedures by controlling robots and other instruments, or more simply by providing work orders to personnel.
An even more sophisticated LIMS can be proactive if it has knowledge of the laboratory workflow. The workflow defines the correct sequence of tasks in the process. Armed with knowledge of the workflow, a LIMS can ensure that work is routed correctly from task to task, detect errors (such as giving a plate to the wrong robot), anticipate bottlenecks, and much more. The downside is that it’s a lot of work to define the workflow and enter it into the system, debug it, and keep it up to date when the laboratory process changes.
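A workflow-aware check can be sketched in Python as follows. The `WORKFLOW` table, the station names, and the `check_step` function are hypothetical, and the workflow is assumed to be a straight line with one legal instrument per task, which is simpler than most real laboratories.

```python
# Each entry: (task, instrument it must run on). Assumed linear workflow.
WORKFLOW = [
    ("RNA extraction", "bench"),
    ("labeling", "bench"),
    ("hybridization", "hyb-station"),
    ("scanning", "scanner"),
]

def check_step(completed, task, instrument):
    """Return None if the step is the legal next one, else an error message.

    `completed` is the number of workflow steps already done for the sample.
    """
    if completed >= len(WORKFLOW):
        return "workflow already finished"
    expected_task, expected_instrument = WORKFLOW[completed]
    if task != expected_task:
        return f"expected {expected_task!r}, got {task!r}"
    if instrument != expected_instrument:
        return f"{task!r} belongs on {expected_instrument!r}, not {instrument!r}"
    return None

ok = check_step(2, "hybridization", "hyb-station")
skipped = check_step(2, "scanning", "scanner")
wrong_robot = check_step(3, "scanning", "hyb-station")
```

The first call passes, the second catches a skipped hybridization, and the third catches a chip handed to the wrong machine. Keeping a table like `WORKFLOW` current as the lab changes is exactly the maintenance burden described above.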
A LIMS generally supports a range of administrative functions as well. These include inventory management, monitoring of quality measures, resource planning (i.e., instrument and personnel usage), and reporting.
An analysis information management system is analogous to a LIMS, but tracks computational processes instead of physical ones. An AIMS records what procedures have been run, parameter settings (which are generally numerous), and results. It also knows when each procedure was done and by whom.
A more sophisticated AIMS can be actively involved in accomplishing analyses by directly invoking programs, or by providing work orders to personnel. The latter is necessary for interactive analyses, or when working with programs that lack programmatic interfaces.
An AIMS can be proactive if it is given knowledge of pre-defined analytical procedures. Armed with this knowledge, the system can automatically run the procedures at appropriate times, or direct personnel to do so. It is easiest to represent pre-defined analyses as scripts in a programming language, such as Perl.
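The sketch below shows the idea in Python rather than Perl. A pre-defined procedure is just a list of steps with their parameter settings, and the system records what ran, with which parameters, when, and for whom. The step functions, the `PROCEDURE` list, and the log layout are all assumptions for illustration.

```python
from datetime import datetime, timezone

def normalize(values, scale=1.0):
    """Toy normalization: multiply every value by a constant."""
    return [v * scale for v in values]

def filter_low(values, floor=0.0):
    """Toy filter: drop values below a floor."""
    return [v for v in values if v >= floor]

# A pre-defined analytical procedure: ordered steps plus their parameters.
PROCEDURE = [
    (normalize, {"scale": 0.5}),
    (filter_low, {"floor": 10}),
]

def run_procedure(data, procedure, user, log):
    """Run each step in order and record what, parameters, result, when, who."""
    for step, params in procedure:
        data = step(data, **params)
        log.append({
            "what": step.__name__,
            "parameters": params,
            "result": data,
            "when": datetime.now(timezone.utc),
            "who": user,
        })
    return data

log = []
output = run_procedure([10, 40, 100], PROCEDURE, user="analyst", log=log)
```

After the run, `output` holds the two values that survive the filter (20.0 and 50.0), and `log` holds one entry per step, so anyone can later see exactly how the result was produced.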
An AIMS generally supports a range of other functions, including management of compute-clusters and parallel execution of tasks, resource planning (i.e., computer and personnel usage), and reporting. Two important quality control issues are support for software testing — providing a means to separate test runs from production runs — and configuration management — the ability to track versions of software and roll back to consistent sets of old versions when needed.
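Both quality control ideas come down to tagging each run. In the hypothetical Python sketch below, every run carries a mode (test or production) and the versions of the software it used. The run records, program names, and version strings are invented for the example.

```python
runs = [
    {"id": 1, "mode": "production",
     "versions": {"normalizer": "1.2", "clusterer": "3.0"}},
    {"id": 2, "mode": "test",
     "versions": {"normalizer": "1.3", "clusterer": "3.0"}},
    {"id": 3, "mode": "production",
     "versions": {"normalizer": "1.3", "clusterer": "3.1"}},
]

def production_runs(runs):
    """Software testing support: keep test runs out of production results."""
    return [run["id"] for run in runs if run["mode"] == "production"]

def versions_for(runs, run_id):
    """Configuration management: the consistent version set a run used."""
    for run in runs:
        if run["id"] == run_id:
            return dict(run["versions"])
    raise KeyError(f"no run with id {run_id}")

live = production_runs(runs)
rollback_target = versions_for(runs, 1)
```

Here `live` contains runs 1 and 3, and `rollback_target` names the exact pair of versions to reinstall if you need to reproduce run 1.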
AIMS versus LIMS
LIMS and AIMS are similar from a high-level functional perspective, but differ in many important details. A few key differences: Laboratory tasks generally run for a lot longer than computational tasks — hours versus seconds or minutes. Laboratory processes generally have simple routing, while computational processes can involve complex decision logic. Laboratory procedures change much more slowly than computational ones — even the most innovative laboratory chief can’t turn the laboratory around on a dime, whereas good programmers can change software with the click of a mouse.
People sometimes try to adapt a LIMS for use as an AIMS. This works in simple cases, but breaks down when the going gets rough.
| Standard or database | Sponsor | Status | Data covered | Format | URL |
| GATC | Affymetrix, Consortium | de facto | Affy | proprietary | www.atconsortium.org |
| GEO | NCBI | de facto | analyzed images | tab-delimited text | www.ncbi.nlm.nih.gov/geo/ |
| MIAME/MAML | EBI, MGED Consortium | proposed | all | XML | www.mged.org/Annotations |
| Stanford MicroArray DB | Stanford | | intensity/image | | http://genome-www4.stanford.edu/MicroArray/SMD |