What happens if you invite a bunch of biologists, statisticians, and medical researchers to the same meeting? Even though this was only the first replicate, Cambridge Health Institute’s experiment in Washington D.C. last week appeared to be a success: At times, the conference room was so packed that the participants of the Microarray Data Analysis meeting might have felt like oligos on a gene chip.
Getting the various factions to talk on a daily basis, however, seems not nearly as easy as getting them to listen to each other’s talks: A common complaint heard at the conference was that experimental microarray researchers — having finally discovered statistics for analysis — don’t consult with statisticians early enough.
Following is a summary of tidbits from the meeting.
Amersham: Amplifying Codelink Arrays
Less than two months after snapping up Motorola’s CodeLink arrays, Amersham Biosciences had some news to share that might step up the level of competition in the industry: At the end of his talk praising the performance of the arrays, Phillip Stafford, biostatistician for the CodeLink platform, announced that a number of new arrays are on their way. A new 10K human expression array, whose 10,000 spots will include splice variants and unique genes left off the current human array, is scheduled for release by the end of this year, followed later by 10K mouse and rat arrays. A human 20K array — currently in alpha-testing — could be released as early as the spring of next year, with mouse and rat 20K arrays to follow later in 2003. Finally, depending on the success of the 20K chip, a human 40K array — all spots on one slide — might be ready for market by the end of next year.
Ask Your Statistician (But Don't Wait Until It's Too Late…)
It is crucial to think about statistics even before beginning a microarray experiment, and two speakers used the same 1938 quote from statistician Sir Ronald A. Fisher to make that point: “To consult a statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiments died of.”
Just four years ago, suggesting statistics for a microarray experiment at a meeting could (almost) get you shown the door, recalled Tom Downey, president of St. Charles, Missouri-based Partek. This has since changed, he said, but researchers still don’t always use the right tools for the right purposes. Using exploratory tools like cluster analysis, for example, to determine which genes are differentially expressed is like using a hammer to drive a screw into the wall, Downey said.
In building predictive models for diagnosis or prognosis, he explained, a common mistake is to select marker genes first, using all of the samples, and only then estimate the model's prediction error using cross-validation. This, Downey said, leads to biased error estimates, because the test data have already been used to select the marker genes.
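The bias Downey warns about can be demonstrated on pure noise. The sketch below is an illustration, not Partek's code, and every name in it is made up: it compares leave-one-out error when marker genes are picked once on all samples against error when they are re-picked inside each fold. On data with no real signal, the shortcut's estimate typically comes out far lower than the honest, near-chance one.

```python
import random

random.seed(0)

# Pure-noise data: no gene is truly differential and the labels are
# arbitrary, so an honest error estimate should hover near 50%.
n_samples, n_genes, k = 40, 200, 10
X = [[random.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]

def top_genes(X, y, idx, k):
    """Rank genes by absolute class-mean difference over the samples in idx."""
    scores = []
    for g in range(len(X[0])):
        g0 = [X[i][g] for i in idx if y[i] == 0]
        g1 = [X[i][g] for i in idx if y[i] == 1]
        scores.append((abs(sum(g0) / len(g0) - sum(g1) / len(g1)), g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]

def classify(X, y, train, test, genes):
    """Nearest-centroid call for one held-out sample, using the chosen genes."""
    best_d, pred = None, None
    for c in (0, 1):
        members = [i for i in train if y[i] == c]
        cent = [sum(X[i][g] for i in members) / len(members) for g in genes]
        d = sum((X[test][g] - cent[j]) ** 2 for j, g in enumerate(genes))
        if best_d is None or d < best_d:
            best_d, pred = d, c
    return pred

def loo_error(X, y, k, reselect):
    """Leave-one-out error; reselect=True re-picks genes inside each fold."""
    all_idx = list(range(len(y)))
    fixed = top_genes(X, y, all_idx, k)   # the biased shortcut: uses all samples
    wrong = 0
    for test in all_idx:
        train = [i for i in all_idx if i != test]
        genes = top_genes(X, y, train, k) if reselect else fixed
        wrong += classify(X, y, train, test, genes) != y[test]
    return wrong / len(y)

biased = loo_error(X, y, k, reselect=False)   # genes picked on all samples
honest = loo_error(X, y, k, reselect=True)    # genes picked per fold
print(f"biased estimate: {biased:.2f}, honest estimate: {honest:.2f}")
```

The only structural difference between the two runs is where gene selection happens; that alone separates an optimistic estimate from a realistic one.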
But what needs to come before building models is the actual microarray experiment, and designing it well. Excluding sources of variability is paramount to good results, according to Jay Tiesman, who leads the genomics group at Procter & Gamble’s Miami Valley Laboratories, and is well worth the time and effort: “We have learned a lot from some really bad and really expensive experiments,” he said. In fact, Tiesman’s company has started forming an “experimental design team” that meets before an experiment is even planned. “Nobody is allowed to go on his own,” he said.
But how to detect variation that is not due to a biological effect? Profiling the response of individual probes on an Affymetrix chip is one solution, according to Jim Veitch, president of Corimbia. The Berkeley, California-based company has developed software called Probe Profiler that, he said, detects and normalizes for quality issues such as bad probes and cross-hybridization effects, as well as mRNA quality, saturation, scanner problems, and chip defects.
Grier Page from the University of Alabama at Birmingham's Microarray Research Data Analysis Clearinghouse outlined the need for statistics throughout a microarray experiment. He presented statistical techniques for tasks such as class discovery, class prediction, and class discrimination, especially for small sample sizes. Contradicting “the myth that microarrays have no hypothesis,” he stressed that “there always needs to be a biological question in the experiment,” even if it is as “nebulous” as asking which genes will go up or down in response to a drug. Though microarray experiments seem to be especially plagued by variation, he said, traditional methods like Northern, Southern, or Western blots are likely to show just as much variation. But since these techniques do not generate massive amounts of data, this might have gone undetected so far, Page noted.
Experimental quality issues are not the only problems that may affect the results of experiments. Willy Valdivia Granda from the Plant Stress Genomics and Bioinformatics Group at North Dakota State University pointed out that microarray experiments frequently pick up pseudogenes. Analyses in Arabidopsis suggest that about eight percent of its genome could consist of pseudogenes and that their products, when hybridizing to microarrays, add a considerable source of noise. Removing those pseudogenes from the analysis increased the prediction accuracy of protein-protein interaction algorithms, Valdivia Granda reported.
Deconstructing the Data
A number of researchers presented improved analysis tools for microarray experiments. Li Liu from the Statistical Research Center at Pfizer explained a clustering method that progressively takes microarray data apart, permuting rows and columns of the matrix to find general patterns and specific sample-by-gene effects. This so-called robust singular value decomposition (SVD) analysis is well suited to dealing with missing values, outliers, and non-normal data distributions, she said.
The lower the level of a gene's expression in a sample, the lower its signal in a chip experiment — and the greater the relative variance of that signal. However, most standard statistical methods don’t take the heterogeneity of variance in gene expression experiments into consideration, according to David Rocke, a professor of biostatistics from the University of California, Davis. In his talk, he presented a class of transformations especially suitable for microarray data that stabilize this variance, allowing common statistical methods to be used. These methods could also be applied to other types of biological data, such as protein mass spectrometry data, Rocke said, mentioning that a first software version is currently in alpha-testing and would be available from his website within a few weeks.
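Transformations of this kind are commonly known as generalized-log (glog) functions. The sketch below shows the basic form, not Rocke's forthcoming software; the tuning constant c would in practice be estimated from the chip's additive noise, and the default used here is purely illustrative. Unlike a plain logarithm, the glog stays defined and roughly linear near zero while matching log behavior at high intensities, which is what keeps the variance of low signals from being blown up.

```python
import math

def glog(x, c=100.0):
    """Generalized log: behaves like log(x) for large x, but is finite and
    roughly linear near zero, so low-intensity noise stays bounded.
    The constant c is illustrative; it would be fit to the chip's noise."""
    return math.log((x + math.sqrt(x * x + c)) / 2.0)

# At high intensity the glog tracks the ordinary log ...
print(glog(1e6), math.log(1e6))

# ... but, unlike log, it is defined at zero and even for the slightly
# negative values that background subtraction can produce.
print(glog(0.0), glog(-3.0))
```

Because the transformed values have approximately constant variance across the intensity range, standard methods such as t-tests or ANOVA can then be applied without low-expression genes dominating the noise.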
Model Your Expression
On a similar note, Xuemin Fang from the department of biostatistics at Harvard University’s School of Public Health talked about a new three-step modeling procedure for analyzing oligonucleotide microarray data, called MBEI for model-based expression index, that she said handles low-level gene expression better than other methods. The procedure models cross-hybridization for each individual probe on the array; once probe sequence information has been integrated into the algorithm, she said, it will be implemented in dChip, the program developed by Harvard School of Public Health professor Wing Wong.
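The core idea behind model-based expression indices is to treat each probe's signal as an expression level times a probe-specific affinity. MBEI's actual procedure is not detailed above, so the following is only a generic alternating-least-squares sketch of such a multiplicative model, with all names hypothetical: each entry Y[i][j] of a gene's probe matrix (chips by probes) is modeled as theta[i] * phi[j].

```python
import math

def fit_probe_model(Y, iters=100):
    """Alternating least squares for Y[i][j] ~= theta[i] * phi[j], where
    theta[i] is the expression index on chip i and phi[j] the affinity of
    probe j. A generic sketch, not MBEI's own fitting procedure."""
    I, J = len(Y), len(Y[0])
    phi = [1.0] * J
    for _ in range(iters):
        # update chip expression indices given the probe affinities
        ss_phi = sum(p * p for p in phi)
        theta = [sum(Y[i][j] * phi[j] for j in range(J)) / ss_phi for i in range(I)]
        # update probe affinities given the expression indices
        ss_th = sum(t * t for t in theta)
        phi = [sum(Y[i][j] * theta[i] for i in range(I)) / ss_th for j in range(J)]
        # fix the scale (sum of phi^2 = J) so theta and phi are identifiable
        s = math.sqrt(sum(p * p for p in phi) / J)
        phi = [p / s for p in phi]
        theta = [t * s for t in theta]
    return theta, phi

# Noise-free rank-one demo: the fit recovers the generating values.
true_theta = [1.0, 2.0, 4.0]            # expression on 3 chips
true_phi = [0.5, 1.5, 1.0, 0.8]         # 5 probe affinities ...
true_phi.append(math.sqrt(5.0 - sum(p * p for p in true_phi)))  # sum phi^2 = 5
Y = [[t * p for p in true_phi] for t in true_theta]
theta, phi = fit_probe_model(Y)
```

A probe that consistently misbehaves shows up as a poor fit to this model across chips, which is what lets per-probe effects such as cross-hybridization be flagged and down-weighted.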
Using unsupervised classification tools to find, for example, novel tumor subclasses in microarray data has its limits due to the inherent noise of gene expression matrices, cautioned Zoltan Szallasi from the Children’s Hospital’s Informatics Program at Harvard Medical School. If a given tumor is caused by fewer than a certain number of correlated genes, then unsupervised analysis will in fact never be able to find this, he said. Another problem with such studies is that gene expression in each tumor is often only measured once, making it impossible to apply standard statistical analysis. To overcome this problem, he and a colleague have introduced an information theoretic approach, he said. Szallasi also presented a set of simulation-based tools that can generate random gene expression data, while retaining the original data structure that reflects, for example, the overall level of gene co-regulation. These methods, he said, provide a more accurate estimate of whether a given result extracted from massively parallel measurements has appeared by chance.
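Szallasi's own simulation tools are not described in detail here, but the simplest member of this family of chance-estimation methods is a label-permutation test: the expression matrix, and with it all gene co-regulation, is left untouched, and only the sample labels are shuffled to ask how often a result as strong as the observed one arises by chance. A minimal sketch, with hypothetical names and toy numbers:

```python
import random

random.seed(2)

def class_diff(row, labels):
    """Absolute difference of the two class means for one gene."""
    a = [v for v, l in zip(row, labels) if l == 0]
    b = [v for v, l in zip(row, labels) if l == 1]
    return abs(sum(a) / len(a) - sum(b) / len(b))

def permutation_p(data, labels, n_perm=1000):
    """Shuffle the sample labels, keeping the expression matrix (and hence
    all gene co-regulation) intact, and count how often the best class
    separation across genes matches or beats the observed one."""
    observed = max(class_diff(row, labels) for row in data)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        random.shuffle(shuffled)
        if max(class_diff(row, shuffled) for row in data) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one avoids a p-value of zero

# Toy matrix: 3 genes (rows) x 6 samples; the first gene tracks the classes.
data = [
    [9.8, 9.5, 10.1, 0.2, 0.4, 0.1],
    [1.2, 0.9, 1.1, 1.0, 1.3, 0.8],
    [3.3, 2.9, 3.1, 3.0, 3.4, 2.8],
]
labels = [0, 0, 0, 1, 1, 1]
p = permutation_p(data, labels)
```

With only six samples few label shuffles are possible, so even a cleanly separating gene cannot reach a very small p-value; this is exactly the small-sample limitation the speakers kept returning to.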
Turning from experimental analysis to data interpretation, Korkut Vata from the department of pathology at Duke University showed how Blast, the basic DNA sequence similarity search tool, could be extended to microarray data. His MA-Blast, a data-mining approach for finding concordant expression profiles across different laboratory platforms, compares collections of gene expression profiles in silico. But problems remain: There is no “universal currency” for microarray data yet.
One impediment to comparing microarray data across array platforms, indeed, is a lack of calibration standards for different scanners. Clondiag Chip Technologies, a startup company from Jena, Germany, has developed a chip, called FluorIS, that could become such a standard. The chip has features of defined shape, size, and fluorescence intensity.
But what to do with a list of hits from one or several microarray experiments? John Weinstein, a senior research investigator at the National Cancer Institute, provided a possible answer: GEEVS, short for Genome Exploration and Visualization System, a program package he and his colleagues at the NCI’s Genomics and Bioinformatics group developed to help integrate information from various databases, including gene expression. Among the tools already available to researchers are MedMiner for searching and organizing the biomedical literature, LeadMiner for correlating chemical structures with microarray expression data (described in the August 2002 issue of The Pharmacogenomics Journal), and MatchMiner for identifying gene names correctly. A new tool, GoMiner, for discovering and visualizing molecular interactions, will likely be available this week at http://discover.nci.nih.gov, Weinstein said.
Take the Onto-Express
Onto-Express, presented by Sorin Draghici, director of the Intelligent Systems and Bioinformatics Laboratory at Wayne State University, is another tool that helps make sense of gene hits. Onto-Express accepts lists of genes found to be differentially regulated and finds the biological pathways in which they are involved. However, as Draghici pointed out, the significance of a certain functional group depends on the representation of that function on the array. In order to allow the user to distinguish between real and random phenomena, Onto-Express calculates confidence parameters for each functional category. The software is freely available at http://vortex.cs.wayne.edu/projects.html. Moreover, the site provides tools for the design of custom microarrays based on gene functions (Onto-Design) and for the selection of the most suitable commercial microarrays for a given experiment (Onto-Select).
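The article does not spell out how Onto-Express computes its confidence parameters, but tools of this kind commonly score a functional category with a one-sided hypergeometric test: given how many genes on the array carry the annotation, how surprising is the number that turned up in the hit list? A sketch with purely illustrative numbers:

```python
from math import comb

def enrichment_p(hits_in_cat, hits, cat_size, array_size):
    """P(X >= hits_in_cat) when drawing `hits` genes at random, without
    replacement, from an array of `array_size` genes of which `cat_size`
    carry the annotation (one-sided hypergeometric test)."""
    denom = comb(array_size, hits)
    p = 0.0
    for x in range(hits_in_cat, min(hits, cat_size) + 1):
        p += comb(cat_size, x) * comb(array_size - cat_size, hits - x) / denom
    return p

# Illustrative numbers: 50 differential genes on a 5,000-gene array,
# 8 of them in a 100-gene category (expected by chance: 50 * 100 / 5000 = 1).
p = enrichment_p(8, 50, 100, 5000)
```

This is also where Draghici's caveat bites: the same 8 hits would be far less impressive if the category covered 2,000 of the array's genes, which is why the representation of each function on the array has to enter the calculation.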
Finally, Ugis Sarkans from Alvis Brazma's group at the European Bioinformatics Institute provided an update on microarray standards and ArrayExpress, EBI’s public repository for microarray data. So far, the database contains about ten data sets, he said, but several more are being prepared for addition. An updated version of MAGE, the object model developed for the exchange of gene expression data, will be presented at a meeting of the Object Management Group in Helsinki at the end of this month, he said, and MAGE may become a “stable specification.” At ArrayExpress, curation tools for managing and developing ontologies, tracking submissions, assigning accession numbers, and other purposes are currently under development in preparation for increased data traffic.