How can you normalize when things aren’t normal? The general form of this question is plaguing a lot of people right now, but microarray researchers have been struggling with it in a more specific way for quite some time.
At the recent Cambridge Healthtech Institute Microarray Analysis conference in Alexandria, Va., biologists and statisticians came together to discuss solutions to this issue and others relating to microarray statistics and analysis. Following are some highlights, tidbits, and asides from the conference:
Affy Algorithm Discussed, Critiqued
Affymetrix’s new algorithm, which it is planning to release in version 5.0 of its data analysis software, outperforms the much criticized algorithm in version 4.0, said Tarif Awad, the company’s data analysis team manager for genomics collaborations.
The company plans to ship the new algorithm with version 5.0 by the end of the year, or if not, "certainly by the end of March," Awad said.
An audience member asked what the company thought about spiked controls for Affymetrix chips. There are discussions at Affymetrix, Awad responded, “about whether we would like to create a kit with calibrated spikes.” This kit would enable researchers to estimate the total amount of RNA they are working with and to have a universal control reference.
Asked about alternative Affymetrix array analysis methods developed by outside scientists such as Wing Wong at Harvard, Awad responded favorably. "We would like to see them. We encourage them," he said.
Flattening out the Banana
Many normalization methods for microarray data fail to correct one major problem: the data’s banana shape. Like this tropical fruit, the data, when spread out on a scatter plot, curls up at the ends. This means that high-intensity and low-intensity data can become skewed and less reliable. And all it takes is a small number of really bright spots to skew normalization of data, said Jason Goncalves, chief scientific officer for San Diego-based Iobion Informatics.
While anyone who has experienced spot saturation or struggled with determining low-end thresholds for fold changes knows that problems can occur at the ends of the data spectrum, Goncalves proposed a statistical solution: the Lowess method. This method of normalization, adapted for microarrays by statistician Terry Speed at the University of California, Berkeley, flattens out the banana on the scatter plot by normalizing by sub-grid, and corrects for both skewed intensity and expression ratio plots, Goncalves said.
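The Lowess approach fits a smooth curve to the log-ratio-versus-intensity ("MA") plot and subtracts it, so the banana straightens out around zero. As a rough illustration only — Speed's method fits a proper locally weighted regression per sub-grid, while this dependency-free sketch substitutes a simple running median for the smoother — the idea looks like this:

```python
import numpy as np

def ma_normalize(red, green, window=101):
    """Intensity-dependent normalization on the MA plot.

    M = log2(red/green) is the expression log ratio; A is the average
    log intensity. A real implementation fits a lowess curve of M on A
    (per sub-grid, in Speed's adaptation); here a running median over
    the A-sorted M values stands in for the smoother, as a sketch.
    """
    m = np.log2(red) - np.log2(green)          # log ratio (the "banana")
    a = 0.5 * (np.log2(red) + np.log2(green))  # average log intensity
    order = np.argsort(a)                      # walk along the A axis
    m_sorted = m[order]
    trend = np.empty_like(m)
    half = window // 2
    for i in range(len(m)):
        lo, hi = max(0, i - half), min(len(m), i + half + 1)
        trend[order[i]] = np.median(m_sorted[lo:hi])
    return m - trend  # log ratios with the intensity trend removed
```

After subtraction, genes at the bright and dim ends of the plot are judged against the local trend rather than a single global average, which is what flattens the curl at the ends.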
FDR and the New Deal in Error Correction
Thomas Downey, CEO of Partek, gave an overview of statistical methods for microarrays, pointing out an alternative to traditional multiple-testing corrections such as the Bonferroni adjustment, which was designed for a handful of comparisons, not for tens of thousands of genes. Instead of applying a correction such as Bonferroni or Sidak, Downey and others discussed using a method known as False Discovery Rate (FDR). Rather than guarding against even a single false positive, FDR controls the expected proportion of false positives among the genes called significant, making it substantially less stringent; it could also be used to measure the relative efficacy of microarray technologies in terms of their false positive rates.
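The contrast between the two philosophies is easy to see side by side. The sketch below pairs a Bonferroni cutoff with the Benjamini-Hochberg step-up procedure, the standard way of controlling the FDR (the talk named FDR generically, so taking Benjamini-Hochberg as the concrete procedure is this writer's assumption):

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject only p-values below alpha/m: guards against even one
    false positive, very conservative with thousands of genes."""
    p = np.asarray(pvals)
    return p < alpha / len(p)

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up FDR procedure: controls the expected
    fraction of false positives among the genes called significant."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m   # alpha * i/m for rank i
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    idx = np.flatnonzero(below)
    if idx.size:                               # largest i with p_(i) <= alpha*i/m
        reject[order[: idx.max() + 1]] = True
    return reject
```

Because the Benjamini-Hochberg threshold for the i-th smallest p-value is alpha·i/m rather than a flat alpha/m, everything Bonferroni rejects is also rejected by FDR, and usually a good deal more — which is exactly the trade-off the speakers were weighing.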
While such a flexible metric as the FDR might raise fears that microarray analysis is too tolerant of false positives, Robert Nadon, director of informatics at Imaging Research, stressed the need to control them. "This used to be controversial," he said. But now, most researchers appreciate the need to apply rigorous statistics to microarrays.
Downey had discussed the t-test (the difference between two group means divided by the standard error, the square root of the estimated variance over the sample size) as a reliable, back-to-basics approach for statistically verifying microarray data. Nadon, however, argued against applying this test gene by gene: the t-test must estimate each gene's variance from only a few replicates, even though a single microarray contains thousands of data points from which variance can be pooled. Instead, he suggested a similar "z test" in which the variance is not estimated per gene but derived from a pooled measurement of the variances of all the genes in the same experimental condition (the difference of the means is then divided by the square root of each pooled variance over the number of replicates in each group).
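The two statistics differ only in where the denominator comes from, and that is easiest to see in code. This is a sketch of the general idea as described in the talk, not Nadon's actual implementation; the function names and the simple mean-of-variances pooling are this writer's assumptions:

```python
import numpy as np

def per_gene_t(x, y):
    """Per-gene two-sample t statistic: x and y are (genes, replicates)
    arrays. The variance is estimated from the few replicates of each
    gene alone, which is unstable with only 2-3 replicates."""
    nx, ny = x.shape[1], y.shape[1]
    vx = x.var(axis=1, ddof=1)
    vy = y.var(axis=1, ddof=1)
    se = np.sqrt(vx / nx + vy / ny)
    return (x.mean(axis=1) - y.mean(axis=1)) / se

def pooled_z(x, y):
    """The "z test" in the spirit Nadon described: the variance is not
    estimated gene by gene but pooled over all genes measured in the
    same condition, exploiting the thousands of genes on one array."""
    nx, ny = x.shape[1], y.shape[1]
    vx = x.var(axis=1, ddof=1).mean()  # pooled variance, condition 1
    vy = y.var(axis=1, ddof=1).mean()  # pooled variance, condition 2
    se = np.sqrt(vx / nx + vy / ny)
    return (x.mean(axis=1) - y.mean(axis=1)) / se
```

With two or three replicates per group, the per-gene variance estimate is so noisy that genes with accidentally tiny variance produce huge t values; the pooled denominator is essentially a constant, so it does not manufacture those spurious extremes.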
Affy’s Loyal Customer Disses its Statistics
Gene Logic, which is among the biggest customers for Affymetrix GeneChips, if not the biggest one, discussed how it has developed its own techniques for reading and normalizing Affy data. Showing a slide of a microarray with a large bright band across the lower third of spots, Gene Logic group leader of data analysis Michael Elashoff pointed out how some chips can have what he called "haze bands" across them, and how this pattern — obviously an artifact of some mistake in the manufacture or hybridization process — should be recognized so the data from that band can be eliminated. Other chips, he said, have what he calls "crop circles," rings of darkness in a defined pattern that can indicate contamination or another problem.
For normalization, Gene Logic spikes in controls to the sample, a method that Elashoff said “may be better than the Affymetrix method” for normalization. The use of housekeeping genes, another method for within-chip normalization that Gene Logic tried for its toxicogenomics program with rat chips, did not work, according to Elashoff, because it was impossible to find a gene that was not somehow regulated by any toxic compound.
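Gene Logic did not spell out its procedure, but the general spike-in idea is simple: rescale each chip so its calibrated control probes hit a common reference level, then compare everything else on that scale. The function below is a hypothetical sketch of that general approach (the name, the reference level, and the mean-based scaling are all this writer's assumptions, not Gene Logic's method):

```python
import numpy as np

def spike_scale(signal, spike_idx, reference=1000.0):
    """Rescale one chip's intensities so that its spiked-in control
    probes (at positions spike_idx) average to a fixed reference level.
    Hypothetical sketch of spike-based between-chip normalization."""
    factor = reference / signal[spike_idx].mean()
    return signal * factor
```

Because the spikes are added to the sample at known amounts, they anchor the scale independently of the biology — which is exactly why they dodge the housekeeping-gene problem Elashoff described: no endogenous gene can be trusted to stay flat across every toxic compound.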
In the area of toxicogenomics, Gene Logic has found, however, that small fold changes can be useful markers when replicate experiments confirm them. (See similar comment in Lab Report, p. 10) Consequently, the company does not use Affymetrix's absent and present calls as a basis for deciding which genes are up- or down-regulated by a certain compound because the Affymetrix method is too conservative, Elashoff said.
Combining Data from cDNA and Affymetrix Arrays
When it comes to comparing microarray experiments across different platforms, researchers can feel like they’re back in the era when Macintosh files didn’t translate into the IBM platform. Fold change on an Affymetrix chip is not the same, for example, as fold change on a cDNA array.
A poster presented at the conference by Hayward, Calif., bioinformatics startup X-mine proposed to remedy this problem using filtration and normalization algorithms that minimized inter-chip fluctuations. The group first used the algorithms to normalize pooled data from two different Affymetrix chips. Once normalized, the samples then showed a hierarchical clustering pattern that reflected their biological function in a cell rather than the array of their origin. Next, the group compared the pooled Affymetrix data to that of a cDNA array platform. For more information, go to www.x-mine.com, or e-mail [email protected]
NCI Releases Free Data Mining Software
Watch out GeneSpring: Bioinformatics wizards at the National Cancer Institute's Laboratory of Experimental and Computational Biology have developed a new data mining program for cDNA arrays that is entirely free and available for download.
The program can be run as a stand-alone application on a desktop computer or as a Java applet in a user's web browser. It includes basic visualization tools for microarray data, including scatter plots, histograms, and expression profile plots, as well as more advanced clustering analysis tools including similar clones, K-means, hierarchical clusters, and others. The program also allows users to filter genes in various ways, including selecting various types of clones, or setting up tests that the clones must pass in order to be considered. The program is linked to other public genomic databases. It can be accessed at www.lecb.ncifcrf.gov/MAExplorer.