Now that microarray technology has diffused throughout the research community, many are struggling to extract meaning from the volumes of data being produced. At last week’s Cambridge Healthtech Institute Macroresults through Microarrays 3 meeting in Boston, leading statisticians presented tools they are developing to attack this biological behemoth. Following is a summary of some key talks.
Schering Dredges the Deep River of Data
The level of disagreement among statisticians in microarray informatics is so high that, “if you were able to pull some of the great Talmudic scholars of ancient times in a room to discuss it, they would be right at home,” said Jonathan Greene, director of bioinformatics in the Schering-Plough Research Institute’s department of chemotherapy and molecular genetics.
Schering-Plough has addressed the statistical and logistical challenges presented by microarray data by developing a database system it calls “Deep River.” Deep River uses “basic vanilla tools” for analysis, because “our experiments are complicated enough on the back end,” Greene said. Instead, the system focuses on data integration, and includes a data integration checklist designed to link connected sets of data from different researchers.
In another project, Greene’s group is using microarrays to find transcription factor binding sites. They have analyzed gene sequences to select promoter regions with potential binding sites, then printed them onto microarrays. Experiments using these arrays yielded numerous likely binding sites. For example, the group used sequence analysis to identify 22,000 promoter regions that potentially served as P53 binding sites, then printed probes for these regions on an array. When they performed hybridizations, they found 4,500 candidate sites. The group then chose 14 of these at random and validated them with RT-PCR. They found that 12 responded to P53.
Classification Algorithms Exposed!
Lately, Todd Golub of the Whitehead Institute and Dana-Farber Cancer Center has led the pack in using microarrays as novel classification tools for cancers. A presentation at the conference revealed that one of Golub’s secret weapons in this effort is postdoc Sayan Mukherjee.
In the presentation, Mukherjee laid bare the statistical underpinnings of algorithms that Golub’s group has been using in its microarray-based experiments.
First, Mukherjee said the algorithms must be designed to account for a feature of microarray analysis that he termed “the curse of dimensionality”: Each dataset includes a small number of samples but high dimensionality per sample, given that each array carries 7,000 to 16,000 probes.
Support Vector Machines (SVMs) are a good choice for analyzing microarray data, according to Mukherjee, because they are stable even in high-dimensional datasets. In graphic illustrations, he showed how SVMs at their most basic level involve forming a linear decision boundary between two groups of objects. Because of this “line in the sand” characteristic, the SVM algorithm allows the user to calculate confidence values for rejection of classification, he said. (SVM decision boundaries can also be extended into three dimensions.)
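To make the “line in the sand” idea concrete, here is a minimal sketch of a linear SVM with a reject option. It is an illustration only, not the code used by Golub’s group: the training routine (plain subgradient descent on the regularized hinge loss), the hyperparameters, the toy data, and the rejection threshold of 0.5 are all assumptions chosen for this example.

```python
# Minimal linear SVM sketch: full-batch subgradient descent on the
# L2-regularized hinge loss. Hyperparameters are illustrative assumptions.

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """X: list of feature lists; y: labels in {-1, +1}. Returns (w, b)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]      # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                   # sample violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def decision_value(w, b, x):
    """Signed score; its magnitude grows with distance from the boundary."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def classify_with_rejection(w, b, x, threshold=0.5):
    """Return +1 or -1, or None when the sample is too close to call."""
    score = decision_value(w, b, x)
    if abs(score) < threshold:               # hypothetical rejection cutoff
        return None
    return 1 if score > 0 else -1

# Toy two-"gene" expression profiles for two classes (made-up numbers).
X = [[2.0, 2.5], [2.5, 2.0], [3.0, 3.0],
     [-2.0, -2.5], [-2.5, -2.0], [-3.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)

print(classify_with_rejection(w, b, [3.0, 3.0]))    # far from the boundary
print(classify_with_rejection(w, b, [0.1, -0.1]))   # near the boundary
```

Samples that fall far from the boundary receive a class label, while samples that land close to it are returned as `None`, which is one simple way to turn distance from the boundary into a decision to withhold a call.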
Mukherjee also discussed recursive feature elimination, which involves solving a support vector machine problem for the weight vector, ranking the elements of that vector by their absolute value, then discarding the genes that correspond to the elements with the smallest absolute magnitude. In a linear SVM, this means the genes that are closest to the line dividing the two sets are eliminated. This process is then repeated on the reduced gene set to obtain progressively higher confidence levels of classification.
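The elimination loop can be sketched in a few lines. This is a hedged illustration rather than the group’s implementation: the small linear trainer, the choice to drop one gene per round, the stopping count `n_keep`, and the four-gene toy dataset are all assumptions made for the example.

```python
# Recursive feature elimination sketch: train a linear classifier, drop the
# gene with the smallest |weight|, and repeat on the reduced gene set.

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Subgradient descent on the regularized hinge loss (illustrative)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def recursive_feature_elimination(X, y, n_keep):
    """Return the original column indices of the n_keep surviving genes."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_sub = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        # Position (within the active list) of the smallest absolute weight.
        weakest = min(range(len(active)), key=lambda k: abs(w[k]))
        del active[weakest]
    return active

# Made-up data: genes 0 and 1 separate the classes, genes 2 and 3 are noise.
X = [[2.0, 1.0, 0.1, -0.2],
     [2.2, 0.8, -0.1, 0.1],
     [1.8, 1.2, 0.2, 0.0],
     [-2.0, -1.0, 0.1, 0.2],
     [-2.1, -0.9, -0.2, -0.1],
     [-1.9, -1.1, 0.0, 0.1]]
y = [1, 1, 1, -1, -1, -1]

print(recursive_feature_elimination(X, y, n_keep=2))
```

On real arrays with thousands of probes, removing one gene per round would be slow, so an implementation would more plausibly discard a batch of genes at each step; the single-gene version is used here only because it is the easiest to read.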
Another technique, the leave-one-out procedure, involves removing one point from an SVM training set, training the classifier on the remaining points, and testing it on the point left out. This process is repeated for every point in the training set and is a way of obtaining an unbiased subset of features.
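The hold-out loop itself is independent of the classifier, as the following sketch shows. To keep the example short, a nearest-centroid classifier stands in for the SVM; that substitution, the function names, and the toy data are assumptions for illustration and are not drawn from the talk.

```python
# Leave-one-out sketch: hold out each sample in turn, train on the rest,
# and count how often the held-out sample is misclassified.

def train_centroids(X, y):
    """Stand-in classifier (not an SVM): one mean profile per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_centroid(model, x):
    """Assign x to the class whose centroid is nearest."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: sq_dist(model[label]))

def leave_one_out_error(X, y, train, predict):
    """Fraction of samples misclassified when each is held out once."""
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]          # every sample except sample i
        y_train = y[:i] + y[i + 1:]
        model = train(X_train, y_train)
        if predict(model, X[i]) != y[i]:
            errors += 1
    return errors / len(X)

# Made-up, well-separated expression profiles.
X = [[2.0, 2.1], [1.8, 2.4], [2.2, 1.9],
     [-2.0, -1.8], [-2.3, -2.1], [-1.9, -2.2]]
y = [1, 1, 1, -1, -1, -1]
clean_error = leave_one_out_error(X, y, train_centroids, predict_centroid)

# The same data plus one sample carrying the "wrong" label.
noisy_error = leave_one_out_error(X + [[2.0, 2.0]], y + [-1],
                                  train_centroids, predict_centroid)
print(clean_error, noisy_error)
```

Because the held-out sample never influences the model that classifies it, the error count is not flattered by the classifier having already seen the test point.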
Mukherjee is now working to apply the classification framework to survival algorithms, look at other algorithms, extract sub-taxonomies with independent gene sets, and use meta-genes as classifiers.
In a related talk, “Appropriate statistical methods for coping with error in DNA microarray measurements,” Trey Ideker, a computational biology fellow at the Whitehead Institute, discussed two statistically based tests for identifying differentially expressed genes: Variability and Error Assessment (VERA) and Significance of Array Measurement (SAM). The VERA model uses all of the genes in the experiment and includes a measurement for multiplicative error, which decreases with intensity, and additive error, which increases with intensity. After using VERA to develop an error model for a particular data set, users can use SAM to determine a significance score for each gene. Software incorporating VERA and SAM is available at http://www.systemsbiology.org/VERAandSAM/.
“Lowess” Common Denominator
Jason Gonçalves, chief scientific officer of Iobion Informatics, discussed common methods of data normalization, including intensity averaging, ratio averaging, and the Lowess normalization.
The first uses the ratio between the total intensity values of the two dyes as the normalization factor, while the second determines log ratios, then averages them to determine the normalization factor. But neither one accounts for intensity-dependent effects: high-intensity spots drive intensity normalization, and ratio normalization can be overly sensitive to low-intensity spots. The Lowess normalization, by contrast, applies an intensity-dependent function to the data, Gonçalves said.
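The difference between the three approaches shows up clearly on synthetic data. The sketch below is not Iobion’s software: the function names, the base-2 logarithms, the smoothing span `frac`, and the simplified Lowess (a single tricube-weighted local linear fit with no robustness iterations) are all assumptions made for illustration.

```python
import math

def total_intensity_factor(red, green):
    """Intensity averaging: ratio of the summed signal in the two channels."""
    return sum(red) / sum(green)

def mean_log_ratio(red, green):
    """Ratio averaging: the average log2(red/green) over all spots."""
    return sum(math.log2(r / g) for r, g in zip(red, green)) / len(red)

def lowess_curve(a, m, frac=0.5):
    """Simplified Lowess: tricube-weighted local linear fit of m against a."""
    k = max(2, math.ceil(frac * len(a)))     # neighbors used in each fit
    fitted = []
    for a0 in a:
        # Bandwidth = distance to the k-th nearest point (self included).
        h = max(sorted(abs(ai - a0) for ai in a)[k - 1], 1e-12)
        wts = [(1 - min(abs(ai - a0) / h, 1.0) ** 3) ** 3 for ai in a]
        sw = sum(wts)
        mean_a = sum(w * ai for w, ai in zip(wts, a)) / sw
        mean_m = sum(w * mi for w, mi in zip(wts, m)) / sw
        var = sum(w * (ai - mean_a) ** 2 for w, ai in zip(wts, a))
        cov = sum(w * (ai - mean_a) * (mi - mean_m)
                  for w, ai, mi in zip(wts, a, m))
        slope = cov / var if var > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted

# Synthetic spots whose log ratio M drifts with log intensity A: M = 0.5*A - 4.
A = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
M = [0.5 * a - 4.0 for a in A]
red = [2 ** (a + m / 2) for a, m in zip(A, M)]
green = [2 ** (a - m / 2) for a, m in zip(A, M)]

m_obs = [math.log2(r / g) for r, g in zip(red, green)]
a_obs = [0.5 * math.log2(r * g) for r, g in zip(red, green)]

# A single global factor leaves the intensity-dependent trend in place...
global_resid = [m - mean_log_ratio(red, green) for m in m_obs]
# ...whereas subtracting the fitted curve removes it.
lowess_resid = [m - f for m, f in zip(m_obs, lowess_curve(a_obs, m_obs))]

print(max(abs(v) for v in global_resid), max(abs(v) for v in lowess_resid))
```

In this constructed case the dye bias is a straight line in log intensity, so the local fit removes it almost exactly, while the global correction only shifts the whole cloud and leaves the dimmest and brightest spots well away from zero.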
Sometimes, additional steps are needed to normalize data for spatial effects, he added. These spatial effects, in which a portion of the array appears to have a streak of high- or low-intensity spots, can be dealt with by dividing the array into subarrays made by each individual spotting pin, and normalizing each subarray separately, said Gonçalves. In a comparative assessment of the three microarray normalization methods using both global and subarray-based approaches, Gonçalves found that the subarray Lowess approach resulted in the lowest p-values. “The take home message is that you clearly want to use the subarray-based Lowess normalization when there is an intensity-dependent effect, [but] you can use it in all cases,” said Gonçalves.
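The subarray idea amounts to grouping spots by the pin that printed them and running the normalizer once per group. The sketch below assumes a hypothetical `pin_ids` list recording each spot’s pin, and it uses simple mean-centering as a stand-in normalizer to stay short; a Lowess fit could be passed in its place.

```python
from collections import defaultdict

def center(values):
    """Stand-in normalizer (not Lowess): subtract the mean log ratio."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def normalize_by_subarray(log_ratios, pin_ids, normalizer=center):
    """Apply `normalizer` separately to the spots printed by each pin."""
    groups = defaultdict(list)
    for index, pin in enumerate(pin_ids):
        groups[pin].append(index)
    result = [0.0] * len(log_ratios)
    for indices in groups.values():
        normalized = normalizer([log_ratios[i] for i in indices])
        for i, value in zip(indices, normalized):
            result[i] = value                # write back in original order
    return result

# Made-up log ratios: the spots from pin 1 carry a streak of about +1.
log_ratios = [0.1, -0.1, 0.0, 1.1, 0.9, 1.0]
pin_ids = [0, 0, 0, 1, 1, 1]

by_pin = normalize_by_subarray(log_ratios, pin_ids)
globally = center(log_ratios)
print(by_pin)
print(globally)
```

Global centering spreads the streak across the whole array, leaving pin 0’s spots too low and pin 1’s too high, whereas the per-pin version brings both groups back to a common baseline.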