The list of available clustering methods for gene expression analysis seems to be growing as quickly as microarrays are churning out new data. The latest addition to the clustering toolkit comes from Jyotsna Kasturi, a computer scientist at Penn State University, and her colleagues, who have applied an approach commonly used in signal processing to microarray analysis.
The approach, called KL clustering, is based on the Kullback-Leibler divergence, a dissimilarity measure that offers an alternative to the Pearson correlation commonly used to assess similarity between data points in hierarchical clustering methods for microarray analysis. While the KL divergence is a “fairly standard technique” for signal processing and other data mining applications, “it has not really been used for clustering like this,” Kasturi said.
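For readers who want to see how the two measures differ, the sketch below compares them on toy profiles. It is an illustration only, not the Penn State code: the function names are hypothetical, and rescaling each non-negative profile so that it sums to one is an assumption made here, since the article does not say how expression profiles are turned into the probability distributions that the KL divergence requires.

```python
import math

def to_distribution(profile, eps=1e-12):
    """Rescale a non-negative profile so it sums to 1 (an assumption of this sketch)."""
    if any(x < 0 for x in profile):
        raise ValueError("profile must be non-negative")
    shifted = [x + eps for x in profile]  # eps avoids log(0) and division by zero
    total = sum(shifted)
    return [x / total for x in shifted]

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) = sum_i p_i * log(p_i / q_i)."""
    p = to_distribution(p)
    q = to_distribution(q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def pearson_correlation(x, y):
    """Pearson correlation coefficient; undefined for constant profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

rising = [1.0, 2.0, 3.0, 4.0]
falling = [4.0, 3.0, 2.0, 1.0]
spiky = [1.0, 1.0, 1.0, 5.0]

print(kl_divergence(rising, falling), pearson_correlation(rising, falling))
print(kl_divergence(rising, spiky), kl_divergence(spiky, rising))
```

The last line shows one practical difference: the KL divergence is asymmetric, so D(p || q) and D(q || p) generally differ, whereas the Pearson correlation is symmetric.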
The Penn State team found that KL clustering, which uses the KL divergence as the dissimilarity measure for a self-organizing map, produced better clusters than an SOM using the Pearson correlation or hierarchical clustering with the Pearson correlation. A paper describing their findings recently appeared in the journal Bioinformatics (19, 449-458).
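A rough sketch of the idea, a self-organizing map whose best-matching node is chosen by KL divergence, might look like the following. This is not the authors' implementation: the one-dimensional map, the Gaussian neighborhood, the linear decay schedules, and every name and default value (`train_som`, `n_passes`, `lr0`, `radius0`, `seed`) are assumptions chosen for illustration.

```python
import math
import random

def normalize(profile, eps=1e-9):
    """Turn a non-negative profile into a strictly positive distribution."""
    shifted = [x + eps for x in profile]
    total = sum(shifted)
    return [x / total for x in shifted]

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def train_som(data, n_nodes=3, n_passes=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a 1-D SOM that picks the best-matching node by KL divergence."""
    rng = random.Random(seed)
    samples = [normalize(row) for row in data]
    # Initialization: copy randomly chosen samples (results depend on the seed).
    nodes = [list(rng.choice(samples)) for _ in range(n_nodes)]
    for t in range(n_passes):
        frac = 1.0 - t / n_passes            # decays from 1 toward 0
        lr = lr0 * frac                      # learning rate for this pass
        radius = max(radius0 * frac, 1e-3)   # neighborhood width for this pass
        order = samples[:]
        rng.shuffle(order)
        for s in order:
            # Best-matching unit: the node with the smallest divergence from s.
            bmu = min(range(n_nodes), key=lambda j: kl(s, nodes[j]))
            for j in range(n_nodes):
                # Gaussian neighborhood on the 1-D node index.
                h = math.exp(-((j - bmu) ** 2) / (2 * radius ** 2))
                step = lr * h
                # A convex combination keeps each node a valid distribution.
                nodes[j] = [(1 - step) * w + step * x for w, x in zip(nodes[j], s)]
    return nodes

def assign(data, nodes):
    """Label each profile with the index of its closest node under KL."""
    return [min(range(len(nodes)), key=lambda j: kl(normalize(row), nodes[j]))
            for row in data]

data = [[9, 1, 1], [8, 2, 1], [1, 1, 9], [1, 2, 8]]
nodes = train_som(data, n_nodes=2, seed=0)
labels = assign(data, nodes)
print(labels)
```

Each pass revisits every sample and nudges the nodes again, which is the pass-by-pass reevaluation that distinguishes this family of methods from a single bottom-up merge. Changing `seed` can change the final map.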
“Good clusters,” as Kasturi described them, can be verified, as they were in the paper, with the Davies-Bouldin cluster validity index, or simply by eye, at least for a first pass. According to Kasturi, KL clustering provided “dense” clusters, in which all the points in the cluster were very close to each other in terms of similarity, in addition to well-distributed clusters. With hierarchical clustering, “we found that sometimes one or two clusters contain 80 percent of the data, so there’s no way that cluster can be good,” Kasturi said.
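The Davies-Bouldin index rewards exactly the properties Kasturi describes: it is low when clusters are tight and far apart. The sketch below uses the textbook form of the index with Euclidean distance, which is an assumption, since the article does not say which distance the paper paired with it. The helper `largest_cluster_share` is a hypothetical addition for spotting the lopsided clusterings she mentions.

```python
import math
from collections import Counter

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def davies_bouldin(points, labels):
    """Davies-Bouldin index; lower values mean denser, better separated clusters.

    Assumes at least two clusters with distinct centroids.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    centers = {lab: centroid(pts) for lab, pts in clusters.items()}
    # Scatter: mean distance from each member to its cluster centroid.
    scatter = {lab: sum(euclidean(p, centers[lab]) for p in pts) / len(pts)
               for lab, pts in clusters.items()}
    total = 0.0
    for i in clusters:
        # Worst-case ratio of combined scatter to centroid separation.
        total += max((scatter[i] + scatter[j]) / euclidean(centers[i], centers[j])
                     for j in clusters if j != i)
    return total / len(clusters)

def largest_cluster_share(labels):
    """Fraction of all points that fall in the most populous cluster."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = davies_bouldin(points, [0, 0, 1, 1])  # groups nearby points together
bad = davies_bouldin(points, [0, 1, 0, 1])   # splits each natural group
print(good, bad, largest_cluster_share([0, 0, 0, 0, 1]))
```

On this toy data the sensible labeling scores far lower than the one that splits each natural group, and a share near 0.8 would flag the kind of dominant cluster Kasturi warns about.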
KL clustering is an iterative method that reevaluates the clusters it is building pass-by-pass, “whereas hierarchical does it all in one shot, whether it’s right or wrong,” Kasturi said. “So in that sense, [hierarchical clustering is] very fast, but it’s very incorrect.”
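To make the “one shot” contrast concrete, here is a minimal sketch of bottom-up hierarchical clustering on Pearson distance. Average linkage and the function names are assumptions, as the article does not specify which variant was compared. Once two clusters are merged, the merge is never revisited.

```python
import math

def pearson_distance(x, y):
    """1 minus the Pearson correlation; undefined for constant profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)

def agglomerate(profiles, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]  # start with singletons

    def linkage(a, b):
        # Average linkage: mean pairwise distance between members of a and b.
        total = sum(pearson_distance(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Pick the closest pair; ties resolve to the lowest indices.
        _, a, b = min((linkage(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a, permanently
        del clusters[b]
    return clusters

profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.1], [3.0, 2.0, 1.0], [6.0, 4.0, 2.1]]
result = agglomerate(profiles, 2)
print(result)
```

There is no random initialization and no parameter beyond the number of clusters, so repeated runs on the same data return the same grouping.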
Kasturi warned that KL clustering is much slower than hierarchical methods: a 500-data-point set would take about three to four minutes to cluster using the hierarchical approach but about 15 to 20 minutes using KL, she estimated. However, she noted, “If you make a parallel version of it, it might be faster.” In addition, she said, KL requires a technical understanding of the system that most biologists wouldn’t have. “You have to choose a set of parameters, and with hierarchical clustering you don’t. Every time you perform hierarchical it will give you the exact same results, which is a useful thing; that’s why it’s so popular. Whereas with these methods, your initialization will reflect on your final results, so they have to be very intelligently chosen.”
Kasturi said she is currently developing a version of the software that optimizes parameters so that the user doesn’t have to do it, “but it’s not yet there.” The KL code is currently available by request, and Kasturi said she expects to have a graphical interface for the system available soon.
KL clustering isn’t intended to supersede or replace other, more familiar clustering methods in bioinformatics, Kasturi said, “but it’s definitely one more method to try.”