Skip to main content
Premium Trial:

Request an Annual Quote

Clustering Proteins


Last month, we toured the badlands of protein interaction data. The datasets are full of errors, and you can’t trust any single piece of interaction data. This month, we’ll start looking at approaches to taming that wilderness. It’s still early days, and while exciting research papers abound, working software is out of sight.

Graph clustering is a good way to get started with interaction data. Clustering makes any dataset more manageable and finds a lot of obvious relationships that you’d hate to miss.

Three to Try

My quick search of PubMed pulled up more than 30 papers on analysis of protein interaction data, possibly in combination with other data types. Just over 20 had interaction data as their central focus, and I concentrated on these. I avoided papers on visualization methods for interaction data, a topic I covered here last October.

The papers share a basic style. The authors describe their new method, often in gory mathematical detail with as much jargon as can be crammed into the allotted space. Then they validate the method in yeast by comparing their results to other people’s “high confidence” interaction datasets, or by showing concordance with Gene Ontology annotations, or correlation with microarray gene expression data. None of the papers goes beyond yeast.

All of the methods use graphs to represent the interaction data. In the usual approach, which I’ll call an interaction graph, the nodes represent proteins and the edges indicate which proteins interact. In some papers, the edges are labeled with information about the type or strength of evidence supporting that interaction.

One very intuitive and simple clustering method is from Alexander Rives and Timothy Galitski at the Institute for Systems Biology, where I work. They start with a plain interaction graph and calculate the distance (path length) between each pair of proteins. Then they apply an “inverse square law,” defining the association between two proteins to be one over the distance squared. This assigns higher weights to short distances. Finally they use hierarchical clustering, just like in microarray work, to cluster proteins by association. The result on the yeast dataset is striking, showing very clear clusters that correspond to well-known biological processes.

A more sophisticated version of this idea is described in a new paper from Christos Ouzounis’s group at the European Bioinformatics Institute. They start with a labeled interaction graph showing the strength of evidence for each interaction. Then they convert this into a second graph in which the nodes represent interactions and the edges connect two interactions that involve the same protein.

This may seem counterintuitive, and you may want to draw a few small examples to see what’s going on. Each edge of the new graph (which represents a protein) is labeled with the average weights of all interactions involving the protein. Finally, they cluster the graph using a fancy method called the Markov Cluster Algorithm (see A key advantage of the method is that a protein can belong to several clusters, which makes sense since a protein can participate in several biological processes.

A third clustering method is from a Chinese group led by Runsheng Chen. They use a graph method called spectral analysis to divide the graph into quasi-cliques and quasi-bipartites which are, respectively, tightly and loosely connected subgraphs. A quasi-clique is essentially the same as a cluster.

A lot of methods add layers to these kinds of approaches by correcting problems before clustering or using clusters as the starting point for further analysis. We’ll take a look at these methods in future columns.

For a categorized list of the papers Nat used for this column, check out www.


Nat Goodman, PhD, is a senior research scientist at the Institute for Systems Biology and is co-founder of HD Drug Works, which tests treatments for Huntington’s Disease. Send your comments to Nat at [email protected]


The Scan

Researchers Develop Polygenic Risk Scores for Dozens of Disease-Related Exposures

With genetic data from two large population cohorts and summary statistics from prior genome-wide association studies, researchers came up with 27 exposure polygenic risk scores in the American Journal of Human Genetics.

US Survey Data Suggests Ancestry Testing Leads Way in Awareness, Use of Genetic Testing Awareness

Although roughly three-quarters of surveyed individuals in a Genetics in Medicine study reported awareness of genetic testing, use of such tests was lower and varied with income, ancestry, and disease history.

Coral Genome Leads to Alternative Amino Acid Pathway Found in Other Non-Model Animals

An alternative cysteine biosynthesis pathway unearthed in the Acropora loripes genome subsequently turned up in sequences from non-mammalian, -nematode, or -arthropod animals, researchers report in Science Advances.

Mosquitos Genetically Modified to Prevent Malaria Spread

A gene drive approach could be used to render mosquitos unable to spread malaria, researchers report in Science Advances.