Columbia Team Uses New Protein-Protein Interaction Algorithm to Predict 300,000 Human PPIs


Columbia University researchers have developed an algorithm for predicting protein-protein interactions based on a combination of three-dimensional structural information and functional data.

In a study published this week in Nature, they used the algorithm – named PrePPI – to predict more than 30,000 protein-protein interactions for yeast and 300,000 for humans with confidence levels equivalent to those of conventional yeast two-hybrid screens, Barry Honig, director of Columbia's Center for Computational Biology and Bioinformatics and leader of the study, told ProteoMonitor.

While there has been considerable effort to develop PPI prediction algorithms based on information like sequence homology and co-expression, less work has been done on using structural information – in part, Honig noted, due to the relatively small number of protein structures that have been "solved" via techniques like X-ray crystallography.

Given this issue, the Columbia researchers worked to expand the structural information available to them, using, for instance, approaches like structural alignment, "where you superimpose structures of different proteins to see if they are related," Honig said.

"The trick is that there are far fewer structures available than there are sequences, but we make up for that in large part by using models and looking for very distant structural, geometric similarities between proteins to get information," he said.

This, Honig noted, led to "a major expansion" of structural information available as raw material for powering the algorithm.

The question of how useful physical structure is in predicting protein function and interactions is one that people have been working on "for some time," he said. "If you have two proteins that don't have an identifiable sequence relationship but they look alike, how likely is it that they are related in the sense that you could deduce function about one from the other?"

"We and others have been saying that there is a lot of information that can be gathered" based on structure as opposed to sequence, Honig said. "And we've shown it in individual cases. But here we just went for it and use it as extensively as we could, and we were able to make … high-confidence predictions for 300,000 pairs of human proteins and many more at lower confidence."

In addition to structural information, the researchers also included functional data like co-expression and pathway annotation. "So there's an initial amplification of what you learn from the structural information and then its combination with other sources leads to both greater amplification and greater reliability," Honig said.
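The article does not say how PrePPI combines its evidence sources. As a minimal sketch of one common approach, assuming the sources are treated as independent (a naive Bayes assumption), each source can contribute a likelihood ratio, and the ratios are multiplied to update the prior odds that two proteins interact. The function names and the numbers in the usage note are hypothetical, chosen only for illustration.

```python
from math import prod


def combined_likelihood_ratio(ratios):
    """Multiply per-source likelihood ratios.

    Assumes the evidence sources are conditionally independent; this is an
    illustrative assumption, not a description of PrePPI itself.
    """
    return prod(ratios)


def posterior_probability(prior, ratios):
    """Turn a prior interaction probability into a posterior.

    prior  -- prior probability that a random protein pair interacts (0 < prior < 1)
    ratios -- one likelihood ratio per evidence source (structure, co-expression, ...)
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * combined_likelihood_ratio(ratios)
    return posterior_odds / (1 + posterior_odds)
```

For example, with a hypothetical prior of 0.001, a structural ratio of 200 and a co-expression ratio of 5, `posterior_probability(0.001, [200, 5])` is roughly 0.5, while the structural evidence alone gives roughly 0.17. This mirrors the point in the quote: adding sources can raise both the number and the reliability of predictions.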

With the interaction data in hand, the authors selected 19 predictions to validate experimentally via co-immunoprecipitation. These instances were chosen largely because they were novel interactions of significant biological interest, Honig said. "You can't validate 300,000 predictions by looking at 19, but the reviewers wanted us to do some validation, especially of some interactions that weren't obvious."

In the end, they were able to experimentally validate 15 of the 19 cases, including interactions suggesting a previously unobserved convergence of pathways regulated by the nuclear receptors PPAR-γ and LXR-β; interactions between the cytokine-induced signal transduction suppressor SOCS3 and the RAS/MAPK pathway; interactions between tyrosine kinases and the clustered protocadherin proteins; and interactions between SATB2 and the Emerin proteome complex 32.

Ultimately, though, validation of the algorithm was "statistical," Honig said. "How well does it work [to predict existing data sets] pretending that we don't know the answer?"

The researchers trained the algorithm using combined interaction data from a number of databases, dividing this data into two sets – high confidence, or interactions cited in more than one publication; and low confidence, interactions cited in only one publication.
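The split described above can be sketched as follows. This is only an illustration of the stated rule (more than one publication versus exactly one); the record layout and the function name are assumptions, not the format of any particular interaction database.

```python
from collections import defaultdict


def split_by_support(records):
    """Partition interactions by how many distinct publications cite them.

    records -- iterable of (protein_a, protein_b, publication_id) tuples
               (a hypothetical layout used for illustration)
    Returns (high_confidence, low_confidence) as sets of unordered pairs.
    """
    publications = defaultdict(set)
    for protein_a, protein_b, publication_id in records:
        # frozenset makes (A, B) and (B, A) count as the same interaction.
        publications[frozenset((protein_a, protein_b))].add(publication_id)

    high = {pair for pair, pubs in publications.items() if len(pubs) > 1}
    low = {pair for pair, pubs in publications.items() if len(pubs) == 1}
    return high, low
```

Counting distinct publication identifiers, rather than rows, keeps an interaction that is listed twice from the same paper in the low-confidence set.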

They compared the results generated by PrePPI to datasets generated by high-throughput PPI detection methods like yeast two-hybrid screening, finding that PrePPI's results were comparable to those of conventional high-throughput techniques.

To further validate the algorithm they tested it against roughly 24,000 new human protein interactions that have been added to public databases since August 2010, finding good correspondence to experimentally validated interactions.
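The article does not name the metric used for this comparison. One simple way to quantify such a check, shown here as a sketch under that assumption, is the fraction of newly deposited interactions that also appear among the predictions. The helper names are hypothetical.

```python
def unordered_pair(protein_a, protein_b):
    """Represent an interaction so that (A, B) and (B, A) compare equal."""
    return frozenset((protein_a, protein_b))


def fraction_recovered(predicted, reference):
    """Fraction of reference interactions that appear in the predicted set.

    predicted -- set of unordered pairs predicted to interact
    reference -- set of unordered pairs from a held-out, experimentally
                 derived collection (for example, newly deposited entries)
    """
    if not reference:
        raise ValueError("reference set must not be empty")
    return len(predicted & reference) / len(reference)
```

Because the reference interactions were added after the training data was assembled, a score computed this way is not inflated by the algorithm having seen those pairs during training.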

Assuming these results hold, PrePPI could prove a valuable tool for PPI prediction and hypothesis generation, particularly given how labor-intensive conventional PPI studies are. A recent study mapping roughly 6,200 interactions between about 2,700 proteins in Arabidopsis thaliana, for instance, took a consortium of more than 20 laboratories four years to complete (PM 7/29/2011).

Yeast two-hybrid screening and co-immunoprecipitation experiments are among the most common experimental techniques for detecting PPIs. The former is typically good at identifying brief, transient interactions while the latter is better suited to detecting stronger interactions. Honig said he didn't know if the PrePPI algorithm had a bias toward either weak or strong interactions, noting that it was something the researchers hadn't yet looked into.

"This is all very new," he said, noting that the PrePPI database is up and available to researchers interested in using the tool in their work.

"It's very easy to access, and so we hope people will [access it] and we will see what they find," Honig said.

He added that his lab has started around a half-dozen collaborations with other Columbia researchers who are interested in using the tool to explore specific biological questions. In particular, he noted, they are doing studies on protocadherin proteins and on processes linked to cancer.

The researchers plan to continue to update the PrePPI resource as additional structural data becomes available and as they make adjustments to the algorithm, Honig said.

"There's new data and, frankly, our [algorithm] ideas keep getting more refined," he said. "So we're going to have to update it every few months."
