
Commentary: Data Handling, Biases Can Yield ‘Notoriously Unreliable’ PPI Outcomes

As protein-protein interaction studies generate reams of new data, a team of researchers claims that biases that arise from different methods of handling the data make the information “notoriously unreliable” for large-scale hypotheses about protein-protein interaction networks.
As proteomics continues to mature as a field of research, much attention has focused on protein-protein interactions and the biological properties underlying such networks. But in a commentary in the January issue of Nature Biotechnology, researchers from the University of Manchester in the UK argue that while technological advances in proteomics have led to “unprecedented quantities of protein-protein interaction data,” poor data-handling protocols and biases in the data have made the inference of “meaningful biological conclusions from network topology … problematic.”
“You need to be very careful about the biological conclusions you draw from the data,” Simon Lovell, co-senior author of the commentary, told BioInform’s sister publication ProteoMonitor this week. “The data itself is fine … but these large-scale parameters you get from the network — is it highly connected, is it apparently modular, [and] is it scale-free? — you need to be very careful about what they actually mean. Those might be correct for particular data handled in a particular way, but it’s difficult to know if they’re actually telling you anything about the underlying biology.”
According to Lovell, the commentary sprang from the work of Mario Beyer, a co-author who was looking into protein-interaction networks and found that different methods of handling data led to different results.
“We thought that was a bit strange,” Lovell said.
Lovell and his fellow researchers say in their commentary that an estimated half of all protein-protein interactions in Saccharomyces cerevisiae have been identified, while an estimated 10 percent of all PPIs have been identified for human beings. Ideally, these network samples would be unbiased, but in fact, they say, that is not the case.
Even with random sampling, incompleteness has an “enormous” effect on the overall topology of the network, and the interactions that have been identified are “by no means random,” according to the authors. Biases in samples lead to significant differences between the whole network and subsamples that have been observed.
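The effect of incompleteness alone can be illustrated with a toy simulation. The sketch below is not the authors' analysis: it grows a synthetic network by preferential attachment (an assumption chosen only for illustration, not real PPI data), then keeps a uniformly random 50 percent and 10 percent of its edges, echoing the coverage estimates quoted above for yeast and human. The function names `make_network`, `sample_edges`, and `degree_stats` are hypothetical.

```python
import random
from collections import Counter


def make_network(n_nodes, edges_per_node, rng):
    """Grow a toy network by preferential attachment (illustrative assumption)."""
    edges = set()
    # Pool of node ids; a node appears once per edge end, so picks favor hubs.
    pool = list(range(edges_per_node))
    for new in range(edges_per_node, n_nodes):
        chosen = set()
        while len(chosen) < edges_per_node:
            chosen.add(rng.choice(pool))
        for target in chosen:
            edges.add((target, new))
            pool.extend([target, new])
    return edges


def sample_edges(edges, fraction, rng):
    """Keep a uniformly random fraction of the edges (unbiased sampling)."""
    k = int(len(edges) * fraction)
    return set(rng.sample(sorted(edges), k))


def degree_stats(edges, n_nodes):
    """Return mean degree, maximum degree, and the share of isolated nodes."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    mean = 2 * len(edges) / n_nodes
    isolated = (n_nodes - len(degree)) / n_nodes
    return mean, max(degree.values(), default=0), isolated


if __name__ == "__main__":
    rng = random.Random(0)
    n = 500
    full = make_network(n, 2, rng)
    for label, frac in [("full", 1.0), ("50% sample", 0.5), ("10% sample", 0.1)]:
        sub = full if frac == 1.0 else sample_edges(full, frac, rng)
        mean, top, isolated = degree_stats(sub, n)
        print(f"{label:>10}: mean degree {mean:.2f}, max degree {top}, "
              f"isolated {isolated:.0%}")
```

Even in this idealized case, with no bias in which edges are kept, summary statistics such as mean degree and the share of apparently unconnected proteins shift with coverage. Real datasets add non-random sampling on top of this.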
Furthermore, data from high-throughput studies are biased “toward proteins from particular cellular environments, toward more ancient, conserved proteins, and toward highly expressed proteins.” Coupled with the incomplete data, the high number of false positives has resulted in data that is “notoriously unreliable.”
To work around this, a number of datasets created by multiple validations have been proposed. In their commentary, the Manchester researchers point to the “filtered yeast interactome” dataset developed by Jing-Dong J. Han and colleagues and the “high confidence” dataset developed by Nizar Batada and colleagues.
Such approaches remove certain kinds of interactions, including false positives, which makes them in one respect a desirable way to improve data quality. But multi-validation introduces biases of its own by rejecting certain classes of data, the authors stress in their commentary.
“For example, the set of interactions that will be retained will tend to be biased toward those that are highly studied,” they write. “This has a drastic effect on network topology.”
They demonstrate these effects using the “filtered yeast interactome” and the “high confidence” datasets, as well as the “LC” dataset developed by Teresa Reguly and others.
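How a multi-validation filter can favor well-studied interactions can be sketched with a small simulation. This is an illustration under stated assumptions, not a reconstruction of the authors' method, and it does not use any of the datasets named above. Each simulated interaction gets a hypothetical `attention` score that doubles as its chance of being detected in any one study, and the filter keeps only interactions seen in at least `min_support` studies.

```python
import random


def simulate_multi_validation(n_interactions, n_studies, min_support, rng):
    """Simulate repeated detection and keep interactions seen often enough.

    `attention` is a made-up per-interaction score standing in for how
    heavily the proteins involved are studied; it is also used as the
    per-study detection probability (an assumption of this sketch).
    """
    records = []
    for _ in range(n_interactions):
        attention = rng.uniform(0.05, 0.9)
        hits = sum(rng.random() < attention for _ in range(n_studies))
        records.append((attention, hits))
    retained = [r for r in records if r[1] >= min_support]
    return records, retained


def mean_attention(records):
    """Average attention score over a list of (attention, hits) records."""
    return sum(a for a, _ in records) / len(records)


if __name__ == "__main__":
    rng = random.Random(1)
    all_records, kept = simulate_multi_validation(2000, 5, 2, rng)
    print(f"retained {len(kept)} of {len(all_records)} interactions")
    print(f"mean attention, all:      {mean_attention(all_records):.2f}")
    print(f"mean attention, retained: {mean_attention(kept):.2f}")
```

Every simulated interaction here is real, so the filter removes no false positives at all. It still shifts the retained set toward interactions with high attention scores, which is the kind of selection effect the commentary describes.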

Removing interactions leads to “radical” changes in many network properties, the authors write, adding, “the effects of multiple validation will be similar in any dataset until we have unbiased multiple observations of all interactions.”
Even the highest-quality datasets currently available, they argue, contain interactions that are reliable but not necessarily representative of the PPI networks as a whole, making any assumptions about the networks as a whole “problematic.”
The result is that any biological inferences drawn from protein-protein interactions and their networks may be error-laden. Biases inherent in protein interaction datasets can produce misleading results, an effect that can be further exacerbated in multi-validated datasets.
“Thus, what is a topological analysis of these samples really telling us about the complete network, let alone biology?” the authors ask in the commentary.
According to Lovell, small-scale studies present fewer problems because they tend to use better controls, and the researchers performing them tend to have a better understanding of the biology. The problem, he said, comes when someone makes great logical leaps about, for example, yeast as a whole.
In their commentary, Lovell and his colleagues offer no quick fix to get around the problem. The technology and information that are currently available may make it impossible to make global statements about PPI networks.
“We just need more data that need to be tied back to biological understanding — not just data from one source, but data from many sources,” Lovell said.
The commentary comes as a growing amount of research on protein-protein interactions is being done in parallel with single-protein identification as a way of unlocking the complexities of the protein galaxy.
During the summer, Joshua LaBaer, head of Harvard Medical School’s Institute of Proteomics, called for more work looking into protein-protein interactions, saying research in the field was still in its infancy; beyond whether one protein binds with another, little else is understood about how specific proteins act and react with each other.
In an e-mail, Marc Wilkins, a professor of systems biology at the University of New South Wales and co-founder of Proteome Systems, told ProteoMonitor that the study of protein-protein interactions is still in the middle of a “ramp-up period … as we try and understand the data and perhaps even the best biological questions to be investigated by interactomics research.”
Wilkins said that work into protein-protein interaction networks is still new and what is happening in the field is no different from what happens in other scientific fields.
It “will take a little time to generate agreed data sets,” he said. “As this happens, there will be some biological insights and hypotheses generated that stand the test of time and others that do not.”
Because there “is no perfect means of measuring PPIs, the PPIs themselves are a mixture of stable and transient interactions and in a constant state of flux, and we do not yet have any clear picture on the extent of interactions that actually exist, experiments do require considerable care in their execution and interpretation,” Wilkins added.

A version of this story previously ran in this week’s issue of ProteoMonitor, a BioInform sister publication.
