Scientists Near Compromise in Debate over Quality of Curated Protein-Protein Interaction DBs


By Vivien Marx

A heated debate over the quality of literature-curated protein-protein interaction databases may be nearing a compromise as scientists on both sides of the issue are working together on a forthcoming joint commentary in Nature Methods that aims to bridge the differences between the two camps, BioInform has learned.

The joint commentary will explore methods to help curators ensure that these databases provide high-quality data and to enable users to search them more effectively. It could also end a squabble that has been brewing since January, when several researchers, led by Michael Cusick of the Dana Farber Cancer Institute, published a study in Nature Methods that called literature-curated PPI databases "error-prone."

At the time, some PPI database curators told BioInform that the study's findings were faulty and that many of the entries the authors flagged as errors were actually correct in the databases. Earlier this month, representatives from several PPI databases published their formal response in a correspondence to Nature Methods, claiming that the authors "arrived at their conclusions by misunderstanding the difference between the reliability of experimental data supporting protein interactions and the correctness of the curation process itself."

In the same issue of the journal, Cusick et al. also published an addendum to the original paper that sought to define the aims of the study and clarify the initial results.

The exchange between the two communities has arisen from disagreements in a number of areas, including fundamentals about scientific-data generation, the curation process, and the general perception of these databases among users.

What is a Curation Error?

In response to the initial Cusick et al. study, representatives from IntAct, DIP, MINT, TAIR, and BioGrid — all part of the International Molecular Exchange, or IMEx, Consortium — wrote that their resources aim to "incorporate the complete data as presented in the source publications, rather than selecting evidence they consider more reliable or otherwise privileged."

PPI databases "always fully curate a given publication and would consider it an egregious omission if only a subset of the protein interactions reported in a publication or its supplementary material would be contained in the database."

As a result, they "strongly object" to the idea that inclusion in a database of a PPI with "limited supporting evidence" should be considered a "curation error," as the original study maintained.

Cusick et al. "define a set of criteria for a specific use restricted only to direct pairwise protein-protein interactions, which they refer to as 'binary' interactions. They evaluate literature-curated datasets against these criteria and then assert that failure to meet their criteria represents 'incorrect curation,'" the database representatives wrote.

The IMEx authors, led by Lukasz Salwinski of the University of California, Los Angeles, Department of Energy Institute for Genomics and Proteomics, reanalyzed the interactions that were deemed erroneous in the original paper to identify "actual curation errors, defined as inconsistencies between the original published data and their representation in our databases."

They found that the rate of actual curation errors in the sample interactions was "consistently under 10 percent," as opposed to the rate of as much as 45 percent that Cusick et al. reported.

For example, in re-analyzing the subset of BioGrid yeast data, they found an error rate of 4 percent, an "order of magnitude" lower than reported in the initial study. Meanwhile, in the Arabidopsis dataset from IntAct, the curation error rate was 2 percent compared to the 10.2 percent found by Cusick and his colleagues, and for TAIR the error rate was 3 percent, which is one-third of the figure found by the DFCI-led team, the IMEx team said.

Binary Code

One source of the debate surrounding PPI database curation is differing views on a number of terms and concepts.

For example, in addition to refuting the error rate claims by the DFCI-led team, the database team criticized Cusick and colleagues for defining "binary" interactions as meaning direct interactions with multiple independent supporting reports.

Although that is a "valid use" of the term, the IMEx team noted, they themselves use it differently: to refer to "any interaction," making "no judgment" whether or not the interaction is "direct or indirect."

Cusick et al. respond in their addendum, however, that "a meaningful fraction of database users is under the impression that 'binary interaction' means direct pairwise PPIs, and that is the definition we tried to apply."

While the IMEx team's definition of binary is "technically correct from an informatics viewpoint, binary representation likely does not accurately reflect biophysical reality," Cusick and colleagues state.

Another source of confusion stems from perceptions about the role of curated PPI databases. Cusick said last week in an e-mail to BioInform that these databases are representing themselves as "data repositories" that make "minimal effort to assess PPI quality," but noted that this role is counter to "widespread community perceptions."

Marc Vidal, another author on the original critique, told BioInform that there is a "disconnect" in the community because many published papers have called literature-curated data the "gold standard." He added that this perception has taken hold in the community, particularly among clinicians or microbiologists who know very little about the curation process and may not realize that PPI database curators are not evaluating the quality of the interactions in the literature.

Vidal, who runs the network biology group at DFCI's Center for Cancer Systems Biology, spoke to BioInform last week on the sidelines of the annual symposium of the Systems Biology Center of New York at the Mount Sinai School of Medicine, where he was a speaker.

He stressed that he and his colleagues are not criticizing curation in and of itself, and agreed that it is not the role of curators to judge the quality of publications. The challenge for curators is that not every paper is of the same quality. "It's just a reality," he said.

Vidal did defend the original study in which he and his colleagues "re-curated" these resources, however, noting that they found "enormous problems" in some protein-protein interaction databases. Indeed, in their Nature Methods addendum, they note that after revisiting the different databases, they found that 95 percent of the "problematic curation units" they detected in their original study came from HPRD and BIND, which are "non-IMEx" databases.

For IMEx databases, they added, "there is minimal difference in error rates between our recuration and that of Salwinski et al."

Another common criticism of curated PPI databases is that there is little overlap between them, but this is actually by design, Michael Tyers, a researcher at the Wellcome Trust Centre for Cell Biology at the University of Edinburgh and principal investigator of BioGrid, told BioInform via e-mail.

Tyers explained that the IMEx databases determined "not to curate the same publications in order to maximize use of curation resources."

In their addendum, Cusick et al. acknowledge this and note that "the problem of low overlaps should be mitigated once the IMEx exchange of curation between databases becomes implemented."

Tyers echoed Vidal, saying that the role of PPI database curators "is not to re-evaluate the peer-reviewed literature, but to efficiently extract information from the biological literature in a structured format for use in computational and comparative approaches to understand biological networks.

"It is the nature of biological science that discrepancies arise, often because of subtle contextual differences in experimental design; these discrepancies can only be resolved by further experimentation, including tests of predicted interactions," he said.

He also said it is important to note that current interaction databases "do not claim that all curated data is of a 'gold standard.'"

Seeking Middle Ground

Efforts are underway on both sides of the issue to ensure that researchers are getting the most from curated PPI databases.

For example, curators currently view "the high-level curation" of network structure and information flow as a "critical ongoing issue," Tyers said. This information "often cannot be easily deduced from component interactions alone," so discussions are "underway between many groups as to how to best capture pathway and network features."

In addition, some have suggested that scientists begin to "curate their own papers" with summary tables to facilitate the process of getting that data into databases, Vidal said.

And, as Cusick et al. explained in their addendum, some projects are underway to generate "confidence scores" for curated, predicted, and experimentally obtained protein-protein interactions — measures that should be encouraged and "appropriately funded."

As Cusick explained via e-mail, he and his colleagues see "many important uses" for data repositories, and their expansion "should be encouraged."

Nevertheless, "the path from repository to database is unclear," he said. Exploring this path is one of the aspects he hopes the joint effort can address and help to clarify.

