A new study published in Nature Methods, which determined that literature-curated protein-protein interaction databases "can be error-prone and possibly of lower quality than commonly assumed," has touched a nerve among proteomics researchers.
Lead author Michael Cusick of the Dana-Farber Cancer Institute and Harvard Medical School told BioInform that he and his colleagues have been overwhelmed by the numerous and strong responses to the paper.
Indeed, some curators of these databases are claiming that the study's findings are faulty. Henning Hermjakob, team leader of the European Bioinformatics Institute's proteomics services group and co-developer of the IntAct database, told BioInform via e-mail that "while curation errors undeniably happen, our analysis of the IntAct Arabidopsis data presented in the [Nature Methods] manuscript indicates that the majority of errors reported by the authors are in fact correct in the database."
He said he and his colleagues have sent comments to the study's authors and requested an erratum from Nature Methods. In addition to Arabidopsis, the study also examined yeast and human datasets, but Hermjakob noted that a detailed analysis of these results will be more difficult, "as the authors have unfortunately analyzed data from multiple databases without an interaction-level source attribution."
Cusick defended the study's scope and intentions. "In recent years many reports had hinted that curated information may not be of the impeccable quality that is commonly presumed, consistent with our findings," he told BioInform in an email. "People had a hunch."
Scientists have also assumed that literature-curated PPI data were more reliable than high-throughput datasets, but nobody had ever actually tested that assumption, he said. "Our publication shows that these assertions are not validated."
Most PPIs have not been verified by multiple methods or groups, and most of them are not well-controlled because they come from medium- to high-throughput screens, which use "the same methods that are denigrated in the literature," he said. In addition, most PPIs do not have supporting cell biological information, he said.
"Even the 'scrutiny of peer review' can be questioned, as all scientists know of published reports with erroneous data," Cusick added. "However, we emphasize that evaluation of the quality of the underlying data is far beyond the scope of our analysis. We were careful to make this point in our paper."
Why Interactions Matter
As the authors outlined in the study, knowing about all possible protein-protein interactions is "an essential component" of systems biology.
Literature-curated PPI databases such as IntAct, MINT, MIPS, and BIND offer an "advantage" over high-throughput PPI experiments such as yeast two-hybrid assays or co-affinity purification and mass spectrometry because the literature-based approach is "hypothesis-driven," they wrote.
That means the literature "often, though not always" stands to help scientists understand the biological functions of interacting proteins.
Datasets from the literature are used to appraise the reliability of experimental PPI datasets, and having high-quality data is "integral" in estimating the reliability and size of interactome maps, the authors said.
"The superior reliability of literature-curated PPI datasets, versus high-throughput datasets, is generally presumed," they wrote, adding that this presumption has "not been thoroughly investigated."
As part of their analysis, the scientists found that many available literature-curated PPI datasets are populated from PPIs gathered in high-throughput experiments.
Using the repository for interaction datasets BioGRID, which is hosted at Mount Sinai Hospital in Toronto, the researchers ranked 11,858 yeast PPIs curated from the literature and found that 75 percent of these interactions were described in one publication only. One-quarter of them were described in multiple publications, 5 percent of them were in three or more publications, and 2 percent in five or more publications, the team found.
Looking at slightly more than 4,000 human PPIs, the team found that only 15 percent had been described in multiple publications, and when they looked at Arabidopsis PPIs, 93 percent of literature-curated PPIs were from data in a single publication. "All told, the number of PPIs supported by data in multiple publications is too small," the scientists wrote.
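The tally the authors describe (grouping curated interaction records by protein pair and counting the distinct publications that support each pair) can be sketched roughly as follows; the records and identifiers here are invented for illustration, not drawn from BioGRID:

```python
from collections import defaultdict

# Hypothetical literature-curation records: (protein A, protein B, PubMed ID).
records = [
    ("YFG1", "YFG2", "PMID:111"),
    ("YFG1", "YFG2", "PMID:222"),
    ("YFG2", "YFG1", "PMID:222"),  # same pair listed in the other order
    ("YFG3", "YFG4", "PMID:333"),
]

# Map each order-independent pair to the set of publications supporting it.
support = defaultdict(set)
for a, b, pmid in records:
    pair = tuple(sorted((a, b)))  # treat A-B and B-A as one interaction
    support[pair].add(pmid)

single = sum(1 for pubs in support.values() if len(pubs) == 1)
multi = sum(1 for pubs in support.values() if len(pubs) >= 2)
print(single, multi)  # prints: 1 1
```

Counting distinct publications per normalized pair, rather than raw records, is what distinguishes singly supported interactions from those independently described more than once.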
The authors acknowledged that they could not deliver an assessment of the "completeness for literature-curated datasets," so they evaluated database overlaps "as a surrogate for completeness," reasoning that the differing PPI databases "should curate from the same set of PubMed reports."
They found that "surrogate estimates of completeness of literature-curated datasets, at least for yeast, suggest that coverage of curated literature is far from comprehensive."
The team focused on the three members of the International Molecular Exchange Consortium, or IMEx — MINT, IntAct, and DIP — and found that "they had surprisingly low overlap of curated PPIs."
"That the overlap is so small after years of intense curation of protein interactions is reason for concern," the authors noted.
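The overlap comparison the authors used as a surrogate for completeness reduces to set intersection over normalized protein pairs. A minimal sketch, with invented pairs rather than actual MINT, IntAct, or DIP contents:

```python
# Hypothetical curated pairs from three databases (real contents differ).
mint = {("P1", "P2"), ("P3", "P4"), ("P5", "P6")}
intact = {("P1", "P2"), ("P7", "P8")}
dip = {("P1", "P2"), ("P3", "P4"), ("P9", "P10")}

# PPIs curated by all three databases versus everything curated by any of them.
shared_all = mint & intact & dip
union_all = mint | intact | dip
overlap_fraction = len(shared_all) / len(union_all)

print(len(shared_all), round(overlap_fraction, 2))  # prints: 1 0.2
```

If the three databases really were curating the same PubMed reports to completion, the intersection would approach the union; a small intersection is what the authors read as evidence of incomplete coverage.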
Homing in on yeast data to study the "actual reliability" of literature-curated PPI datasets, the scientists then re-curated 100 randomly selected pairs of interactions, assigning confidence scores to each one. They found that 75 percent of the interactions could not be substantiated and 35 percent of the pairs were inaccurately curated.
In the view of the authors, these "observations explain the poor reliability, relative to high-throughput datasets, of the singly supported literature-curated dataset in both computational and experimental analyses."
When the researchers re-curated human PPIs, they created two datasets, one of 188 interactions with higher confidence values and one of 188 pairs with lower confidence values. In the first set, 38 percent of the initial curation unit values turned out to be wrong, with the most common errors being assignment to the wrong species and the lack of an experiment supporting the interaction, they found.
From the set with lower confidence values, the scientists removed interactions with more than one supporting publication and found that among the remaining 160 pairs, 45 percent of the interactions were not validated, with the most common errors being wrong species and erroneous protein names.
The scientists re-curated 100 higher confidence interactions for Arabidopsis and found "improved" results relative to yeast and human.
The scientists stated that their findings of "large error rates in curated protein interaction databases, at least for yeast and human," are in line with "recent hints" in the literature that "the quality of literature-curated datasets may not be as high as widely perceived."
While curator error may "occasionally" be responsible, the authors believe that most errors are due to the "simple reality" that it "can be extremely difficult" to extract accurate information from a free-text document, and that literature curation is "often underappreciated." Gene name confusion is one "particularly thorny issue," they wrote.
The lack of formal representation of PPIs in published manuscripts makes it difficult "if not impossible" to extract data in usable form; for example, the designation of the species of origin of the protein interactors, "an absolutely critical piece of information, is often buried or lacking altogether." Another problem is the omission of standardized descriptions.
The scientists stated in their paper that curation difficulties would be eased if researchers submitted their PPI data in standardized format to databases.
The study authors believe that the fact that literature-curated datasets have "inherent reliability difficulties" should influence thinking about the "proper generation of positive reference sets." Once structured digital abstracts gain in popularity, curation will be "greatly" improved, they wrote.
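A standardized submission of the kind the authors advocate can be as simple as a tab-delimited record with explicit, machine-readable fields. The sketch below is loosely modeled on the HUPO-PSI tab-delimited (MITAB) layout, but uses a reduced, hypothetical subset of columns; the accessions are real UniProt identifiers (human TP53 and MDM2), while the PubMed ID is a placeholder:

```python
# Reduced, hypothetical subset of MITAB-style columns (the real PSI-MI TAB
# specification defines more fields than shown here).
FIELDS = ["id_a", "id_b", "taxid_a", "taxid_b", "detection_method", "pubmed"]

# Illustrative record: human TP53 (P04637) interacting with MDM2 (Q00987);
# the PubMed ID is a placeholder, not a real citation.
line = ("uniprotkb:P04637\tuniprotkb:Q00987\ttaxid:9606\ttaxid:9606"
        "\ttwo hybrid\tpubmed:12345678")

record = dict(zip(FIELDS, line.split("\t")))

# The species of origin, often "buried or lacking altogether" in free text,
# becomes an explicit field in a structured submission.
print(record["taxid_a"], record["pubmed"])  # prints: taxid:9606 pubmed:12345678
```

Because each value carries a controlled-vocabulary prefix (uniprotkb:, taxid:, pubmed:), a curator or parser can resolve the proteins, species, and provenance without guessing from prose.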
Some curators of these protein interaction databases expressed concern about the study's findings.
"The very aim of the IMEx consortium is to reduce redundant curation of the same publication by more than one database, as this is ineffective use of public funding," Hermjakob told BioInform.
IMEx partners, he said, have instead agreed on different areas of curation, mostly according to journal. This fact is also mentioned in a paper cited in the study, he said, "Broadening the horizon – level 2.5 of the HUPO-PSI format for molecular interactions" published in BMC Biology in 2007.
"The policy should be known to Dr. Cusick, who is actually a co-author" of that reference, Hermjakob said.
IntAct, developed by Hermjakob and Rolf Apweiler at EBI along with colleagues at institutes in Italy, Germany, Switzerland, Denmark, Spain, Israel, France, and the UK, relies on data provided by other resources, including MINT, BIND, DIP, the comprehensive yeast genome database CYGD, and STRING, a database of predicted functional associations between proteins.
"Independent of our reservations with regard to the reported curation errors" in the Nature Methods paper, Hermjakob said he and his colleagues agree that extracting accurate information from long free-text documents can be extremely difficult.
IMEx partners DIP, MINT, and IntAct are working intensely with journals and editors to encourage direct data deposition in public databases, as recommended in a series of editorials and publications in several journals, he said.
As community awareness rises, "we are seeing a rising level of direct depositions, and in the reporting period April 2007 through March 2008, the IntAct database for the first time had more new binary interactions from direct depositions than from curator-initiated literature curation," Hermjakob said.
When curating published manuscripts, IntAct curators attempt to contact the authors to resolve ambiguities, "rather than just omitting the information or making an educated guess, as suggested by Cusick et al.," Hermjakob said.
"Upon release of a curated dataset, we send an e-mail to the publication authors asking for suggested corrections."
Hermjakob underscored the point made by the study authors that the curation difficulty arises partly because of a lack of standardized formats in submissions of PPI data. "We completely agree with this statement," he said.
Gianni Cesareni of the University of Rome Tor Vergata, co-developer of the MINT database, said he was not surprised by the Nature Methods study's finding that PPIs curated by the three IMEx databases have little overlap. "It is no reason for concern," he said.
Once the exchange of completed records is fully operational, for which "a few technical issues need to be resolved," the three databases will contain substantially the same data, he said. He said he and his colleagues were surprised that the study authors appear unaware of this development.
MINT, which was introduced in 2006, was designed to collect experimentally verified PPIs. MINT curators extract information about molecular interactions from peer-reviewed journal articles, focusing on physical, rather than genetic or computationally inferred, protein-protein interactions.
"The ideal curator is an encyclopedic expert who knows everything in each domain of modern biology," remarked Cesareni, adding that "in reality, curators are PhDs with experience in specific domains." After some additional training, they "face the daunting task of accurately identifying and extracting the most relevant information from complex and sometimes incomplete or inaccurate experimental reports."
"Curators make mistakes as any human being does," Cesareni said. Scientists, he said, must do their part by providing accurate and complete information. "It is common to find a manuscript reporting that protein x binds to protein y without providing any clue about the organism, [whether] mouse, rat, human [or another organism] where this interaction might occur."
This difficulty of curation has been recognized by the scientific community working in the protein interaction field, and Cesareni noted that efforts are underway to address the issue, such as MIMIx (Minimum Information required for reporting a Molecular Interaction Experiment), an initiative under the Human Proteome Organization's Proteomics Standards Initiative.
In addition, he noted that HUPO PSI is urging journals to require that "minimal structured information" be submitted in public resources together with "traditional manuscripts" that report protein interaction information.
"This suggestion is being taken up slowly by authors and by editors, who fear discouraging submissions by being more demanding with authors," he said.
FEBS Letters, a journal for which Cesareni is an editor, working in collaboration with the MINT database and the publisher Elsevier, was quick to establish a policy "by which authors are asked to submit such structured information when the paper is accepted," he said.
The structured information is stored in a database and published as a structured digital abstract alongside traditional abstracts in the print and online version [BioInform Aug. 1, 2008].
"This experiment was designed to evaluate authors' willingness and competence to comply with such a task," Cesareni said. "After almost a year of experimentation we will soon come up with a report describing the experience. We hope that this report will represent a stimulus for other scientific journals to join in and collaborate with databases."