Following the release of the first protein-protein interaction map for yeast in 2000, a number of similar data sets have become available, promising a valuable source of information for biologists interested in studying individual genes or proteins in the context of their sub-cellular behavior. But these data sets have failed to live up to their promise as discovery tools because they exhibit very high false positive rates: It’s been estimated that only 30 to 50 percent of the interactions depicted in such maps are real.
Faced with this challenge amid a sea of data from in-house protein interaction projects, researchers at CuraGen developed a computational method to cull the biologically relevant interactions from a complex web of putative hits, improving the company’s target validation process along the way.
Protein interaction maps are potential gold mines of information, but currently, “people don’t really know how to tell a false positive from a credible interaction occurring in vivo,” said John Chant, head of genomics and proteomics and the coordinator of the biomarker discovery program at CuraGen. “This is a problem with all large-scale genome sets of data — telling what’s noise and what’s meaningful is a challenge.”
While more established areas of genomics have already addressed this problem, researchers are still staking out new computational ground in the area of protein interaction analysis. As a first step, the CuraGen team, in collaboration with Joel Bader, a former CuraGen scientist now at Johns Hopkins University, developed a statistical approach for deriving a probability score, or “confidence metric,” akin to the Phred score in base calling or error bars for gene expression data. The method considers every data point in the interaction map and assigns each interaction a score on a scale of zero to one, with zero signifying no probability that the interaction is biologically relevant and one signifying a 100 percent probability. “We converted it to a scale that’s easy for biologists to use,” Bader said.
Previous methods for assessing putative protein-protein interactions involved intersecting several data sets, a process that results in a high-confidence, but extremely small, set of overlapping interactions, Chant said. Other methods rely on the “anecdotal biases” of co-expression or other biological data to determine what’s real and what’s not real. The challenge, said Chant, was coming up with a statistical framework that would consider only the proteomics data, and assess each interaction individually to assign a confidence score to each data point.
The method, described in the January issue of Nature Biotechnology, uses the topology of the network — the number of interactions assigned to each protein — as the primary measure of confidence. “The idea there is that if two proteins interact with a lot of proteins in common, then we can have higher confidence that they also interact with each other,” Bader said. Other parameters include protein “promiscuity” — if a protein interacts with many other proteins, then the confidence score for each of those interactions is lower — and screening statistics, in which “we sample possible interacting partners again and again, and if we see the same pair of proteins showing up again and again and again as interaction partners, then we can be more confident about it.”
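To make the three ingredients Bader describes more concrete, here is a minimal Python sketch of how shared partners, promiscuity, and repeat observations might be folded into a single zero-to-one score. It is an illustration only, not the published method: the article does not say how the quantities are combined, so the logistic form, the function names, and the weights below are all assumptions chosen so that each feature pushes the score in the direction described above.

```python
import math
from collections import defaultdict


def build_adjacency(pairs):
    """Map each protein to the set of partners it was observed with."""
    adj = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def features(a, b, adj, times_seen):
    """Return (shared partners, promiscuity, repeat observations) for a pair."""
    # Partners the two proteins have in common, excluding the pair itself.
    shared = len((adj[a] & adj[b]) - {a, b})
    # Promiscuity: total number of partners of the two proteins.
    promiscuity = len(adj[a]) + len(adj[b])
    return shared, promiscuity, times_seen


def confidence(shared, promiscuity, times_seen,
               weights=(-1.0, 1.0, -0.2, 0.8)):
    """Combine the features into a 0-to-1 score.

    The logistic form and the weights are hypothetical, picked only so that
    more shared partners and more repeat sightings raise the score while
    higher promiscuity lowers it.
    """
    w0, w_shared, w_prom, w_seen = weights
    z = (w0
         + w_shared * shared
         + w_prom * math.log(promiscuity)
         + w_seen * math.log(times_seen))
    return 1.0 / (1.0 + math.exp(-z))


# Toy network (invented data): A and B share partners C and D; E and F share none.
pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("E", "F")]
adj = build_adjacency(pairs)
score_ab = confidence(*features("A", "B", adj, times_seen=1))
score_ef = confidence(*features("E", "F", adj, times_seen=1))
```

In this toy network the A-B pair scores higher than E-F because A and B share two partners. In practice the weights would have to be fitted against a trusted set of known interactions rather than set by hand.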
After predicting confidence scores for protein interactions derived from both yeast two-hybrid and mass spec experiments (6,395 and 41,775 interactions, respectively), the CuraGen team validated its results using gene expression data and protein annotation from the MIPS database. A confidence score threshold of 0.65 yielded a subset of 3,854 high-confidence interactions involving 2,262 proteins.
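The thresholding step can be sketched in a few lines of Python. The 0.65 cutoff comes from the article; the function name, the inclusive comparison at the threshold, and the toy scores are assumptions for illustration and are not CuraGen data.

```python
def high_confidence_subset(scored, threshold=0.65):
    """Keep interactions scoring at or above the threshold.

    `scored` maps (protein_a, protein_b) tuples to confidence scores.
    Returns the retained interactions and the set of proteins they involve.
    Whether the cutoff is inclusive is an assumption.
    """
    kept = {pair: score for pair, score in scored.items() if score >= threshold}
    proteins = {protein for pair in kept for protein in pair}
    return kept, proteins


# Invented scores for three putative interactions.
scored = {("A", "B"): 0.90, ("A", "C"): 0.65, ("D", "E"): 0.30}
kept, proteins = high_confidence_subset(scored)
```

Counting the entries in `kept` and `proteins` gives the same two figures the team reports for the real data: the number of high-confidence interactions and the number of proteins they involve.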
More importantly, CuraGen found that once it had a network of high-confidence interactions to work with, it could use that map as a tool for merging proteomics data and gene expression data to identify protein complexes involved in specific biological roles. The team identified, for example, cell-cycle proteins that gene expression studies alone would miss because their transcripts are out of phase or don’t follow a cyclic expression pattern. Chant said that the method is a key component of CuraGen’s technology pipeline because it “helps put these targets into a biological or human health context for developing drugs.” Several of the company’s pharmaceutical partners “actively use this technology to validate targets, and to move targets forward,” he said.
Chant said that CuraGen is actively “improving the sophistication of these methods” to keep pace with new protein interaction data, as well as other types of large-scale data sets that will have to be integrated with the network data. “Going forward in human systems, there’s expression data, there’s genetic data, there’s soon to be large RNAi data sets, there’s protein localization data … One can certainly combine them, but figuring out how to combine them as effectively as possible to sort of mimic what a biologist would do over, say, 20 years in their lab is the real challenge.”
Bader noted that “this is really just the beginning of large-scale data sets looking at biological networks,” and that an “outpouring” of information on protein networks, gene regulatory networks, protein-DNA interactions, and other systems is on the horizon. Several other research groups, including Fritz Roth’s lab at Harvard and Mark Gerstein’s group at Yale, have also made similar progress in assessing the relevance of protein interaction networks via computational and statistical methods, he said, but “it’s only recently that these really large-scale data sets have been produced that demand this extra type of analysis.”
With only a handful of data sets available to work with so far, “it’s just sort of the ramp-up period for these types of experiments,” Bader said, “but there’s going to be a similar ramp-up in the statistical and computational efforts to analyze the networks.”
Chant agreed. “We’ve made a great start, but it’s relatively early days,” he said.