Last month, Cellzome and MDS Proteomics unveiled their technologies for large-scale analysis of cellular protein complexes in competing articles published back-to-back in Nature. Their “test runs” in yeast reminded some of Celera’s publication of the fly genome two years ago and signaled that proteomics has become more than just a word. Meanwhile, scientists have scrutinized the two data sets, and one analysis finds that only 20 percent of the yeast proteins identified actually overlapped, even when counting only those “bait” proteins the two groups used in common.
The reasons for the discrepancy are many, but the result illustrates a potential pitfall for researchers hoping that new high-throughput methods for protein interaction analysis will immediately offer an accurate picture of a proteome, be it yeast or human. Many scientists agree that fishing out protein complexes and identifying their components by mass spectrometry has advantages over techniques such as the yeast two-hybrid method, which detects only binary interactions. The question that remains is how to validate the data to find “true” interactions.
A side-by-side comparison is possible only for a fraction of the MDS and Cellzome data, because each group chose a different set of bait proteins to hunt for complexes. But Paul Tempst, head of the protein center at Memorial Sloan-Kettering Cancer Center in New York, found 94 baits the two studies had in common. Each bait had a number of proteins associated with it, and Tempst compared the overlap in each case.
The result: almost half of the pairs of complexes had no overlap at all, and almost a fifth shared only one protein. Overall, only about 20 percent of the proteins identified for the common baits matched. “I did not expect that it would be that much different,” Tempst said. According to MDS Proteomics, Tempst missed a number of shared baits (the company counted 115), but including the additional ones does not significantly change his overall results, he said.
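The arithmetic behind such a comparison is simple set intersection. Here is a minimal Python sketch of that kind of per-bait analysis; the bait and protein names are invented placeholders, not data from either study, and the exact denominator Tempst used is not specified, so the sketch takes the union of both studies’ identifications.

```python
def overlap_stats(study_a, study_b):
    """Compare per-bait protein sets from two studies over their shared baits."""
    shared_baits = study_a.keys() & study_b.keys()
    no_overlap = one_shared = 0
    matched = total = 0
    for bait in shared_baits:
        common = study_a[bait] & study_b[bait]   # proteins both studies found
        union = study_a[bait] | study_b[bait]    # proteins either study found
        if not common:
            no_overlap += 1
        elif len(common) == 1:
            one_shared += 1
        matched += len(common)
        total += len(union)
    return {
        "baits_compared": len(shared_baits),
        "pairs_with_no_overlap": no_overlap,
        "pairs_sharing_one_protein": one_shared,
        "fraction_of_proteins_matching": matched / total if total else 0.0,
    }

# Invented example data: each bait maps to the proteins pulled down with it.
cellzome = {"baitA": {"p1", "p2", "p3"}, "baitB": {"p4", "p5"}}
mds = {"baitA": {"p2", "p6", "p7"}, "baitB": {"p8", "p9"}}
print(overlap_stats(cellzome, mds))
```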
However, MDS does not seem surprised. “What I attribute the biggest variation of the data [to] is run-to-run variability because it was not automated,” said Christopher Hogue, a bioinformaticist at the University of Toronto, founding CIO of MDS Proteomics and a co-author on the company’s Nature article. Cellzome, based on 13 repeated experiments, estimated only 70 percent reliability.
The method for protein identification may also present reasons for the slim overlap in the results. MALDI mass spectrometry, used by Cellzome, is likely to result in fewer identifications than LC/MS/MS, used by MDS, Hogue said. Indeed, MDS identified about a third more distinct proteins than Cellzome for the 94 overlapping baits. Also, each group chose to filter out a different set of omnipresent proteins they assumed were nonspecific binders just along for the ride.
Differences in experimental approach, including the cells studied, could account for further variation. “You will find that different experimental techniques will pull up different pieces of information,” said Hogue. As an example, he pointed to the TAP-tag method the Cellzome group used, which is likely to favor strong interactions but miss weaker ones. Conversely, overexpressing a Flag-tagged bait, as the MDS group did, might lead to false positive interactions.
Tempst, for his part, views at least this excuse as invalid. “There should still be overlap because you might expect that if you have an approach that covers weak interactions, it will also cover strong interactions,” he said.
Whatever the reasons the two sets differ, many agree the data were produced in a rush, and validation experiments using different techniques are both time-consuming and costly. “These two papers were clearly racing to be first to publish,” said Hogue. “You cannot really look at either of these data sets […] as being a set of protein-protein interactions until they have been repeated with multiple orthogonal techniques,” he added, for example by performing reverse tagging experiments, yeast two-hybrid experiments, or genetic analysis.
Cellzome says it has internally validated some of its interactions, and Hogue has attempted another form of validation by comparing both data sets with other yeast interactions listed in the BIND database — a collection of protein interactions so far derived primarily from yeast two-hybrid studies. His recently completed bioinformatics analysis, covering 15,000 known interactions in yeast, shows that “it’s only when you put the data together that you see the real information dropping out,” he claimed.
Peer Bork, a bioinformaticist at EMBL in Heidelberg and a co-founder of Cellzome, is currently preparing a similar analysis based on the YPD and MIPS databases. “What we see is that both datasets, both the one from MDS and the one from Cellzome, do reasonably well in reproducing what was there,” he said, indicating that they may be different yet complementary.
The problem is how to make large proteomics datasets comparable, especially when researchers move on from a single-celled organism to humans. “How are you going to dissect interactions and networks, if using the same protein as bait, one finds 15 proteins, the other one finds 20 proteins, and they are all different?” asked Tempst, who nevertheless believes the complex capturing method is an excellent technique for studying protein interactions.
Tempst and Bork agreed that in the future, organizations such as HUPO could attempt to remedy the problem by introducing experimental standards. A standard set of monoclonal antibodies could serve that purpose, Tempst said, although he admitted that the cost of producing such standard antibodies would be “enormous.”
Hogue and others think that it is too early to define any standards. “We have to get through yeast before we know what works,” he said.
In the meantime, Tempst said there will still be a place for small hypothesis-driven studies for some time to validate the growing stream of large-scale protein interaction data. “Clearly, something useful is going to come out of it, if only that in the future you have to be careful,” he said. “In the end the data will or will not stand up.” — JK