NEW YORK (GenomeWeb) – Harvard University researchers have published the largest protein-protein interaction study done to date, identifying more than 56,000 candidate interactions covering more than 25 percent of the known protein-coding genes in the human genome.
The study, published this week in Nature, is part of a larger effort by the researchers to work through the bulk of the human proteome. It provides a vast data set for scientists investigating the fundamentals of proteome organization and other basic biological questions, as well as for those studying the protein interactions and networks underpinning disease.
The researchers used affinity purification mass spectrometry (AP-MS), creating tagged bait proteins based on the human ORFeome, which they then expressed in HEK293T cells. They then pulled down these proteins and used mass spec analysis on a Thermo Fisher Scientific Q Exactive instrument to identify the proteins and their interaction partners.
According to the authors, their workflow allows them to identify protein interactions for as many as 500 human open reading frames per month.
The results of the recent study are compiled in the BioPlex 2.0 database, which adds more than 29,000 previously undetected interactions for a total of 56,533 interactions involving 10,961 proteins. The resource contains data from 3,297 new AP-MS experiments as well as 2,594 previously performed AP-MS experiments that have been reanalyzed in the context of the new interaction data.
Using unsupervised Markov clustering, the researchers were able to group the analyzed proteins into more than 1,300 different "protein communities," covering a wide range of cellular functions. Of these 1,300-plus communities, 442 were linked to disease, with more than 2,000 diseases annotated.
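For readers unfamiliar with Markov clustering, the sketch below shows the general idea on a toy network: random-walk probabilities are repeatedly spread out ("expansion") and then sharpened ("inflation") until tightly connected groups separate. It is an illustration only, not the researchers' pipeline. The function name, the inflation value of 2.0, and the protein labels P1 through P6 are all assumptions made for the example.

```python
def markov_cluster(edges, inflation=2.0, max_iter=100, tol=1e-6):
    """Toy Markov clustering (MCL) on an undirected, unweighted graph.

    Illustrative only: the inflation value and stopping rule are assumptions,
    not parameters reported for the BioPlex analysis.
    """
    nodes = sorted({name for edge in edges for name in edge})
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)

    # Adjacency matrix with self-loops, a common convention for MCL.
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a, b in edges:
        m[index[a]][index[b]] = 1.0
        m[index[b]][index[a]] = 1.0

    def normalize_columns(mat):
        # Make every column sum to 1 so it reads as transition probabilities.
        for j in range(n):
            total = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= total
        return mat

    m = normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: square the matrix (two-step random walks).
        expanded = [
            [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        # Inflation: raise each entry to a power, then renormalize columns.
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded]
        )
        change = max(
            abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n)
        )
        m = inflated
        if change < tol:
            break

    # Rows that retain probability mass act as attractors; the columns they
    # attract form one cluster. Identical clusters are merged via the set.
    clusters = set()
    for i in range(n):
        members = frozenset(nodes[j] for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


# Hypothetical network: two tight triangles joined by a single link.
toy_edges = [
    ("P1", "P2"), ("P2", "P3"), ("P1", "P3"),
    ("P4", "P5"), ("P5", "P6"), ("P4", "P6"),
    ("P3", "P4"),
]
print(markov_cluster(toy_edges))
```

Raising the inflation value generally yields more, smaller groups, while lowering it yields fewer, larger ones, which is why that setting matters when interpreting community counts.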
With the completion of the study, the researchers have now done a first pass through the entire ORFeome, as expressed in HEK293T cells, said Wade Harper, professor of cell biology and molecular pathology at Harvard and senior author on the paper. This represents around 13,000 different proteins, he said, though he noted that around 3,000 of these proteins could not be successfully analyzed due to challenges involved in expressing them in the HEK293T cells.
"Most often [in these cases] we were unable to make a stable cell line," he said. "The protein was toxic to the cell or the levels were too high or something like that."
Harper said that he and his colleagues planned to go back and revisit these proteins using different approaches that might enable their detection in the HEK293T cells. He added that they have also begun looking at interactions in different cell lines to determine what different proteins they might express and what the differences in their observed interactions might be.
Thus far, he said, they had found that for the majority of complexes, the interactions are "very much similar" across the different cell types.
"For the majority of complexes, you might call them 'housekeeping complexes,' they are doing the same thing in almost every cell," he said, adding that for the roughly 1,000 bait proteins they have analyzed in more than one cell line, at least 60 percent of the interactions are the same.
He suggested that this level of reproducibility indicated the effectiveness of the researchers' workflow, noting that, "Historically, if two labs do the same bait by AP-MS, oftentimes there would be very little overlap between the two experiments. Sometimes, depending on the bait, even in the same lab you might not have huge overlap. So, the fact that we were getting 60 percent overlap in different cell lines was pretty good."
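As a rough illustration of how such an overlap figure could be computed, the sketch below compares the interactions found for one bait in two cell lines. The article does not say how the 60 percent figure was calculated, so the choice of denominator (interactions seen in the first cell line) is an assumption, as are the function name and the placeholder protein labels.

```python
def shared_fraction(interactions_a, interactions_b):
    """Fraction of interactions in list A that also appear in list B.

    Assumption: overlap is measured relative to A; other definitions
    (for example, relative to the union of both lists) are equally plausible.
    """
    # Store each interaction as an unordered pair so that
    # (bait, prey) and (prey, bait) count as the same interaction.
    a = {frozenset(pair) for pair in interactions_a}
    b = {frozenset(pair) for pair in interactions_b}
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical results for a single bait in two cell lines.
cell_line_1 = [
    ("BAIT1", "PREY_A"), ("BAIT1", "PREY_B"), ("BAIT1", "PREY_C"),
    ("BAIT1", "PREY_D"), ("BAIT1", "PREY_E"),
]
cell_line_2 = [
    ("PREY_A", "BAIT1"), ("BAIT1", "PREY_B"), ("BAIT1", "PREY_C"),
    ("BAIT1", "PREY_X"),
]
print(f"{shared_fraction(cell_line_1, cell_line_2):.0%}")  # prints 60%
```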
That is particularly the case, Harper said, given the high throughput of the workflow. For instance, he said, "If you're talking about weakly associated, transient interactions, it matters a lot, for instance, how long your washes are and things like that. So, we try to standardize that, but, again, if you're talking about seeing one peptide or not seeing one peptide, things like the exact wash time may matter, and in a high-throughput [experiment] it's challenging to [account for] that."
Another limitation, he noted, is the fact that the bait proteins they introduce into the cell might not behave exactly as they would under normal biological circumstances.
"For any random bait you pick, you're tagging the C-terminus and you're putting it into a cell that it may or may not normally exist in," he said. "And so you have different possibilities where there might be an out-of-context situation. You have the potential that tagging the C-terminus kills the protein function, or it could change its localization or whatever. We can't control for that, because this is a production-type situation. [Controlling for this sort of issue] is something you can't do without doing a lower-throughput experiment."
High throughput is key to the researchers' goal of generating large-scale interaction networks across multiple tissues of interest. And such a resource can provide a starting point for additional, more targeted investigations of protein interactions.
"Once you have a map like we are trying to generate, then if you want you can go in and interrogate individual complexes [more comprehensively]," Harper said. One angle he and his colleagues are pursuing is exploring the dynamics of particular interactions they have identified.
"Most of what goes on in cells that we care about is dynamic in nature, and we aren't, in this large-scale experiment, measuring the dynamics of any process," he said. "But there are methods to look at the dynamics of interaction. Once you have a map like we're trying to generate, you can combine quantitative proteomics with some sort of signaling treatment or treatment of cells with inhibitors or activators of a pathway and look at how the interactions are changing. So, that is partly where we are going."