This story originally ran on June 2 and has been updated to include comments from an outside researcher.
Baylor College of Medicine scientists have completed an almost decade-long study of endogenous coregulator protein networks that suggests that roughly half the human proteome is involved in DNA transcription.
Their work, which was detailed in an article published this month in Cell, used data generated by mass spec analysis of 3,290 affinity-purified protein complexes from HeLa S3 cells to identify more than 11,000 proteins involved in regulating gene expression.
In addition to demonstrating the extent of the protein coregulator complexome, the study provides a conceptual framework for understanding protein-protein interactions throughout the body and for investigating their roles in polygenic diseases, said Bert O'Malley, chair of Baylor's Department of Molecular and Cellular Biology and lead author on the paper.
The research, O'Malley told ProteoMonitor, follows the growing understanding that, in vivo, individual proteins are rarely found interacting with one another in isolation. Rather, he noted, proteins typically act in complexes, with "one protein complex [interacting] with another protein complex."
Understanding the role of proteins involved in regulating a process like transcription, therefore, means identifying not just the constituents of a given protein complex, but also which complexes interact with which other complexes.
This meant taking a different approach to antibody-based affinity purification than is typical in proteomics work, O'Malley said. He noted that, while researchers usually try to purify target proteins as thoroughly as possible before mass spec analysis, the Baylor team's aims called for a less stringent separation.
"In proteomics it's usually, 'Let's really purify this complex so you don't have non-specific proteins and so forth,'" he said. "But when you purify a complex, first you lose some of the proteins in the complex, and you also lose almost all the proteins [from other complexes] that touch the [target] complex."
After starting out using rigorous column-based separation techniques, the researchers "had to back up and do a very gentle purification, pulling down all the proteins" in order to capture not just the target complex, but other more loosely associated proteins as well, O'Malley added.
This shift in strategy also changed the researchers' approach to antibody selection, he said, noting that in the beginning "it was a prime factor on our list to get very specific antibodies that we were sure only reacted against the protein that we wanted."
Such specificity turned out to be relatively unimportant, though, due to the vast amount of protein complex data they generated and the ability of their bioinformatic tools to sort through that data on the back end to identify proteins associating together in complexes.
"As we got into it we realized that after you create a certain [size] database, the antibody specificity becomes less relevant," O'Malley said. "It's all about having a massive amount of data. Then through that data you can go back and determine what the core complexes are and what complexes touch those complexes."
Baylor professor Jun Qin led the mass spec analyses used to identify the purified proteins. Here, too, the researchers saw a shift over the course of the project, O'Malley said, as advances in technology – the team started with a Thermo Fisher LTQ ion trap before moving to an LTQ Orbitrap Velos machine – allowed them "to get deeper and deeper" into the proteome.
Ultimately, they built an interaction dataset placing 11,485 unique proteins within a three-tiered schema consisting of core protein complexes, which they termed minimal endogenous core complex modules, or MEMOs; variations on these core complexes, termed unique core complex isoforms, or uniCOREs; and multi-complex structures, which they called complex-complex interaction networks, or CCIs.
The dataset will be available through the website of the Nuclear Receptor Signaling Atlas, a trans-National Institutes of Health consortium focused on nuclear receptor and coregulation signaling. With the Salk Institute's Ronald Evans, O'Malley is co-director of NURSA, and his lab is responsible for maintaining its transcription factor database.
The Cell study has added a considerable number of proteins to that database, O'Malley said, noting that when the researchers began their work they found roughly 400 proteins that had been tied to regulation of DNA transcription in previously published studies.
"I thought, well, we'll at least hit 500 or 600 proteins. Now we're over 11,000," he said, adding that the project has given him an appreciation for the complexity and the importance of the proteome. "I went into this a decade ago as more of a DNA transcription man. Now, I am a huge believer that the future of this field is proteomics."
"One of the proteins we work with a lot is a [transcription] co-activator called SOC3, and we have found over 40 post-translational modification [sites] on that protein," O'Malley said. "So what are the potential combinations of post-translational modification that can occur in a protein with 40 PTMs? Well, it's 2 to the 40th power, which is 10^12. Now, that one protein works in a complex with another nine proteins, so you calculate the complexity of that [set of combinations] and you start to really understand the complexity of the proteome."
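The arithmetic behind that figure can be checked with a short sketch. It assumes, as a simplification not spelled out in the interview, that each of the 40 sites is independently either modified or unmodified, which gives 2 to the 40th power, or roughly 10^12, possible states. The function name `ptm_state_count` is hypothetical, and the ten-protein figure further assumes, purely for illustration, that every protein in the complex carries the same 40 sites.

```python
def ptm_state_count(sites: int, states_per_site: int = 2) -> int:
    """Count modification states, assuming every site varies independently.

    Assumption: each site is simply modified or unmodified (two states).
    Real PTM sites can carry several mutually exclusive modifications,
    so this is a simplified illustration rather than a biological model.
    """
    return states_per_site ** sites


# One protein with 40 independent on/off sites.
single_protein = ptm_state_count(40)
print(f"{single_protein:,}")    # 1,099,511,627,776
print(f"{single_protein:.2e}")  # 1.10e+12, i.e. roughly 10^12

# Hypothetical: a ten-protein complex in which every member has 40 such
# sites, for 400 independent sites in total.
ten_protein_complex = ptm_state_count(40 * 10)
print(len(str(ten_protein_complex)))  # 121 digits, i.e. on the order of 10^120
```

Under these assumptions the count grows exponentially with the number of sites, which is the combinatorial point O'Malley is making.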
This "is how we as humans can do what we can do and a worm can only do what it can do [even though] we have the same number of genes," he said. "It really emphasizes proteins and says this is where we need to go to understand the complexity of mammals and humans. It's all about combinatorial events."
"It's a really great story," said Anne-Claude Gavin, a researcher at the European Molecular Biology Laboratory who studies protein-protein interactions.
Gavin, who was not part of the Baylor project, highlighted, in particular, its use of affinity purification combined with mass spec to look at protein complexes, which, she told ProteoMonitor, she believes is the first such effort in humans.
"There have been some efforts [studying] protein-protein interaction with yeast two-hybrid [systems]," she said. However, "with the two-hybrid method you identify slightly different types of interactions. You tend to find more transient interactions and binary interactions, whereas with the [affinity purification] what you purify is really the protein machine. So this [shows] really how proteins stably interact to form a machine."
'New Medically Relevant Information'
Next, O'Malley said, he plans to bring what the researchers have learned about protein complexes to more medically oriented investigations. A better understanding of protein complexes, he suggested, could give scientists new insight into the genes and proteins involved in various polygenic diseases.
For instance, in the course of their work, the researchers pulled down a complex involving the protein Sin3B, which, O'Malley said, is thought to be potentially oncogenic. Looking at the complex further, they identified three additional oncogenic proteins.
"Now we have three or four proteins that are oncogenic in a complex; we know all the other proteins in the complex; and our prediction is that they are all oncogenic, because if they're working together and some of the proteins are oncogenes, then the others working with those proteins are oncogenic, too, and just haven't been described yet," he said. "That's new, medically relevant information that you can test out."
"If a complex works together – let's say for growth for oncogenesis – then every protein in that complex plays some fractional role in the output function of that complex," O'Malley added. "So, now you start thinking about polygenic inputs to diseases, and now you really realize how you can have three or five or 10 different genes having an input to one disease."
He cited a study that he and George Washington University researcher Rakesh Kumar published in the March edition of the Proceedings of the National Academy of Sciences in which they identified a complex containing both the Parkinson's protein DJ1 and metastasis-associated protein 1, MTA1.
"I called up the guy who had cloned this MTA1 and knocked out the gene in the mouse and asked him if his mouse had a neurological phenotype," O'Malley said. "So, they tested the mouse and it has Parkinson's-like syndrome. You assemble the [protein] complex and you start to realize how you have polygenic input to Parkinson's disease."
"There are so many genes you find with minor mutations when you do genome-wide [association studies], but they don't look like they have anything to do with each other, and you have to do a huge number of sequences before you get anything statistically [significant] and even then it doesn't quite make sense," he said. "But if you knew these things were working together in a [protein] complex, it would make sense immediately."
While the Baylor group's work focused on transcription regulators, the purification and mass spec workflow, as well as the informatics techniques used, should be applicable to other classes of proteins, O'Malley said.
Although there's been some talk of commercializing the software used to analyze the mass spec data and determine the various complexes, the researchers plan to offer it freely, he said.
"We're funded by NIH, and NIH wants us to make this open access," he said. "A scientist doesn't mind making a few dollars selling something, but when you do something that's big and for the first time, you really would like people to use it. It's not the sort of thing you're going to get super rich off of anyway."