A team led by University of Edinburgh researchers has developed a machine learning technique for identifying the protein components of interphase chromatin.
Detailed in a paper published this week in The EMBO Journal, the study identified a number of proteins not previously characterized as chromatin-associated. It may also point toward a new approach to organelle proteomics, Juri Rappsilber, a University of Edinburgh researcher and an author of the study, told ProteoMonitor.
Traditionally, organelle proteomics has focused on purifying the structures of interest and then characterizing the proteins present to generate a sort of "parts list," Rappsilber said.
This approach, however, assumes that organelles are discrete, definable units that can be biochemically purified – an assumption that, in many cases, is not borne out by reality, he said.
In fact, an organelle like interphase chromatin, "is not defined by a sharp boundary," Rappsilber said. "It is open access. Any protein can go in and out of chromatin once it is in the nucleus, and [the chromatin] has a huge surface area and is very charged, so why shouldn't things be there, even if they aren't functionally relevant?"
This, he suggested, means that biochemical purification approaches, which are based on the affinities of various elements, will miss the "many functional, meaningful chromatin co-factors with very low affinities and very high turnover." By the same token, Rappsilber said, such a technique will capture high-affinity elements that may have little functional relevance.
Therefore, instead of defining the organelle's components primarily via biochemical purification, the Edinburgh team sought to better account for functional relationships using a machine learning algorithm. They trained the algorithm on reference sets of well-annotated proteins and then used it to determine whether unannotated proteins played a role in interphase chromatin.
They were driven to this approach by several years spent fruitlessly trying to purify the organelle for analysis, Rappsilber said. "We used every thinkable protocol, and just whatever we did we could not get rid of things that you typically think of as contaminants."
This, he said, left him and his colleagues with two possibilities: "We could either decide that we were just incapable of biochemically purifying this organelle, or, we could decide that if we wanted to have the composition of this organelle, then we have to go a different route."
To gather the functional raw data for their analysis, the researchers quantified the proteins in enriched chromatin fractions across 19 different biological conditions including drug treatments, cell cycle phases, and cell type differences, the notion being that such global changes would affect chromatin and non-chromatin proteins differently. They analyzed these fractions – prepared via an in vivo crosslinking process they named chromatin enrichment for proteomics (ChEP) – using SILAC mass spec on a Thermo Fisher Scientific LTQ Orbitrap and LTQ Orbitrap Velos.
"We did lots of biological experiments... and there are patterns of behavior of proteins that we know are associated with chromatin," Rappsilber said. "And we asked: how do they behave? Under which conditions do we have more of them in [the ChEP fractions], [and] under which conditions do we have less?"
"So then you have lots of different profiles of different chromatin factors, and we train the machine learning tool with the profiles of hundreds of known chromatin factors and hundreds of factors that are defined as not chromatin," he said.
In total, the researchers analyzed 7,635 proteins: 1,823 had evidence in the literature linking them to chromatin, 3,972 were known not to be linked to chromatin, and 1,840 had not previously been characterized. Applying the algorithm to these 1,840, the researchers found that 576 of them had a chromatin probability of greater than 0.5, suggesting, they wrote, "that hundreds of chromatin components are presently still uncharacterized."
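The workflow described here (train on labeled profiles, then score uncharacterized proteins and apply a 0.5 probability cutoff) can be illustrated with a minimal sketch. The article does not name the algorithm the team used, so the plain logistic regression below is only a stand-in, and the profiles are synthetic random numbers rather than real ChEP measurements. All function and variable names are hypothetical; only the 19 conditions and the 0.5 cutoff come from the text.

```python
import math
import random

def predict_probability(weights, bias, profile):
    """Return a probability between 0 and 1 that a profile is chromatin-like."""
    z = bias + sum(w * x for w, x in zip(weights, profile))
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(profiles, labels, epochs=200, lr=0.1):
    """Fit logistic regression by batch gradient descent (stand-in classifier)."""
    n_features = len(profiles[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(profiles)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(profiles, labels):
            err = predict_probability(weights, bias, x) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def synthetic_profile(rng, shift, n_conditions=19):
    """Synthetic abundance profile across conditions (NOT real data)."""
    return [rng.gauss(shift, 1.0) for _ in range(n_conditions)]

rng = random.Random(0)
# Assumption: known chromatin proteins shift one way, non-chromatin the other.
chromatin = [synthetic_profile(rng, 0.5) for _ in range(100)]
non_chromatin = [synthetic_profile(rng, -0.5) for _ in range(100)]
profiles = chromatin + non_chromatin
labels = [1] * 100 + [0] * 100

weights, bias = train_logistic(profiles, labels)

# Score proteins with no annotation and keep those above the 0.5 cutoff.
uncharacterized = [synthetic_profile(rng, rng.choice([0.5, -0.5])) for _ in range(20)]
scores = [predict_probability(weights, bias, x) for x in uncharacterized]
candidates = [i for i, p in enumerate(scores) if p > 0.5]
```

The point of the sketch is the shape of the output: each uncharacterized protein receives a score rather than a yes-or-no label, and the 0.5 cutoff is applied only afterward.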
Rappsilber and his colleagues validated the algorithm's performance using ten-fold cross-validation. He noted as well that the co-classification of certain associated proteins bolstered their confidence in their results. For instance, he said, while the algorithm classified the three members of the condensin-1 complex similarly and the three members of the condensin-2 complex similarly, it showed little overlap between the condensin-1 and -2 proteins.
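For readers unfamiliar with ten-fold cross-validation, the idea can be sketched as follows: the labeled proteins are split into ten groups, and each group in turn is held out while the classifier is trained on the other nine. The nearest-centroid classifier and the synthetic profiles below are assumptions made purely for illustration, and the helper names are hypothetical. None of this reproduces the study's actual model or data.

```python
import random

def k_fold_splits(n_items, k, rng):
    """Shuffle indices and deal them into k folds of near-equal size."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def centroid(rows):
    """Column-wise mean of a list of equal-length profiles."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cross_validate(profiles, labels, k=10, seed=0):
    """Return held-out accuracy for each of k folds (stand-in classifier)."""
    rng = random.Random(seed)
    folds = k_fold_splits(len(profiles), k, rng)
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(profiles)) if i not in held]
        # Train only on proteins outside the held-out fold.
        pos = centroid([profiles[i] for i in train if labels[i] == 1])
        neg = centroid([profiles[i] for i in train if labels[i] == 0])
        correct = 0
        for i in held_out:
            closer_to_pos = squared_distance(profiles[i], pos) < squared_distance(profiles[i], neg)
            predicted = 1 if closer_to_pos else 0
            correct += predicted == labels[i]
        accuracies.append(correct / len(held_out))
    return accuracies

# Synthetic profiles across 19 conditions (NOT real data).
data_rng = random.Random(1)
profiles = [[data_rng.gauss(0.5, 1.0) for _ in range(19)] for _ in range(100)]
profiles += [[data_rng.gauss(-0.5, 1.0) for _ in range(19)] for _ in range(100)]
labels = [1] * 100 + [0] * 100

fold_accuracies = cross_validate(profiles, labels, k=10)
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
```

Because every protein is scored by a model that never saw it during training, the per-fold accuracies give a less optimistic estimate than accuracy measured on the training set itself.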
This is as would be expected, "because the two complexes have fundamentally different roles in chromatin," he said. "So that leads us to believe that these values carry biological meaning."
He added, however, that it is more difficult to actually validate that a protein identified as chromatin-associated is, in fact, part of the organelle.
"Chromatin is a very complex organelle," he said, noting that because interphase chromatin is spread across the nucleus, "you cannot by localization distinguish between just a general nuclear protein and a chromatin protein."
This means that to validate that a protein is chromatin-associated, the researchers would have to "essentially find the process that it is involved in," Rappsilber said, noting that such an effort would be "a PhD thesis, essentially, for each protein."
Instead, Rappsilber said, the researchers applied their classifications to an analysis of chromatin during replication, where, he said, "it was much easier to get localization information because you have localization to replication forks, which is a recognizable structure in the nucleus."
In this work, which Rappsilber said is slated for publication in a forthcoming issue of Nature Cell Biology, the researchers sought to validate their predictions for seven proteins, four of which they identified as chromatin factors and three of which they predicted were not functionally associated with chromatin. According to Rappsilber, their predictions proved correct for all seven proteins.
Saying that there is currently significant dissatisfaction with the state of organelle proteomics, Rappsilber suggested that the latest study might offer a path forward.
"Organelle proteomics [currently] works on the presumption that you can actually purify an organelle and make an inventory and that is the proteome of that organelle," he said. "But what we suggest is that this view certainly does not hold true with chromatin, and I personally believe it does not hold true for any organelle."
"If you ask biologists, many are actually amazingly unhappy about organelle proteomics," he added. "We think we need to move away from this idea of [proteins as these] little balls [interacting] in the cell and rather look at it as a possibility space. I think the way forward is going to be quantitative annotation. So instead of absolute annotation – yes, this is a chromatin protein – it's going to be quantitative – there is an 80 percent chance this is a chromatin protein."
Such an approach is necessary, he suggested, due not only to the imperfect readouts available for identifying organelle proteins, but also to the lack of precision within biological systems.
For instance, Rappsilber said, "in even the most accurate [cellular] process, DNA replication, there is an error every 1 in 10,000 bases. Any other process is going to have a higher error rate. So organelle proteomics cannot do better than the cell does itself."