NEW YORK – A group of prominent proteomics researchers has launched an effort to encourage increased study of uncharacterized and understudied proteins.
With support from the Wellcome Trust, the group has proposed forming what it calls the Understudied Protein Initiative, an effort that will first seek to develop metrics for defining which proteins are, in fact, understudied, and then use functional proteomics to facilitate research on those proteins.
Animating the project is the observation that scientific research overwhelmingly focuses on a relatively small proportion of the human proteome. In their Nature Methods commentary, the authors observe that 95 percent "of all life science publications focus on a group of 5,000 particularly well-studied human proteins." Furthermore, many proteins remain uncharacterized despite evidence linking them to diseases or essential cell processes. For instance, the researchers note, of the roughly 1,900 proteins key to proliferation in human cell lines, more than 300 are uncharacterized.
To an extent, this reflects technical limitations. While the coverage of both mass spectrometry-based and affinity reagent-based assays has expanded rapidly in recent years, a number of proteins remain out of reach of such approaches. Roughly 10 percent of the proteins predicted to exist in the human proteome remain undetected.
Juri Rappsilber, senior author on both commentaries and a professor of proteomics at the University of Edinburgh and professor of bioanalytics at the Berlin Institute of Technology, suggested, though, that a lack of reliable functional information is perhaps the primary reason many proteins are neglected.
He offered the example of a doctoral student looking at a set of proteins potentially linked to their lab's main area of study and trying to decide which one they should pursue for their thesis.
"Of course you're going to take the one where there is already some idea of what it does," he said.
In the first place, it's likely that reagents will already be available for studying such a protein, Rappsilber said. "You can order an antibody, there are knockout cell lines, there are tools."
Even more important, having some mechanistic or functional information about the protein gives a researcher a higher likelihood of success when designing an experiment — no small consideration for a graduate student or post-doc who needs to publish to advance their career. Understandably, funding agencies are also more likely to provide money for research into proteins for which there is some functional information showing why they are of interest, Rappsilber noted.
"Everything in science is output oriented, and of course you look at what is the best starting position you can assume in order to generate some output," he said. "And when you take, say, a dozen unknown proteins where you don't even have a clue of what assays to do with them, it's not a very wise decision for anyone — not for a funding agency, not for a supervisor, not for a student."
The key to increasing the number of proteins receiving serious study, then, is to increase the number of proteins for which there is good functional information that can serve as a solid starting point for a researcher, Rappsilber said. In theory, proteomic methods, which are able to rapidly generate large quantities of data on large numbers of proteins, are well suited to this role. That has become increasingly true as the field has in recent years focused more on areas like protein-protein interaction and co-expression research, seeking not only to identify individual proteins in samples but to understand how they behave within complexes and networks and different biological contexts.
In practice, however, there remains a disconnect between the proteomics community generating this kind of data and the molecular biology community that typically does the deep dives into a small number of proteins, Rappsilber said. Many on the molecular biology side aren't aware of the data being generated on the proteomics side, and those who are often don't trust it.
"Maybe there is data out there in the appendix of some large [proteomics] study that would be interesting [to a molecular biologist], but who looks at this data?" Rappsilber said. He added that exploring and evaluating proteomic data and protein interaction data, especially at the level of raw data, takes expertise that is still spreading outside the proteomics community and into the broader molecular biology field.
"It's knowledge that only makes it slowly into laboratories," he said.
He added that more transparent accuracy metrics on the proteomics side could also improve the situation.
"If you look at large-scale pull-down studies, for example, we don't really know how successful they are," he said. "It's a lot of effort and a lot of data, for sure. But then you speak to a molecular biologist to see if they are using this data, and they say, 'I looked at my protein, and I know it behaves differently. I know you need this specific condition, this specific detergent for the pull-down to work, and I looked at the stuff they pulled down with that protein and it doesn't really make sense.'"
"When you are a molecular biologist, you have to be suspicious of everything, because it is so much time and effort you put into your leads that you would rather err on the conservative side," Rappsilber added.
The project's goal, he said, is to generate enough information around different understudied proteins to make researchers more likely to deem them worthy of in-depth investigation.
The Understudied Protein Initiative's first step is a survey, currently ongoing, in which the project's researchers present respondents with randomly selected human proteins and ask how well annotated they believe each protein is, as well as what resources and considerations they used to make their assessment. The goal of the survey is to assess what kinds of data distinguish poorly annotated proteins from well-annotated ones in the minds of researchers. This information will allow researchers both to determine what information needs to be generated on understudied proteins in order to foster more investigations and to track the progress made in generating that data, Rappsilber said. He added that the group believes it needs on the order of several tens of thousands of responses and hopes to have this data by spring 2023.
Following collection and analysis of the survey responses, Rappsilber and his colleagues plan to host a conference to discuss the results and develop ways to provide the information that the survey indicates is needed.
Rappsilber said the Wellcome Trust has agreed to fund the computational analysis of the survey data and the meeting.