NEW YORK – A group of prominent top-down proteomics researchers has proposed a new initiative to map the human proteome at the level of individual proteoforms.
Described in a paper published last week in Science Advances, the so-called Human Proteoform Project would generate a catalogue of the different protein forms, including genetic variants and post-translationally modified forms, present in various mammalian cell lines and human cells.
The authors proposed that the project aim to characterize 5,000 cell types at a depth of 1 million proteoforms per cell type, for a total of roughly 5 billion proteoform measurements, an estimated 50 million of them unique.
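The headline figures follow from simple multiplication. As a quick check, here is a minimal sketch that uses only the numbers quoted above; the variable names are illustrative, and reading the depth figure as "per cell type" is an assumption made so that the total works out.

```python
# Back-of-the-envelope check of the project's stated targets.
# All three inputs are figures quoted in the article; names are illustrative.
cell_types = 5_000
proteoforms_per_cell_type = 1_000_000  # assumption: depth is per cell type
unique_estimate = 50_000_000

total_measurements = cell_types * proteoforms_per_cell_type
unique_fraction = unique_estimate / total_measurements

print(f"Total proteoform measurements: {total_measurements:,}")
print(f"Share estimated to be unique: {unique_fraction:.0%}")
```

On these figures, only about one measurement in a hundred would correspond to a previously unseen proteoform, which suggests heavy overlap between cell types.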
Such a project would represent an ambitious expansion of existing projects aimed at cataloging the human proteome, most notably the Human Proteome Organization's (HUPO) Human Proteome Project (HPP) and the Human Protein Atlas led by researchers at the Science for Life Laboratory at the Karolinska Institute and the Royal Institute of Technology in Stockholm.
These efforts have largely focused on identifying evidence of proteins produced by the roughly 20,000 protein-coding genes in the human genome. For instance, as of its 2020 progress report, the HPP had confidently detected 17,874 proteins, accounting for more than 90 percent of the known protein-coding genes in humans. Most, if not all, of those 17,874 proteins, however, exist in a variety of forms, featuring slightly different amino acid sequences due to splice variants, or different lengths due to truncations, or different combinations of post-translational modifications.
These various forms are known as proteoforms, and the presence and proportion of different proteoforms within a cell are key to all manner of biological processes, influencing things like protein localization or protein-protein interactions or cell signaling. To fully understand the role proteins play in different aspects of biology and disease, it will likely be necessary to understand not just which proteins are expressed under different conditions, but which specific proteoforms are present, as well.
Confidently measuring proteoforms at proteome scale is a daunting challenge, however. Proteomics is just now, after more than two decades of research and technical development, reaching the point where experiments can detect proteins corresponding to most protein-coding genes. And while there has been extensive research into certain specific kinds of proteoforms, such as phosphorylated proteins, the study of proteoforms has generally lagged behind.
Additionally, the bottom-up proteomics workflows used by most researchers in the field are poorly suited to proteoform characterization. In bottom-up experiments, proteins are digested into smaller peptides prior to mass spec analysis. Measured peptides are then linked back to the proteins they came from to allow for protein identification and quantification. However, because different modifications can reside on different peptides across a protein, it is difficult, and in many cases impossible, to piece bottom-up peptide-level data together into an accurate picture of the proteoforms present in a sample.
Top-down proteomic experiments, by contrast, look at intact proteins. Because researchers in these experiments do not digest proteins into smaller peptides, they can examine the full length of each protein with all its modifications. That makes top-down an ideal approach for characterizing proteoforms, and it is likely why the proposal for the Human Proteoform Project has come from a group of top-down proteomics researchers.
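The ambiguity described above can be shown with a toy example. The sketch below is purely illustrative and is not drawn from the Science Advances paper: it assumes a hypothetical protein that digests into two peptides, "A" and "B", each of which may carry one phosphorylation ("+P"), and the `digest` helper is an invented stand-in for a real digestion and quantification workflow.

```python
from collections import Counter

# Toy model: a proteoform is a tuple of peptide states, one entry per
# peptide produced by digestion. "A" / "B" are unmodified peptides;
# "A+P" / "B+P" carry a hypothetical phosphorylation.
def digest(sample):
    """Return peptide counts for a sample given as {proteoform: copies}."""
    peptides = Counter()
    for proteoform, copies in sample.items():
        for peptide in proteoform:
            peptides[peptide] += copies
    return peptides

# Sample 1: half the molecules unmodified, half modified on both peptides.
sample_1 = {("A", "B"): 50, ("A+P", "B+P"): 50}
# Sample 2: every molecule carries exactly one modification.
sample_2 = {("A+P", "B"): 50, ("A", "B+P"): 50}

print(digest(sample_1) == digest(sample_2))  # peptide-level data are identical
print(sample_1 == sample_2)                  # proteoform content is not
```

Both samples yield the same peptide counts after digestion, so in this simplified model peptide-level data alone cannot tell them apart, even though they contain entirely different proteoforms.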
Top-down proteomics is much more technically challenging than bottom-up work, however: experiments typically top out at several thousand identified proteoforms and run at much lower throughput. The Proteoform Project, then, would require not only significant funding and work but also, as the Science Advances authors note, substantial technology development, including improvements in mass spectrometry as well as "nanopore sequencing, cryoelectron microscopy and visual proteomics, single-cell proteomics, single-molecule protein arrays, and other ideas yet to be conceived."
Neil Kelleher, director of the Chemistry of Life Processes Institute at Northwestern University and an author on the Science Advances paper, acknowledged the size of the task he and his colleagues have set for themselves and the field. But he suggested that a major effort to characterize the proteome more thoroughly at the proteoform level would pay substantial dividends across biomedicine.
"If in fact we could determine proteins with complete molecular specificity and devise a reference atlas and improve measurements to do that, then it will raise the game for all of biomedical research," he said.
For instance, Kelleher said, a catalogue of all the proteoforms in a cell would prove a boon for bottom-up researchers trying to piece together proteoform profiles from peptide-level information, because it would limit the space of potential combinations and modifications they have to consider when analyzing their data.
Kelleher suggested that a number of emerging proteomic technologies would benefit from such a resource, as well. Seattle-based Nautilus Biotechnology, for instance, is developing a proteomic platform that uses machine learning combined with iterative reads by multiple semi-specific affinity reagents to make protein identifications based on the different patterns of affinity binding observed. As with bottom-up mass spec, a resource like the Proteoform Project could help a platform like Nautilus' by better defining the possible set of proteoforms it was making identifications against.
"The way Neil states it, and I think that this is a great way to think about it, is that if you are trying to decide among every one of the infinite numbers of possibilities [for proteoforms], it is a really big search space," said Parag Mallick, founder and chief scientist of Nautilus. "And we know that there are a lot of proteoforms. But it is different to say we know there are 100 proteoforms for this particular protein than to say there are 100 billion possible proteoforms."
Given the variety of modifications and combinations that exist, the number of theoretically possible proteoforms is essentially limitless. In practice, though, the number would seem much more constrained. For instance, speaking to GenomeWeb several years ago, Kelleher noted findings from a study by his team that indicated post-translational modifications like acetylation were much rarer than the literature would suggest.
"At least we have some data, to kind of say, 'Well, OK, it's not like these proteins all have hundreds of abundant proteoforms and it's super complicated,'" he said at the time. "It's not that bad, and so we can actually do the whole proteome project, which is my dream. We can catalog them all because there's not an infinite number of them."
"The Proteoform Atlas would be really helpful in helping us to say the likelihood of that actual combination [of modifications] actually existing is staggeringly low," added Mallick, who has co-authored a paper with Kelleher exploring likely estimates for the number of proteoforms present in the human proteome. "So I'm totally on board with what Neil is talking about."
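The gap between the theoretical and the observed search space can be made concrete with a little arithmetic. The sketch below is an illustration only: it assumes each candidate modification site on a protein is independently either modified or not, ignores splice variants and truncations, and uses hypothetical site counts chosen to bracket the "100" and "100 billion" figures in Mallick's comparison. The function name `theoretical_proteoforms` is invented for this example.

```python
# Upper bound on proteoforms for one protein if each of n_sites candidate
# modification sites could independently be modified or unmodified.
# This binary, independent-site model is a simplifying assumption.
def theoretical_proteoforms(n_sites: int) -> int:
    return 2 ** n_sites

# Hypothetical site counts, chosen only for illustration.
for n_sites in (7, 20, 37):
    print(n_sites, f"{theoretical_proteoforms(n_sites):,}")
```

Under this simplified model, seven sites give 128 combinations, while 37 sites give more than 137 billion. A reference catalogue of proteoforms actually observed would let a search consider something closer to the former than the latter.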
Andreas Huhmer, global marketing director for mass spectrometry solutions at Thermo Fisher Scientific, said that he saw an uptick in top-down proteomics presentations at the recent American Society for Mass Spectrometry annual meeting, noting that he thought people were giving top-down a closer look "simply because there is so much biology hidden in the actual proteoforms itself."
"The protein identification challenge is largely solved, and you can now get coverage of whole proteomes very quickly, but the biology requires that you look at proteoforms in more detail," he said.
"Proteoforms are very important, because they have different functions," said Rohan Thakur, executive VP of life sciences mass spectrometry at Bruker. "The same protein with a different proteoform has a different function. So I think it's important to map, definitely."
Whether the funding is there to drive such an effort is uncertain, though. In their Science Advances paper, Kelleher and his colleagues suggested that funding on the order of billions of dollars would be needed for the project. That would far outstrip the funding provided to other major proteomics programs. For instance, the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium recently launched its fourth stage, which is being funded at a level of around $11 million per year.
"It needs to be funded so that academics can go after it and start supporting this kind of vision," Thakur said. "But if your funding remains limited, why would anyone switch what they are doing successfully today to embark on a project like this?"
"If funding agencies start saying, OK, we need mass spectrometers specifically for proteoforms analysis, and there is a pool of money available to the research community, that is how this thing will get cracked," he said.
Gil Omenn, a professor at the University of Michigan who has held a number of leadership positions within HUPO over the years, said that given the challenges proteomics has faced in getting funding generally, "it was a little hard to be optimistic."
"We've struggled for a long time to get proteomics higher on the agenda at the [National Institutes of Health]," he said.
He suggested the sheer scale of the challenge could make funders and funding agencies reluctant to take it on.
When "we point out all the proteoforms, we are talking about hundreds of thousands, millions of different molecular species," he said. "People glaze over when they hear numbers like that. They don't know how to deal with it. We think it's something that modern data science is primed to do … [but] the scale is daunting to most folks and most funders. They want to have projects that they understand and where they see a clear path to success and good metrics for success."
Kelleher suggested that one route by which the effort might make progress would be to participate in some of the many ongoing single-cell biology projects like the NIH-funded Human BioMolecular Atlas Program.
"There are a lot of activities [around single-cell biology], and I believe that as those continue and as they learn about the proteoform opportunity, that that represents an option to fold into those models and consortia, whether they be public or philanthropic," he said.
Whether or not the Human Proteoform Project as Kelleher and his colleagues envision it comes into being, Huhmer said he believes its focus on proteoforms points toward the future direction of proteomics.
"I think this is where the field is really going, toward understanding more about the function of the protein," he said. "I think you're going to see a pivot in this field, where in the future people will spend a lot more time doing proteoform analysis."
Omenn agreed. "I think the field is ready to move on and put much more emphasis on function and biology and systems biology and systems medicine," he said. "So all of these ideas are timely."