NEW YORK – Proteomics researchers are looking to improve analyses of peptide-level data in order to achieve better protein-level quantitation.
The hope is that by making better use of these peptide measurements, scientists will improve the accuracy of the protein measurements commonly reported as the output of proteomic experiments and uncover biological signals that are lost using conventional approaches.
The trend addresses complexities that are widely acknowledged as challenges for bottom-up proteomics but that the field has tackled only fitfully, said Michael MacCoss, professor of Genome Sciences at the University of Washington. MacCoss and a number of other proteomics researchers recently co-authored a commentary, published as a bioRxiv preprint, examining the question of, as he wrote, what protein quantitation means in bottom-up proteomics.
The question stems from the nature of bottom-up proteomics, in which proteins in a sample are digested into smaller peptides prior to mass spec analysis. This digestion step makes proteomics experiments more technically tractable, as LC-MS/MS experiments analyzing peptides at proteome scale are significantly simpler than experiments analyzing intact proteins across an entire proteome. It presents, however, the problem of how to take peptide-level data and roll it up into protein identifications and quantities.
MacCoss and his co-authors noted that while researchers have developed a number of approaches for turning peptide measurements into protein data, most work under the assumption that peptides from the same protein will behave the same way.
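One common version of that roll-up is a "top-3" sum, in which a protein's quantity is taken as the summed intensity of its three most intense peptides. The sketch below is illustrative only (the function name and intensity values are hypothetical, not drawn from the preprint), but it shows the implicit assumption such approaches make: that any peptide reports on the protein equally well.

```python
def rollup_top3(peptide_intensities):
    """Roll peptide intensities up to a single protein quantity by
    summing the three most intense peptides -- a common heuristic
    that implicitly assumes all peptides track the protein equally."""
    top3 = sorted(peptide_intensities, reverse=True)[:3]
    return sum(top3)

# Hypothetical intensities for five tryptic peptides of one protein.
peptides = [1.2e6, 8.5e5, 4.1e5, 9.0e4, 3.0e4]
print(rollup_top3(peptides))  # 2460000.0
```

Any peptide that behaves differently from its siblings, whether for technical or biological reasons, is simply averaged away or dropped by heuristics like this.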
In practice, though, that isn't the case. On the one hand, there are a number of technical reasons why two peptides from the same protein may not behave the same way. For instance, different digestion efficiencies could lead to some peptides being more abundant than others. Different ionization efficiencies could similarly make one peptide more likely than another to be detected by the mass spec.
Then there are the biological considerations. Each protein coded for by a particular gene exists in multiple forms. Proteins feature alterations like amino acid variations, post-translational modifications, and truncations, meaning that within the body each gene gives rise not to a single protein species but to a collection of different "proteoforms." The presence and relative abundance of these different proteoforms have been shown in a number of cases to have important biological significance, including implications for various diseases.
The presence of these different proteoforms also contributes to differing behavior among peptides from what might appear to be the same protein. For instance, if a protein is present in both a full-length and a truncated form, expression changes in the truncated form wouldn't be detected by a peptide that isn't present in that form. Not only would this throw off protein-level quantitation, but it would also mask relative changes between the two protein forms that could be biologically important.
MacCoss and his co-authors offered the example of amyloid-beta peptides, which come from the amyloid precursor protein and have been linked to Alzheimer's disease. Aβ40 and Aβ42 (consisting of 40 and 42 amino acids, respectively) are the most commonly studied forms in the context of Alzheimer's, but more than 20 other Aβ forms have been detected in the brains of Alzheimer's patients. Additionally, shortened variants of the amyloid precursor protein are known to contribute to the amyloid plaques characteristic of patients with Alzheimer's. Rolling all of these peptide measurements up into a single amyloid precursor protein quantity would obscure much of the biology involved in the disease.
Peptide-level differences caused by these sources of variation show up in proteomic data, but to date, many approaches have sought to treat them as outliers and exclude them from calculations aggregating peptide measurements to protein quantities, said Deanna Plubell, a graduate student in MacCoss's lab.
"I think most people have treated them as noise to be filtered out," she said.
Lukas Käll, a professor at KTH Royal Institute of Technology, said that perhaps one reason researchers have paid relatively little attention to these questions around bottom-up quantitation is that the field "has been focusing on identifying peptides."
"That is a hard enough problem by itself," he said. "And it has kept us busy for a long time, and I think this has meant people haven't really had time to think about quantification as much. We needed to have good enough identification data before we could start thinking about this at all."
Käll and his colleagues have been using factor analysis to look at how peptides from the same protein vary. In this work they have found one group of peptides that appear to co-vary nicely.
"An increased concentration of the protein means that [these peptides] will also increase in concentration, and they are very orchestrated so that you can see them going up and down together," he said.
Then there were other sets of peptides that also showed co-variation, but in patterns different from the larger group.
"The question was, why is that?" he said. "And the answer we came up with is that there are multiple isoforms present, so the fact that you see maybe three peptides co-variating one way and four other co-variating another way means that there are probably two different isoforms behind the scenes there."
Käll said that he is currently working on refining approaches for clustering peptides to better understand which belong together.
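Käll's group uses factor analysis for this; a much simpler way to convey the idea is a greedy correlation-based grouping, in which peptides whose abundance profiles correlate strongly across conditions are assumed to report on the same proteoform. The sketch below is a toy stand-in for that kind of clustering, with entirely hypothetical peptide names and abundance profiles:

```python
def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def group_peptides(profiles, r_min=0.9):
    """Greedy single-link grouping: a peptide joins the first group
    containing any member its profile correlates with above r_min."""
    groups = []
    for name, prof in profiles.items():
        for g in groups:
            if any(pearson(prof, profiles[m]) >= r_min for m in g):
                g.append(name)
                break
        else:
            groups.append([name])
    return groups

# Hypothetical profiles across four conditions: three peptides rise
# together while two fall together -- the signature Käll describes
# of two proteoforms "behind the scenes."
profiles = {
    "PEP1": [1.0, 2.0, 3.0, 4.0],
    "PEP2": [1.1, 2.1, 2.9, 4.2],
    "PEP3": [0.9, 1.8, 3.2, 3.9],
    "PEP4": [4.0, 3.0, 2.0, 1.0],
    "PEP5": [3.8, 3.1, 1.9, 1.2],
}
print(group_peptides(profiles))  # [['PEP1', 'PEP2', 'PEP3'], ['PEP4', 'PEP5']]
```

Two co-varying clusters emerging from one gene's peptides is exactly the pattern that, in Käll's account, flags multiple isoforms.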
"We can definitely tell when there is differential abundance [of multiple proteoforms] and we have a good way to flag the situation," he said.
Bobbie-Jo Webb-Robertson, chief scientist for computational biology at Pacific Northwest National Laboratory, has also been working on tools to make better use of peptide-level data in protein quantitation.
"People have sort of been alluding to this [challenge] for decades, and there have been a few solutions put out in the literature, but given the scale and the diversity within complex proteomic samples, none of the tools have really been able to be used broadly," she said. "So if you want to address the challenge of proteoforms it tends to be a piecemeal solution each time."
She said that she believed that the "mechanistic models that you build from quantitative proteomics have sort of hit their limits until we can start to understand all the complexity that is hidden within those proteome data sets."
Webb-Robertson is using Bayesian statistics to assign probabilities that peptide data derive from one or more proteoforms and that individual peptides are associated with particular proteoforms.
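To give a flavor of that style of reasoning — this is not Webb-Robertson's actual model, just a toy Bayesian model comparison — one can ask whether two groups of peptide fold-changes are better explained by one shared abundance or by two distinct ones, using the BIC approximation to the model evidence and an even prior:

```python
import math

def gaussian_loglik(xs, mu, sigma):
    """Log-likelihood of data xs under a Gaussian N(mu, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (x - mu)**2 / (2 * sigma**2) for x in xs)

def fit_loglik(xs):
    """Maximum-likelihood Gaussian fit; sigma floored for stability."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu)**2 for x in xs) / len(xs)
    return gaussian_loglik(xs, mu, max(math.sqrt(var), 1e-3))

def prob_two_proteoforms(group_a, group_b):
    """Approximate posterior probability that the two peptide groups
    reflect two proteoforms (two means) rather than one, via BIC
    model comparison with a 50/50 prior."""
    xs = group_a + group_b
    n = len(xs)
    bic1 = -2 * fit_loglik(xs) + 2 * math.log(n)          # one mean, one sigma
    bic2 = (-2 * (fit_loglik(group_a) + fit_loglik(group_b))
            + 4 * math.log(n))                            # two means, two sigmas
    # BIC approximates -2*log(evidence); the difference gives a posterior.
    return 1 / (1 + math.exp(-0.5 * (bic1 - bic2)))

# Hypothetical log2 fold-changes: one peptide group up, the other down.
print(prob_two_proteoforms([1.0, 1.1, 0.9], [-0.8, -1.0, -0.9]))
```

With one group rising and the other falling, the two-proteoform model wins decisively; when both groups move together, the posterior drops below 0.5.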
She said that she hopes researchers begin incorporating more biological data to inform such analyses.
"That's what's missing right now," she said. "What isn't integrated into any of these models is, [for instance], where the peptide sits on the sequence, if a peptide is actually independent of other measured peptides,… any knowledge about post-translational modification sites on a peptide."
"That's where the future really needs to go," she said. "We need to really do some sort of bio-informed machine learning to identify which peptides actually would be properly assigned to a particular proteoform or splice variant or whatever protein form it might be."
Webb-Robertson said that better ways are also needed to present mass spec users with data that reflects the complexity of the peptide-level measurements in a usable way.
At UW, Plubell has been working on visualization tools to help her and her colleagues get a better handle on such data.
"We can visualize how peptides are being quantified across conditions on the individual peptide level across the gene with known domains and [post-translational modifications]," MacCoss said. "And that allows you to start to understand why the peptides may be different and has been very helpful in trying to understand how to interpret these peptides."
MacCoss added that the move within proteomics towards higher throughput and larger sample sets spanning multiple conditions should help researchers better tease out some of the biological signal lost by existing methods.
"Historically, a lot of experiments were just cases and controls, but the more conditions you have where those controls are expanded to other disease types, the more likely you are to see a subset of the peptides changing differently from the rest of the peptides and the more likely you are to find things that co-vary," he said. "If you just have two conditions, you basically get peptides either not changing or going up or going down, but if you have more than two conditions you can start to see some complex co-expression patterns."
Top-down proteomics researchers have long highlighted the challenges that proteoforms present to bottom-up workflows. Neil Kelleher, a professor of chemistry at Northwestern University and a leading top-down researcher, said that getting good sequence coverage remains a significant challenge for bottom-up proteomics. But for proteins where bottom-up data does provide good coverage of their multiple constituent peptides, he said, researchers could potentially detect the impact of different proteoforms on the biological questions they are studying.
"You've minimized the variation in the measurements, you've improved the sensitivity and reproducibility of bottom-up proteomics, so you can start to detect all the peptides that are variants and not throw them away," he said, adding that he thought it was key for the bottom-up community to build tools to do that.
He suggested that one way forward could be to interpret bottom-up data in the context of top-down data. He cited the example of phosphorylated tau, which he noted bottom-up data has shown to have as many as 55 different phosphorylation sites, meaning that the number of theoretically possible tau phosphoforms is 2^55.
"That's an insanely large number," he said. "If you said for even just that one gene, I need to infer proteoforms from bottom-up data—yeah, good luck, it's too big a number."
However, Kelleher noted, the number of proteoforms that actually occur in nature is far smaller than the number that could exist theoretically. Using top-down methods to catalogue commonly observed proteoforms and then matching bottom-up data against that catalogue could make the problem more tractable, he suggested.
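The arithmetic behind Kelleher's point, and a sketch of the catalogue-matching idea, can be shown in a few lines. The catalogue entries and phosphosite numbers below are illustrative placeholders, not a real proteoform reference set:

```python
# Enumerating every combination of 55 phosphosites is hopeless:
print(2**55)  # 36028797018963968 theoretically possible phosphoforms

def match_catalogue(observed_sites, catalogue):
    """Return the catalogued proteoforms consistent with the set of
    phosphosites observed in bottom-up data, instead of enumerating
    all 2^55 combinations."""
    return [name for name, sites in catalogue.items()
            if observed_sites <= sites]

# Hypothetical catalogue of top-down-characterized tau phosphoforms.
catalogue = {
    "tau-P1": {181, 202, 205},
    "tau-P2": {181, 217},
    "tau-P3": {202, 205, 396, 404},
}
print(match_catalogue({202, 205}, catalogue))  # ['tau-P1', 'tau-P3']
```

Searching a catalogue of, say, a few thousand observed proteoforms is trivial; searching a space of roughly 3.6 x 10^16 theoretical ones is not.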
Last year, Kelleher and a number of other top-down researchers announced the launch of the Human Proteoform Project, which aims, as they wrote, "to generate a definitive reference set of the proteoforms produced from the genome."
"In my mind we are still kind of in the hypothesis generation mode," Plubell said. "We can see these peptide differences and we have this biological background to say, oh, we think this peptide is changing because there's a known modification site or something, but it still definitely requires validation and follow-up, as is the case with most mass spec analysis."