NEW YORK (GenomeWeb) – A team led by researchers at Pacific Northwest National Laboratory has devised a Bayesian approach to proteoform modeling that could offer improved protein quantitation in large-scale proteomic experiments.
Described in a study published last week in Molecular & Cellular Proteomics, the approach aims to take into account the multiple forms of a given protein that could exist in a sample, allowing researchers to arrive at more accurate calculations of protein abundance from identified peptides.
The basic notion underlying mass spec-based proteomics is that mass spectra generated by analysis of peptide fragments can be matched to specific peptides and that those peptides can then be assigned to particular proteins, allowing researchers to determine the protein composition of a sample.
The problem, noted Bobbie-Jo Webb-Robertson, a PNNL researcher and first author on the paper, is that "when you get into [analysis of] higher organisms, you have a lot of peptides that map to multiple proteins and map to multiple isoforms of the same protein."
"A lot of the challenge lies on the computational side," she said. "How do you take those peptides and infer the correct protein abundance? How do you know when you have peptides that are pointing to more than one protein?"
The lack of methods to adequately deal with such issues is possibly a key factor underlying proteomics' poor track record to date in biomarker discovery, Webb-Robertson told ProteoMonitor.
Traditionally, researchers have approached the question of assigning peptides to proteins from the standpoint of optimization – finding the minimum number of proteins that explains the peptides identified by a mass spec experiment.
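The optimization view described above is often framed as a set-cover problem. The following is a minimal sketch of a greedy approximation, not the procedure of any particular search tool; the function name `greedy_minimal_proteins` and the toy peptide and protein labels are hypothetical.

```python
def greedy_minimal_proteins(peptide_to_proteins):
    """Greedy approximation to the smallest protein set explaining all peptides."""
    # Invert the mapping: protein -> set of peptides it could explain.
    protein_to_peptides = {}
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides.setdefault(protein, set()).add(peptide)

    # Only peptides that map to at least one protein can be explained.
    unexplained = {pep for pep, prots in peptide_to_proteins.items() if prots}
    chosen = []
    while unexplained:
        # Pick the protein covering the most unexplained peptides.
        # Iterating in sorted order makes tie-breaking deterministic.
        best = max(
            sorted(protein_to_peptides),
            key=lambda prot: len(protein_to_peptides[prot] & unexplained),
        )
        chosen.append(best)
        unexplained -= protein_to_peptides[best]
    return chosen


# Toy data (hypothetical): PEP2 and PEP3 are shared between proteins.
peptides = {
    "PEP1": ["A"],
    "PEP2": ["A", "B"],
    "PEP3": ["B", "C"],
    "PEP4": ["C"],
}
print(greedy_minimal_proteins(peptides))  # ['A', 'C']
```

In this toy case protein B is dropped because A and C together explain every peptide, which illustrates how a minimal-set criterion can discard a protein or isoform that is in fact present.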
Protein quantitation, then, has traditionally involved "taking all my peptides for [a given] protein, running a linear model or some sort of averaging algorithm, and then just saying that if I put all of that peptide evidence together, there is my protein [quantity]," Webb-Robertson said.
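As a rough illustration of the averaging strategy described above, the sketch below rolls peptide intensities up to a single protein value by taking the median of log2 intensities. The choice of the median, the function name `rollup_protein_abundance`, and the toy intensities are assumptions made for illustration only.

```python
import math
from statistics import median


def rollup_protein_abundance(peptide_intensities, peptide_to_protein):
    """Summarize each protein as the median log2 intensity of its peptides."""
    by_protein = {}
    for peptide, intensity in peptide_intensities.items():
        protein = peptide_to_protein[peptide]
        by_protein.setdefault(protein, []).append(math.log2(intensity))
    return {protein: median(values) for protein, values in by_protein.items()}


# Toy data (hypothetical): four peptides assigned to one protein, P.
# Two peptides sit at log2 intensity 10 and two at log2 intensity 8.
intensities = {"pep1": 1024, "pep2": 1024, "pep3": 256, "pep4": 256}
assignment = {"pep1": "P", "pep2": "P", "pep3": "P", "pep4": "P"}
print(rollup_protein_abundance(intensities, assignment))  # {'P': 9.0}
```

The single reported value of 9.0 falls between the two groups of peptides and matches neither of them.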
"But when you get into complex disease and you are trying to find a biomarker that may be a low abundance protein or some specific isoform of a protein, that strategy doesn't really work anymore," she noted.
In an effort to improve protein quantitation, the PNNL researchers developed an approach in which they generate a statistical signature for each individual peptide. They then examine how frequently they observe that peptide's signature and use this frequency information to determine the number of different proteoforms of a given protein that are likely present.
"If I see [a given peptide] signature more often than I would expect to by chance, then I infer that that means I have two forms of that protein that are being expressed, or more than that," Webb-Robertson said.
She noted that recently other groups have been attacking the problem from a different angle, by taking the peptide level data, then correlating the different peptide levels and using clustering approaches to determine how many different proteoforms are likely present.
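The correlate-then-cluster idea can be sketched as follows. This is a generic illustration, not the PQPQ algorithm: peptides are linked when the Pearson correlation of their abundance profiles across samples meets a threshold, and each connected group is treated as a candidate proteoform. The threshold of 0.8, the function names, and the toy profiles are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def cluster_peptides(profiles, threshold=0.8):
    """Group peptides into connected components of the correlation graph."""
    unassigned = set(profiles)
    clusters = []
    for seed in sorted(profiles):
        if seed not in unassigned:
            continue
        unassigned.remove(seed)
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            # Collect still-unassigned peptides correlated with the current one.
            linked = [
                pep
                for pep in sorted(unassigned)
                if pearson(profiles[current], profiles[pep]) >= threshold
            ]
            for pep in linked:
                unassigned.remove(pep)
                cluster.append(pep)
                frontier.append(pep)
        clusters.append(sorted(cluster))
    return clusters


# Toy data (hypothetical): abundance of four peptides across four samples.
profiles = {
    "pepA": [1, 2, 3, 4],
    "pepB": [2, 4, 6, 8.5],
    "pepC": [4, 3, 2, 1],
    "pepD": [8, 6, 4.5, 2],
}
print(cluster_peptides(profiles))  # [['pepA', 'pepB'], ['pepC', 'pepD']]
```

Two clusters emerge because pepA and pepB rise across the samples while pepC and pepD fall, which a correlation-based method would take as a hint of two proteoforms.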
In the MCP paper, the researchers compared their Bayesian approach to such a correlation and clustering method – protein quantification by peptide quality control (PQPQ) – and found that while the two methods had essentially equal sensitivity in terms of identifying proteoforms present, the PNNL group's approach demonstrated better specificity, resulting in fewer false positive proteoform identifications.
Webb-Robertson said that ideally her team would like to combine the two methods, noting in particular that correlation and clustering methods like PQPQ offer a potential corrective to the PNNL approach by countering the notion that peptides are independent of one another – an assumption built into their method.
She said that she has begun efforts to integrate the two approaches and that while she doesn't yet have "validation that my intuition is right," she nonetheless believes that the field will move toward integrating methods that will "account for both peptide correlation and statistical signatures."
More generally, Webb-Robertson said she sees the field moving toward a greater appreciation of the need to account for proteoforms in large-scale proteomic analyses.
In July, a team led by Northwestern University researcher Neil Kelleher published a paper in Journal of Proteome Research presenting what the authors called a "Bayesian framework" for improving proteoform identification in top-down proteomics.
Proteoform identification and quantitation have traditionally been key goals of top-down proteomics – indeed, they have been a primary factor driving interest in the field. Bottom-up researchers are increasingly concerned with the issue as well, though, Webb-Robertson said.
"I don't think you'll see many more papers on just doing standard protein quantitation by just adding a new covariate to a linear model, because I think reviewers will bring up the fact that you have no way to account for which proteoform a peptide belongs to," she said, adding that she believes approaches like that presented in the MCP paper will be the standard in four or five years.
She said that she and her colleagues are currently using the method in a number of biomarker discovery projects and that they hope to make the first release of the software available within the next six months.