Researchers from Cellzome, the Institute for Systems Biology, and the University of California, Los Angeles, have developed a computational method that could help proteomics evolve from a qualitative discovery tool into more of a quantitative assay platform — a longtime goal of proteomics practitioners.
The approach, published in this week’s Nature Biotechnology, is based on the experimental finding that certain peptides are consistently identified in mass spectrometry experiments, while others are hardly detected at all.
The researchers determined that one or a few of the commonly observed, or “proteotypic,” peptides could uniquely identify any given protein — a possibility that could vastly simplify many proteomics experiments.
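To make the "uniquely identify" part of that idea concrete, here is a minimal Python sketch that digests proteins in silico and keeps only peptides that occur in exactly one protein. It covers uniqueness only, not how often a peptide is observed. The trypsin rule (cleave after K or R unless followed by P), the minimum length of six residues, and the two toy sequences are assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def tryptic_digest(sequence, min_length=6):
    """Cleave after K or R unless the next residue is P (a common trypsin rule).

    Peptides shorter than min_length are dropped; the cutoff is an assumption.
    """
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal remainder
    return [p for p in peptides if len(p) >= min_length]


def unique_peptides(proteins):
    """Map each peptide that occurs in exactly one protein to that protein."""
    owners = defaultdict(set)
    for name, sequence in proteins.items():
        for peptide in tryptic_digest(sequence):
            owners[peptide].add(name)
    return {pep: next(iter(names)) for pep, names in owners.items() if len(names) == 1}


# Hypothetical toy sequences, invented for this example.
toy_proteins = {
    "PROT_A": "MSTAGKLLEDVIRAAGSTPEWK",
    "PROT_B": "MSTAGKGGHYTWNDLKAAGSTPEWK",
}

for peptide, protein in sorted(unique_peptides(toy_proteins).items()):
    print(peptide, "->", protein)
```

In this toy set the peptides shared by both sequences are discarded, and only the one peptide specific to each protein remains as a candidate identifier.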
Using a database of yeast proteomic data containing more than 4,000 proteins and 600,000 peptides, the researchers first classified all the peptides into two groups, proteotypic and nonproteotypic. They then evaluated around 500 biophysical properties, such as charge, secondary structure, and hydrophobicity, to determine which characteristics best distinguished the set of proteotypic peptides from the set of unobserved peptides.
The properties with the most discriminating power were used to develop a computational tool that could predict proteotypic peptides directly from sequence data with greater than 85 percent cumulative accuracy.
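The property-ranking step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' method. It computes three simple properties per peptide and ranks them by how far apart the two groups sit. The separation score, the crude charge estimate, and the toy peptide lists are all invented for the example; the hydrophobicity values are the standard Kyte-Doolittle scale. The article does not say which statistic or classifier the paper actually used.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def features(peptide):
    """Three illustrative properties; the study evaluated around 500."""
    return {
        "length": len(peptide),
        "mean_hydrophobicity": mean(KYTE_DOOLITTLE[aa] for aa in peptide),
        # Crude charge estimate (assumption): +1 per K/R, -1 per D/E.
        "crude_charge": sum(aa in "KR" for aa in peptide)
        - sum(aa in "DE" for aa in peptide),
    }


def separation(group_a, group_b):
    """Gap between group means, scaled by the spread of all values (assumed score)."""
    spread = pstdev(group_a + group_b)
    if spread == 0:
        return 0.0
    return abs(mean(group_a) - mean(group_b)) / spread


def rank_features(proteotypic, nonproteotypic):
    """Return (property, score) pairs, most discriminating first."""
    pos = [features(p) for p in proteotypic]
    neg = [features(p) for p in nonproteotypic]
    scores = {
        name: separation([f[name] for f in pos], [f[name] for f in neg])
        for name in pos[0]
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical peptides, invented for this example.
toy_proteotypic = ["LLAVIGK", "AVLIFGK", "VLAAILK"]
toy_nonproteotypic = ["SGNQTK", "NQSGTK", "TSNGQK"]

for name, score in rank_features(toy_proteotypic, toy_nonproteotypic):
    print(f"{name}: {score:.2f}")
```

A real predictor would feed the top-ranked properties into a trained classifier and check its accuracy on held-out peptides; the sketch stops at the ranking step.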
Bernhard Kuster, vice president of analytical sciences and informatics at Cellzome and a co-author on the paper, likened the approach to that of gene prediction algorithms, which are “built essentially on what people have learned from sequencing genes, and you apply it to the human genome and you try to predict the exons that would be translated into a protein.”
Kuster noted that proteomic databases are just getting large enough to provide the kind of information that can guide development of robust predictive methods, and said that the proteotypic peptide predictor is an early example of what is likely to be a growing discipline.
“We expect that the exponential growth of proteomics data repositories will enable the further refinement of predictors and enable development of predictors for experimental designs not covered in this study,” the authors wrote in the Nature Biotechnology paper.
The predictors are publicly available through the ISB’s Seattle Proteome Center. “We expect that the concept will catch on, and people will become creative and potentially improve them because they have other clever ideas on how these algorithms could be improved,” Kuster said.
The paper notes that the predictors worked just as effectively for human data as they did for yeast, a finding that Kuster described as one of the more “exciting outcomes” of the study because it showed that the method could be used to predict which peptides to expect in a proteomics experiment “even though you had never analyzed that protein before.”
Based on the performance in human, “we extrapolate from this that there should really not be any issue with applying this to any other species because it appears that the utilization of amino acids and the biophysical processes that determine the sequence in order to bring about function seems to be very conserved in nature,” Kuster said.
Kuster described a number of potential applications for the study’s findings. For example, proteotypic peptides could be used as a “data QC validation tool” to weed out false positives in experimental data sets.
Proteotypic peptides could also be used for absolute protein quantification. “It turns out that when you focus the analysis on proteotypic peptides, you can use mass-spec techniques to approximate the absolute abundance of proteins in a much more reliable way than by sequencing just any odd peptide of any odd protein,” he said.
The primary goal of the work, Kuster said, “is to take the whole proteomic workflow from a descriptive, empirical, qualitative discovery platform into more of a quantitative biological science, and ideally do that on a genome-wide scale.”
One challenge for future work in this area lies in post-translational modifications. Kuster said that it might be possible to extend the existing method to help predict glycosylation or phosphorylation. As an example, if a predicted proteotypic peptide is absent, it could be a sign that the protein was modified.
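The absent-peptide idea can be sketched as a simple set comparison in Python. The function name, the protein and peptide identifiers, and the rule of flagging only proteins that have at least one other predicted peptide observed are assumptions made for illustration. A missing peptide can have many causes besides modification, so the output is a list of candidates for follow-up, not a modification call.

```python
def missing_proteotypic(predicted, observed):
    """List predicted proteotypic peptides that were not observed.

    Only proteins with at least one predicted peptide observed are flagged,
    since a protein with none observed may simply be absent from the sample.
    """
    flagged = {}
    for protein, peptides in predicted.items():
        seen = peptides & observed
        missing = peptides - observed
        if seen and missing:
            flagged[protein] = sorted(missing)
    return flagged


# Hypothetical predictions and observations, invented for this example.
predicted = {
    "PROT_A": {"LLEDVIR", "AAGSTPEWK"},
    "PROT_B": {"GGHYTWNDLK"},
    "PROT_C": {"TTPSYEAK", "VVNDLLR"},
}
observed = {"LLEDVIR", "TTPSYEAK", "VVNDLLR"}

print(missing_proteotypic(predicted, observed))
```

Here PROT_A is flagged because one of its predicted peptides was seen and the other was not, PROT_B is skipped because nothing from it was observed, and PROT_C is skipped because all of its predicted peptides were found.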
However, Kuster stressed that post-translational modifications are the “next frontier” for proteomics. “It’s not a done deal because the base of data — of reliable, empiric physical data for training models along the lines of post translational modifications — does not exist yet,” he said.