Researchers at Texas A&M University and the University of Texas have developed an algorithm that they claim is able to detect weak peptide signals that other methods miss.
The Bayesian peptide detection algorithm, or BPDA, addresses a key step in the protein analysis pipeline for mass spectrometry-based studies — the conversion of raw spectra to a list of peptide masses.
In a paper describing the method that was recently published in BMC Bioinformatics, the authors write that BPDA "evaluates all possible combinations of peptide candidates to interpret a given spectrum and iteratively finds the best fitting peptide parameters" in order to identify more peptides than other approaches with fewer false positives.
The authors note that current peptide-detection methods — so-called template-matching algorithms — "only employ isotopic distributions and work at each single charge state alone," while BPDA "takes into account the charge state distribution as well, thus lending information to better identify weak peptide signals and produce more robust results."
Jianqiu Zhang, an assistant professor at the University of Texas and one of the co-authors on the paper, told BioInform that the team saw an opportunity to “improve the performance of protein quantification in terms of improving the coverage and sensitivity and the accuracy in protein identification, which is critical for biomarker discovery.”
In the paper, the authors note that current peptide-identification tools — like the Seattle Proteome Center's PepList; the Fred Hutchinson Cancer Research Center's msInspect; Pacific Northwest National Laboratory's Decon2LS; and the open source project OpenMS — rely on templates based on predicted isotope patterns. These tools work on small regions of the spectra at a time and determine whether a cluster of peaks matches a proposed template. If so, it is reported as a feature and subtracted from the spectra.
These methods have problems with overlapping peptides, however. In such cases, "if the peak cluster of one peptide is incorrectly matched and subtracted, the rest of the peptides can not be detected correctly based on the remaining spectrum, which will cause error propagation," the authors note.
Ulisses Braga-Neto, an assistant professor at Texas A&M University and one of the paper's authors, described the template-matching approach as trying to see if a “square peg fits in a round hole,” and noted that the approach "tries to force a match by looking locally at the data."
In addition, Zhang said that local approaches like template-matching don't account for the fact that a single peptide can register several peaks in different areas of the spectrum.
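To illustrate why one peptide can appear in several regions of a spectrum, the short sketch below computes where its isotopic peaks would fall at different charge states. It assumes positively charged, protonated ions and uses the standard relation m/z = (M + z × proton mass) / z. The function name and default arguments are hypothetical and are not taken from the BPDA paper.

```python
PROTON_MASS = 1.007276      # Da, mass of a proton
ISOTOPE_SPACING = 1.003355  # Da, approximate 13C - 12C mass difference


def peak_positions(mono_mass, charges=(1, 2, 3), n_isotopes=3):
    """Return {charge: [m/z of the first n_isotopes isotopic peaks]}.

    Assumes a protonated peptide with monoisotopic neutral mass mono_mass (Da).
    """
    positions = {}
    for z in charges:
        positions[z] = [
            (mono_mass + k * ISOTOPE_SPACING + z * PROTON_MASS) / z
            for k in range(n_isotopes)
        ]
    return positions


if __name__ == "__main__":
    # One 1,500 Da peptide yields peak clusters in three separate m/z regions.
    for z, mzs in peak_positions(1500.0).items():
        print(z, [round(mz, 3) for mz in mzs])
```

Under these assumptions, a detector that examines only one of these regions at a time has no way of knowing that the clusters belong to the same peptide.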
"If you try to detect one cluster at a time …you do not consider other clusters, resulting in high false positive rates [because] you might attribute the [clusters] to multiple [peptides],” she said.
The authors — who are all electrical engineers — decided to take a global approach, which is more common in electrical engineering research, Braga-Neto said. “Here we are talking about signal detection in noise and this is an approach that [electrical engineers] have expertise in.”
Zhang agreed, noting that electrical engineering is "the meeting ground" between practical and theoretical research. “On one hand we are trained to look at real data models and we care very much whether our model fits the real data … we also care whether our algorithm is near optimal.”
As a result, BPDA adopts a "global approach based on Bayesian signal processing" where "a model for the whole spectrum is developed and both isotope patterns and charge state distributions of peptides are considered."
BPDA involves three separate steps. First, it generates a list of peptides from one-dimensional MS data by eliminating baseline spectra that could interfere with the peptide signals. The software uses a Matlab function called mspeaks to detect peaks and then applies an equation to identify potential peptide candidates for each peak.
The peptides suggested in the first step are then used to develop a model that "considers peaks at different isotopic positions and charge states simultaneously for each peptide candidate, incorporating candidates' existence probabilities and the spectrum thermal noise."
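A rough way to picture such a whole-spectrum model is as a sum of peaks over every candidate, charge state, and isotopic position, plus noise. The sketch below is only an illustration under stated assumptions: the Gaussian peak shape, the dictionary keys, and the function name are hypothetical, and the paper's actual model may differ in its details.

```python
import math
import random

PROTON_MASS = 1.007276      # Da
ISOTOPE_SPACING = 1.003355  # Da, approximate 13C - 12C mass difference


def model_spectrum(mz_axis, candidates, peak_width=0.05, noise_sd=0.0, seed=0):
    """Predict intensities on mz_axis from a list of peptide candidates.

    Each candidate is a dict with hypothetical keys:
      "mass"      monoisotopic neutral mass in Da
      "present"   0 or 1 existence indicator
      "charge_ab" {charge: abundance observed at that charge state}
      "isotopes"  relative heights of successive isotopic peaks
    Peaks are assumed Gaussian with standard deviation peak_width.
    """
    rng = random.Random(seed)
    spectrum = []
    for x in mz_axis:
        y = 0.0
        for c in candidates:
            if not c["present"]:
                continue  # an absent candidate contributes no signal
            for z, abundance in c["charge_ab"].items():
                for k, height in enumerate(c["isotopes"]):
                    center = (c["mass"] + k * ISOTOPE_SPACING + z * PROTON_MASS) / z
                    y += abundance * height * math.exp(
                        -0.5 * ((x - center) / peak_width) ** 2
                    )
        if noise_sd > 0:
            y += rng.gauss(0.0, noise_sd)  # additive noise term
        spectrum.append(y)
    return spectrum
```

In a model of this shape, fitting means choosing the existence indicators and abundances so that the predicted spectrum matches the observed one across all regions at once, rather than one peak cluster at a time.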
As a final step, Zhang said the team uses a Gibbs sampling algorithm for parameter estimation. "We first estimate one group of parameters and then we take it as if they were true and then we plug in the conditional probability and then we estimate the next group of parameters so the algorithm iteratively updates the estimate on all the parameters and eventually it reaches the optimal solution."
By contrast, she said, other algorithms “estimate a set of parameters in one shot,” so if the first estimate is wrong, “subsequent estimations will be wrong.”
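The alternating scheme Zhang describes is the general pattern of Gibbs sampling. The toy sketch below applies it to a much simpler problem than peptide detection: estimating the mean and variance of noisy measurements. It assumes a flat prior on the mean and a 1/variance prior on the variance, and it is not the BPDA model; the function name and defaults are hypothetical.

```python
import random
import statistics


def gibbs_mean_variance(data, n_iter=2000, burn_in=500, seed=0):
    """Gibbs sampler for the mean and variance of normally distributed data.

    Returns a list of (mean, variance) draws collected after burn_in.
    """
    rng = random.Random(seed)
    n = len(data)
    ybar = statistics.fmean(data)
    mu, sigma2 = ybar, statistics.variance(data)  # starting values
    draws = []
    for it in range(n_iter):
        # Step 1: sample the mean, treating the current variance as if true.
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        # Step 2: sample the variance, treating the new mean as if true.
        # The precision (1 / variance) has a gamma conditional distribution
        # with shape n / 2 and scale 2 / ss.
        ss = sum((y - mu) ** 2 for y in data)
        sigma2 = 1.0 / rng.gammavariate(n / 2, 2.0 / ss)
        if it >= burn_in:
            draws.append((mu, sigma2))
    return draws


if __name__ == "__main__":
    data_rng = random.Random(1)
    data = [data_rng.gauss(5.0, 2.0) for _ in range(200)]
    draws = gibbs_mean_variance(data)
    print(statistics.fmean([d[0] for d in draws]))
    print(statistics.fmean([d[1] for d in draws]))
```

Because each group of parameters is repeatedly re-drawn given the latest values of the others, a poor starting value is revised over successive iterations rather than being carried forward unchanged.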
As described in the BMC Bioinformatics paper, the team compared BPDA to several commercial and open source peptide-detection packages, such as OpenMS and Decon2LS, using both real and synthetic data. In tests with synthetic data, the authors reported that BPDA outperformed OpenMS both across peptide abundance levels and in detecting overlapping peptides.
For example, the authors claim that in one test, BPDA detected all 10 peptides at a false positive rate of 0.1 and accurately identified nearly all the charge states for each peptide, while at the same false positive rate, OpenMS could only detect a few peptides and half of the charge states.
In real data generated on a MALDI-TOF machine, BPDA successfully detected six out of seven peptides, while Decon2LS, OpenMS, and flexAnalysis missed two peptides each.
In a test that looked at protein coverage using horse myoglobin, Zhang said that for the top ten percent of detected peptides, BPDA had 77 percent coverage, OpenMS had 40 percent, and Decon2LS had below five percent.
The authors conceded that while BPDA is more accurate, it is more computationally intensive.
For example, OpenMS took one minute to analyze the 10-peptide synthetic data set while BPDA required half an hour.
“We trade off in the computation for more sensitive and accurate results,” Braga-Neto said, adding that although BPDA takes longer, “computation is always a moving [target] because you [get] faster computers all the time."
In an effort to reduce the computational complexity, Zhang said the team "decoupled data groups that were uncorrelated [and] grouped together correlated data clusters."
She noted that the grouping approach will also “facilitate BPDA's implementation in parallel processing” once the algorithm is incorporated into a software package, which she anticipates will happen in about a year. That implementation, she said, “[will] guarantee the same performance but computational time can be cut significantly.”
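One simple way to picture this kind of decoupling is to merge candidates whose m/z windows overlap into the same group and to treat non-overlapping groups as independent work items. The sketch below assumes that "correlated" means "overlapping in m/z", which is a simplification: the article does not spell out BPDA's grouping rule, and the function name is hypothetical.

```python
def group_overlapping(windows):
    """Group m/z windows (lo, hi) that overlap, directly or through a chain.

    Returns a list of groups, each a list of indices into windows.
    Windows in different groups share no m/z range, so each group could be
    processed independently, for example on separate processors.
    """
    order = sorted(range(len(windows)), key=lambda i: windows[i][0])
    groups = []
    current, current_hi = [], None
    for i in order:
        lo, hi = windows[i]
        if current and lo <= current_hi:
            # Overlaps the running group: extend it.
            current.append(i)
            current_hi = max(current_hi, hi)
        else:
            # Gap found: close the running group and start a new one.
            if current:
                groups.append(current)
            current, current_hi = [i], hi
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    windows = [(500.0, 503.0), (502.5, 505.0), (700.0, 702.0), (504.9, 506.0)]
    print(group_overlapping(windows))
```

A fuller treatment would also need to link windows that belong to the same peptide at different charge states, since those fall in separate m/z regions but are still related.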
She also said that the team plans to convert the Matlab code used to develop the algorithm into both the C and R programming languages, and is also looking at the possibility of implementing BPDA in a cloud environment.
The group's first step, however, will be extending the algorithm from analyzing one-dimensional data to incorporate two-dimensional spectra, she said.