This story originally ran on Aug. 5.
By Tony Fong
With the vast amounts of data being generated by proteomics research, particularly shotgun proteomics, the ability to reliably assess the accuracy of protein identifications is gaining urgency.
In a recently published study, researchers from Switzerland, including Ruedi Aebersold, describe a generic strategy they said will allow researchers to evaluate their confidence levels in protein identifications and decide which ones should be carried through to further verification and validation studies.
According to the researchers, their approach, called MAYU, is the first "to quantify the uncertainty of protein identifications in the context of large-scale datasets, thereby allowing [one] to automatically curate proteomics repositories of steadily increasing size." They add that the approach "will significantly enhance genome-wide studies based on shotgun proteomics strategies."
In the study, published July 16 in the online edition of Molecular and Cellular Proteomics, they write that a "fundamental goal" of proteomics is to map out the proteomes of various organisms, and that with recent advances in mass spectrometry and the adoption of shotgun proteomics, a tremendous amount of data has resulted.
"The volume and heterogeneity of proteomic data required to substantially map out a proteome pose considerable challenges to assess the confidence of peptides and proteins that are inferred from the collected fragment ion spectra," the researchers said.
While approaches for protein identification exist, they provide "reasonable to good error estimates" only for individual experiments but cannot "reliably [quantify] the confidence in protein identifications in very large, integrated datasets" of typically 100 or more LC-MS/MS runs, the authors said. "To date, protein identifications in large proteomic datasets have been compiled according to heuristic criteria for which so far no quantitative confidence measures like [false-discovery rates] have been derived at the protein identification level."
Their solution, MAYU — named for the Japanese word for cocoon — is a generic computational strategy that builds on the target-decoy approach used to estimate false-discovery rates at the peptide-spectrum match, or PSM, level. Under a target-decoy strategy, acquired fragment ion spectra are searched against a chimeric protein database containing all target protein sequences expected in the sample being analyzed along with the corresponding reversed, or decoy, protein sequences. From the PSMs mapping to the decoy sequences, an expected fraction of false-positive assignments can be derived to measure the reliability of the PSMs.
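The decoy-counting step can be sketched in a few lines of Python. The function name, the toy scores, and the simple decoys-over-targets formula below are illustrative assumptions, not MAYU's actual implementation.

```python
# Illustrative sketch of target-decoy FDR estimation at the PSM level.
# PSMs above a score cutoff are counted separately for target and decoy
# sequences; decoy hits serve as an estimate of false target hits.

def psm_fdr(psms, score_cutoff):
    """Estimate PSM-level FDR at a score cutoff.

    `psms` is a list of (score, is_decoy) tuples. Assumes the decoy
    database mirrors the target database, so each decoy hit implies
    roughly one false target hit.
    """
    targets = sum(1 for s, d in psms if s >= score_cutoff and not d)
    decoys = sum(1 for s, d in psms if s >= score_cutoff and d)
    if targets == 0:
        return 0.0
    return decoys / targets  # expected false positives / accepted target PSMs

# Toy data: high scores are mostly target hits, low scores are mixed.
psms = [(0.9, False), (0.8, False), (0.7, False), (0.6, True),
        (0.5, False), (0.4, True), (0.3, False), (0.2, True)]
print(psm_fdr(psms, 0.35))  # 2 decoys / 4 targets = 0.5
```

Raising the cutoff trades identifications for confidence: fewer PSMs pass, but a smaller fraction of them are expected to be false.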
In an interview with ProteoMonitor, Lukas Reiter, the co-first author of the study with Manfred Claasen and a PhD student at the Institute for Molecular Systems Biology in Zurich, said that the motivation behind the work was to investigate the so-called "one-hit wonders" in proteomics experiments — proteins inferred from only a single peptide identification — and their error rates. It had long been suspected that these single hits carry higher error rates.
He and his colleagues built MAYU around a target-decoy strategy, counting how many protein identifications map to the decoy database in order to estimate the number of false-positive search hits.
"And this number is just a statistical estimate that we can use afterward in a hypergeometric probability model to estimate the number of false protein identifications," he said. "You have a total number of protein identifications, you have a total number of protein identifications that contain false search hits, and [by putting] these two together, you can estimate the number of false protein identifications.
"The special thing about MAYU is not that it can estimate the reliability of search hits but [that] it can estimate the reliability of protein identifications," Reiter added. "To accomplish this, it does not count the PSMs on the decoy database but the protein identifications on the decoy database," or the protein identifications containing false PSMs. "This number is then used to estimate the error rate of protein identifications using a hypergeometric probability distribution."
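Reiter's two-number description can be illustrated with a small expected-value calculation. The formulation and every number below are hypothetical assumptions for the sake of the sketch, not figures or formulas from the study; MAYU's actual model is more involved.

```python
# Illustrative hypergeometric model: the database entries that carry only
# false PSMs are treated as uniform random draws from the protein database,
# so their overlap with the reported identifications is hypergeometric.
from math import comb

def expected_false_protein_ids(db_size, n_identified, n_false_entries):
    """Expected number of false protein identifications.

    Sums k * P(k) over the hypergeometric distribution of k false
    entries among the n_identified reported proteins.
    """
    expectation = 0.0
    for k in range(min(n_identified, n_false_entries) + 1):
        p = (comb(n_false_entries, k)
             * comb(db_size - n_false_entries, n_identified - k)
             / comb(db_size, n_identified))
        expectation += k * p
    return expectation

# Hypothetical numbers, kept small for illustration: a 2,000-entry
# database, 300 identified proteins, and 15 entries estimated (via the
# decoy database) to carry only false PSMs.
e_false = expected_false_protein_ids(2000, 300, 15)
protein_fdr = e_false / 300
print(round(e_false, 2), round(protein_fdr, 4))  # 2.25 0.0075
```

Note that the hypergeometric mean reduces to n_identified * n_false_entries / db_size; the explicit sum is shown only to make the distribution visible.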
The researchers validated MAYU in three ways. First, they assessed the "robustness of the FDR estimates under violations of the underlying assumptions." They also compared MAYU's FDR estimates with an approach that estimates the FDR of single-PSM protein identifications, or single hits, based on isoelectric-point information from an isoelectric focusing experiment, using 67 LC-MS/MS runs from a C. elegans dataset.
In both cases, they found the MAYU strategy held up to scrutiny.
Lastly, they validated MAYU's FDR estimates by confirming the single-hit FDR with synthesized peptides "corresponding to single hits" in the largest available C. elegans dataset.
For this last validation approach, they compared the tandem mass spectra of the synthetic peptides against the tandem mass spectra from the C. elegans dataset. The research team reported that the fraction of false positives among the peptides of interest, 0.49, was consistent with MAYU's estimate of 0.47, and concluded that MAYU's estimates are accurate in the context of a very large dataset.
Using this dataset, they also evaluated how the size of a dataset affects protein identification FDR. They sub-sampled the dataset — looking at almost 5.9 million tandem mass spectra generated in 1,350 LC-MS/MS runs — into 20 data units of increasing size, estimating for each of the units the FDR of the protein identifications defined for varying PSM FDR cutoffs.
They discovered that as a dataset grows, the protein identification FDR scales up disproportionately relative to the PSM FDR, with the discrepancy rising to a more than 20-fold difference.
"And the reason for this is that … you have in these very, very large shotgun proteomics datasets, large redundancy," Reiter said. "You really identify some proteins several thousand times."
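That redundancy argument can be illustrated with a deterministic back-of-the-envelope model: true PSMs keep re-identifying the same proteins while false PSMs, landing roughly uniformly across the database, keep hitting new entries. The uniform-false-hit assumption and all numbers below are hypothetical, not taken from the study.

```python
# Back-of-the-envelope model of why protein-level FDR grows with
# dataset size even at a fixed PSM-level FDR.

def protein_fdr_vs_size(n_psms, psm_fdr=0.01, n_true_proteins=3000, db_size=20000):
    """Rough expected protein FDR for a dataset of n_psms PSMs.

    True PSMs (fraction 1 - psm_fdr) redundantly re-identify the same
    n_true_proteins, so true identifications saturate. False PSMs land
    uniformly on the db_size entries; the expected number of distinct
    entries hit by k uniform draws is db_size * (1 - (1 - 1/db_size)**k),
    which keeps growing with k.
    """
    false_psms = n_psms * psm_fdr
    true_ids = min(n_true_proteins, n_psms * (1 - psm_fdr))
    false_ids = db_size * (1 - (1 - 1 / db_size) ** false_psms)
    return false_ids / (false_ids + true_ids)

for n in (10_000, 100_000, 1_000_000):
    print(n, round(protein_fdr_vs_size(n), 3))
```

Under these toy parameters the protein FDR climbs by more than an order of magnitude as the dataset grows a hundredfold, even though the PSM-level FDR is held constant — the qualitative effect the authors describe.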
He and his colleagues suggest that in order to attain "acceptable" protein identification FDR, "PSMs have to be selected exceedingly stringently with increasing dataset size."
They also compared MAYU with two other FDR-estimation approaches, ProteinProphet and a naïve target-decoy strategy. They found estimates derived with ProteinProphet to be too "optimistic," while estimates derived with the naïve target-decoy approach were too "pessimistic."
Smaller datasets of up to 50 LC-MS/MS runs, they said, yield "reasonable" protein identification FDR estimates, but scaling up the dataset size increased the discrepancy between MAYU's estimates and those from ProteinProphet and the naïve target-decoy strategy.
Though MAYU's greatest value may be its application to very large datasets, Reiter said that even with datasets of 50 LC-MS/MS runs, "you can already see significant elevation of these error rates, about five times. … So if you really want to have full control of the quality of your dataset, I would suggest to also use it in mid-sized datasets."
MAYU also works for very small datasets, but the difference in error rates compared to other methods is minor, he said.
And though the MCP study focused on false-positive protein identifications, Reiter said MAYU can also be used to analyze false negatives. In large datasets, he noted, lowering the filtering criteria does little to change the number of true protein identifications.
"This means in this dataset that when you lower your cutoff, the reason you see an increase in protein identifications is mainly because false protein identifications are accumulating," he said. In such large datasets, even with stringent cutoffs, the maximum number of true protein identifications has already been attained.
"Because of the asymptotic behavior of the estimated true protein identifications, it is also possible to estimate the total number of true protein identifications in the data set and therefore the false negatives," Reiter said.
MAYU is publicly available for download and is implemented in the Trans-Proteomic Pipeline.