NEW YORK – Researchers at the University of Washington have developed a machine-learning approach that could improve analysis of small-scale proteomic datasets.
The method, detailed in a paper published this month in the Journal of Proteome Research, could prove useful for emerging applications like single-cell proteomics where researchers are often working with small numbers of high-quality peptide identifications, said William Noble, professor of genome sciences and computer science at UW and senior author on the study.
Noble is one of the developers of the proteomics software tool Percolator, which uses semi-supervised machine learning to help improve the confidence of peptide identifications in a mass spec experiment, thereby boosting the number of peptides and proteins researchers are able to identify in a given sample.
The software is one of several tools that take this approach, with perhaps the other most popular being the PeptideProphet tool developed by researchers at the Institute for Systems Biology.
Both software packages use data from the datasets they are analyzing to develop models for assessing the quality of peptide identifications in those datasets. Such an approach works well in large datasets with substantial numbers of high-quality identifications, but it is less effective in smaller datasets where solid identifications could be sparser, Noble said.
"What Percolator really needs is a significant number of high-quality, real identifications that it can pick out easily at the beginning and leverage to sort of bootstrap a model based on those, and the find additional [identifications] that are of sort of intermediate quality," he said.
A small dataset presents two potential problems, he said. First, there might simply not be enough spectra for the software to bootstrap a model at all. Second, even if there are enough spectra, they might be too noisy, with too few high-quality identifications to build a model from.
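The bootstrapping loop Noble describes can be illustrated with a toy sketch. This is not Percolator's actual code: a least-squares linear discriminant stands in for its support vector machine, the two PSM features are synthetic placeholders, and the decoy-based FDR estimate is simplified.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy PSM feature table: column 0 is a search-engine score, column 1 a
# second feature. Targets are a mix of true hits (shifted distribution)
# and random matches; decoys are all random matches.
n = 500
true_hits = rng.normal([2.0, 1.5], 1.0, size=(n // 2, 2))
random_targets = rng.normal(0.0, 1.0, size=(n // 2, 2))
targets = np.vstack([true_hits, random_targets])
decoys = rng.normal(0.0, 1.0, size=(n, 2))

def confident_targets(scores_t, scores_d, fdr=0.01):
    """Indices of target PSMs accepted at a decoy-estimated FDR."""
    order = np.argsort(scores_t)[::-1]  # best-scoring first
    ranks = np.arange(1, len(order) + 1)
    decoys_above = np.array([np.sum(scores_d >= scores_t[i]) for i in order])
    est_fdr = decoys_above / ranks      # decoys past cutoff / targets past cutoff
    passing = np.nonzero(est_fdr <= fdr)[0]
    k = passing.max() + 1 if passing.size else 0
    return order[:k]

# Start from the raw search-engine score alone, then iterate: take targets
# that pass a strict FDR as positives, all decoys as negatives, refit a
# linear model, and re-score everything with the learned weights.
w = np.array([1.0, 0.0])
for _ in range(3):
    s_t, s_d = targets @ w, decoys @ w
    positives = targets[confident_targets(s_t, s_d)]
    X = np.vstack([positives, decoys])
    y = np.concatenate([np.ones(len(positives)), -np.ones(len(decoys))])
    # Least-squares linear discriminant as a stand-in for Percolator's SVM.
    coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
    w = coef[:2]

final_ids = confident_targets(targets @ w, decoys @ w)
print(f"{len(final_ids)} target PSMs accepted at 1% FDR")
```

The sketch also shows where the approach breaks down: if the first pass over a small or noisy dataset yields too few confident positives, there is nothing to fit the next model on, which is exactly the failure mode Noble describes.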
Noble highlighted single-cell proteomics, which has recently become a growing area of interest within the field, as a space where these problems often apply.
Single-cell efforts "don't tend to be huge experiments, and then even when they do have lots of spectra, they are pretty noisy spectra," he said.
The emergence of single-cell work, along with other types of proteomics experiments like cross-linking mass spec where limited or noisy data is also an issue, led Noble and his colleagues to look at how they might improve Percolator's performance in these settings, he said.
To address the problem, they developed a static modeling approach in which, rather than training a model on the dataset being analyzed, the tool trains on a larger dataset to generate a model that can then be applied to the smaller experiment.
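A minimal sketch of the static-modeling idea, under the same simplified assumptions as before (toy two-feature PSMs, a least-squares discriminant in place of Percolator's SVM, and a crude decoy-based cutoff rather than a full FDR procedure):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n_true, n_random):
    """Toy two-feature PSMs: targets (true hits + random matches) and decoys."""
    targets = np.vstack([
        rng.normal([2.0, 1.5], 1.0, size=(n_true, 2)),
        rng.normal(0.0, 1.0, size=(n_random, 2)),
    ])
    decoys = rng.normal(0.0, 1.0, size=(n_true + n_random, 2))
    return targets, decoys

# Fit a linear scoring model once on a large, well-behaved dataset.
# (For simplicity this trains targets vs. decoys directly rather than
# running the full semi-supervised loop.)
big_targets, big_decoys = simulate(5000, 5000)
X = np.vstack([big_targets, big_decoys])
y = np.concatenate([np.ones(len(big_targets)), -np.ones(len(big_decoys))])
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
w = coef[:2]  # frozen "static" weights

# Apply the frozen weights to a small, noisy experiment: no retraining,
# just re-score and use the small dataset's own decoys as a cutoff.
small_targets, small_decoys = simulate(30, 70)
scores_t = small_targets @ w
scores_d = small_decoys @ w
threshold = scores_d.max()  # accept only targets scoring above every decoy
n_ids = int(np.sum(scores_t > threshold))
print(f"{n_ids} confident IDs in the small experiment")
```

The key point is that the small experiment contributes no training signal at all: its handful of noisy spectra are only scored and thresholded, so there is no model to bootstrap and nothing for sparse data to destabilize.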
In the JPR study, Noble used the static modeling approach to analyze mass spec data from histone gel band experiments and from single-cell data generated using the single-cell proteomics by mass spectrometry (SCoPE-MS) approach. In the former case, the shift to the static modeling method boosted peptide identifications in 52 of 72 datasets analyzed, with some experiments seeing a rise in IDs of as much as 22 percent. In the single-cell data, the researchers found that static modeling increased peptide IDs in 58 of 65 experiments while also improving the consistency with which peptides were measured across experiments — a key consideration for comparing protein expression across samples.
Noble said his lab is using the approach for its work in cross-linking mass spec applications, which he noted was the original inspiration for the project.
"That was what initially led us to this application," he said, adding that his lab was currently putting together a paper demonstrating the use of the method for cross-linking experiments.
"There may be other people out there with other applications, as well," Noble said. "That is why we wanted to make it available, to let people try it out."