Deep Learning Shows Promise for Improving Proteomic Data Analysis


NEW YORK (GenomeWeb) – A pair of studies published last month in Nature Methods suggest growing momentum for the use of deep learning approaches for proteomic data analysis.

Working independently, a team led by researchers at the Max Planck Institute of Biochemistry and Verily and a team led by researchers at the Technical University of Munich (TUM) developed deep learning tools for predicting patterns of ion fragmentation in mass spec-based proteomics.

According to their developers, the tools could help boost the number of proteins confidently identified in proteomic experiments and could also streamline data-independent acquisition mass spec work by allowing researchers to run such experiments without first generating sample-specific spectral libraries.

The software packages, called DeepMass:Prism by the Max Planck team and Prosit by the TUM team, are the latest of several efforts to apply deep learning methods to the analysis of mass spec proteomic data. For instance, in 2017, a team led by researchers from the Chinese Academy of Sciences published a paper in Analytical Chemistry presenting a deep neural network-based tool for prediction of peptide spectra called pDeep. In 2013, Ghent University researchers developed a machine learning-based tool called MS2PIP for predicting peptide ion fragment intensity. They have continued to enhance the tool since then, with the latest update published in Nucleic Acids Research in April.

In fact, efforts to use machine learning for prediction of peptide ion fragmentation patterns go back more than a decade, noted Alexey Nesvizhskii, professor of computational medicine and bioinformatics at the University of Michigan.

"There has been a lot of work trying to do predictions of [fragment ion] intensity using machine learning," he said. Nesvizhskii was not involved in the Max Planck or TUM work.

Jürgen Cox, group leader in computational systems biochemistry at Max Planck and senior author on one of the Nature Methods papers, said, though, that advances in computing power and deep learning methods, together with the availability of training data, had set the stage for increased use of such approaches.

In a typical proteomics experiment, peptides are fragmented to produce a set of fragment ions from which mass spectra are generated. These experimental spectra are then matched to a database of theoretical spectra, allowing researchers to identify the peptides and, ultimately, proteins in a sample.

Interpreting these spectra requires an understanding of how particular peptides fragment, and while researchers have a general understanding and ability to predict this process, predicting what ions will be produced at what levels of intensity remains a challenge. As such, many software tools for matching peptide spectra assume that all possible peptide ions are equally likely to be produced and at the same intensity.
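As a rough illustration of the equal-intensity assumption described above, the following minimal sketch computes singly charged b- and y-ion m/z values for a peptide and assigns every fragment the same intensity. The function name, the use of standard monoisotopic residue masses, and the restriction to unmodified peptides at charge 1 are assumptions made for this example; it is not the method of any particular search engine.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276  # mass of a proton (Da)
WATER = 18.010565  # mass of H2O (Da)


def theoretical_fragments(peptide):
    """Return (label, m/z, intensity) for singly charged b and y ions.

    Hypothetical helper: every ion gets intensity 1.0, mirroring the
    simplification that all fragments are equally likely.
    """
    masses = [MONO[aa] for aa in peptide]
    ions = []

    # b ions: N-terminal prefixes plus one proton.
    prefix = 0.0
    for i in range(1, len(peptide)):
        prefix += masses[i - 1]
        ions.append((f"b{i}", prefix + PROTON, 1.0))

    # y ions: C-terminal suffixes plus water plus one proton.
    suffix = 0.0
    for i in range(1, len(peptide)):
        suffix += masses[-i]
        ions.append((f"y{i}", suffix + WATER + PROTON, 1.0))

    return ions


if __name__ == "__main__":
    for label, mz, intensity in theoretical_fragments("PEPTIDE"):
        print(f"{label}\t{mz:.4f}\t{intensity}")
```

A theoretical spectrum built this way captures where fragment peaks may appear but says nothing about how tall they should be, which is the gap intensity-prediction tools aim to fill.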

Deep learning approaches offer the possibility of improved peptide-spectra matching by allowing researchers to train software to better understand the specific fragmentation patterns of specific peptides under particular conditions.
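To see why predicted intensities might sharpen peptide-spectrum matching, consider this toy comparison using cosine similarity. The intensity values below are invented purely for illustration, and cosine similarity is used here as a generic stand-in; the article does not specify which similarity measure either tool uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up intensities for five fragment ions of one hypothetical peptide.
observed = [0.10, 0.90, 0.30, 0.00, 0.60]   # measured spectrum
uniform = [1.0, 1.0, 1.0, 1.0, 1.0]         # all ions assumed equal
predicted = [0.15, 0.85, 0.25, 0.05, 0.55]  # hypothetical model output

if __name__ == "__main__":
    print("uniform model:  ", round(cosine_similarity(observed, uniform), 3))
    print("predicted model:", round(cosine_similarity(observed, predicted), 3))
```

In this contrived case the intensity-aware prediction scores much closer to the observed spectrum than the uniform one does, which is the kind of added discrimination the researchers describe.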

The TUM researchers trained their Prosit tool on their ProteomeTools resource, a synthetic peptide library containing 550,000 tryptic peptides and 21 million tandem mass spectra.

"We feed that [data] along with [peptide] retention times into a deep neural network machine learning algorithm to essentially predict from sequence alone what the tandem mass spectra would look like," said Bernhard Küster, chair of proteomics and analytics at TUM and an author on one of the Nature Methods papers. "Not only what fragment ions we would find, but also their relative intensities and the retention time of the peptides."

Having highly accurate predictions of ion fragment intensities allows for more confident assignment of spectra to peptides, which lets researchers identify more peptides and proteins from a given dataset, Küster said. When he and his colleagues used Prosit to rescore existing datasets, he noted, they were able to identify the same number of peptides but at a false-discovery rate (FDR) between 10-fold and 100-fold lower than was used in the original analysis.

"We rescue a lot of peptide identifications that otherwise get cut out by the classic target-decoy FDR cut-off," he said. "The reason why that works is that the Prosit algorithm leads to much better discrimination of target and decoys."
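The target-decoy cut-off Küster refers to can be sketched as follows. This is a simplified version that estimates FDR as the ratio of decoy to target matches above a score threshold; the function name and the input format are assumptions for this example, and production tools typically apply further refinements.

```python
def targets_at_fdr(psms, max_fdr=0.01):
    """Count target matches accepted at or below an estimated FDR.

    psms: iterable of (score, is_decoy) pairs, higher score = better match.
    FDR at each rank is estimated as decoys / targets seen so far
    (a common simplification, assumed here for illustration).
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    accepted = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep the deepest point in the ranking that still meets the limit.
        if targets > 0 and decoys / targets <= max_fdr:
            accepted = targets
    return accepted


if __name__ == "__main__":
    # Toy scores only: (score, is_decoy)
    toy = [(10, False), (9, False), (8, True), (7, False), (6, False)]
    print(targets_at_fdr(toy, max_fdr=0.25))
```

Under this scheme, a scoring function that pushes decoys further down the ranking lets more target matches pass at the same FDR limit, or the same number pass at a stricter one.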

The authors noted, however, that combining the Ghent University team's MS2PIP tool with the Percolator proteomic data processing software achieved similar identification rates, and Cox suggested with regard to his team's work that deep learning tools might ultimately prove most valuable not for improving peptide identification rates but for generating spectral libraries for DIA mass spec experiments.

DIA mass spec experiments typically require researchers to first generate a spectral library using standard data-dependent acquisition mass spec. The DIA spectra generated in subsequent experiments are then matched to the spectra in this database.

"I think it is really feasible to replace [experimentally generated] libraries," with deep learning spectral prediction tools, Cox said. "In the long run I don't think that we will be generating libraries anymore, which is actually the part that makes DIA a little bit work intensive, especially for smaller labs."

The software tool DIA-Umpire, developed by Nesvizhskii's lab in 2015, allows researchers to perform DIA experiments without first generating a spectral library, but Cox noted that it typically yields fewer protein identifications than conventional DIA approaches.

A 2016 study benchmarking different DIA approaches did, in fact, find that DIA-Umpire identified fewer peptides and proteins than more typical DIA methods, but it also found that the tool's performance improved significantly with the quality of the mass spectra generated, suggesting that it might make big gains as mass spec technology continues to improve.

Nesvizhskii said that he thought the possibility of using deep learning tools like Prosit or DeepMass:Prism to generate spectral libraries for DIA experiments was interesting but added that neither Nature Methods paper had "demonstrated the true practical utility" of such an approach.

He noted that a key advantage of DIA analyses is that instead of searching against a spectral library representing an entire proteome, researchers search against a library consisting only of peptides identified via an initial DDA analysis.

"So you are reducing your search space to peptides that you know are in the sample because they were previously identified in the same sample," he said, adding that the TUM and Max Planck work did not address this narrowing of the search space.

"I'm not saying that it won't work," Nesvizhskii said, "just that more work is needed to demonstrate that it is practically useful [for spectral library generation]."

Both the TUM and Max Planck teams plan to continue training their tools to deal with additional kinds of peptides and mass spec fragmentation modes.

Küster said he and his colleagues have built a library of around 150,000 synthetic HLA peptides that they plan to use to train the Prosit tool to predict fragmentation patterns. They are also developing the tool to predict fragmentation of post-translationally modified peptides.

Cox said his group is also looking at expanding the tool to post-translationally modified peptides and non-tryptic peptides.

"We are probably going to update our model on a regular basis, every few months make a new release," he said. "We will try [to] train [it] on everything we can get from [proteomic data repositories]."

For the DeepMass:Prism work presented in the Nature Methods paper, the researchers used 25 mass spec datasets containing more than 60 million tandem mass spectra.

Cox said the DeepMass:Prism tool will be incorporated into the MaxQuant proteomics software package developed by his lab. The TUM team's Prosit tool is available as part of the university's ProteomicsDB resource, Küster said.
