This is the third of four articles surveying leading proteomics researchers about the most notable achievements in proteomics during the 2010s. Part 1 can be found here, part 2 here, and part 4 here.
NEW YORK – Over the last 10 years, advances in the speed and throughput of mass spectrometry platforms have led to a tremendous increase in the amount of data generated by proteomics experiments.
Data alone isn't particularly useful, though. Along with a series of instrument advances, developments in bioinformatics transformed proteomics throughout the previous decade.
The rise of data-independent acquisition mass spectrometry neatly illustrates this fact. While implementation of DIA required a jump in instrument speed, it also needed informatics innovation, and DIA software development was a major area of focus throughout the 2010s.
Brett Phinney, head of the proteomics core at the University of California, Davis, selected a recent DIA approach developed by the University of Washington's Michael MacCoss and colleagues as his pick for the most significant proteomics innovation of the decade, adding that it produced "some of the most exciting data I have seen in 10 years."
Conventional DIA approaches use an initial data-dependent acquisition mass spec run to generate spectral libraries. Subsequent DIA runs are then analyzed by matching peptide fragmentation patterns and retention times from the DIA data to the DDA-generated spectral libraries.
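In broad strokes, that matching step amounts to comparing the fragment intensities extracted from the DIA data against a library entry that elutes at roughly the expected retention time. The following is a minimal sketch of the idea under assumed, made-up data structures and tolerances; the function names and example values are invented for illustration and are not taken from any particular DIA tool:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score_against_library(dia_fragments, dia_rt, library_entry,
                          rt_tolerance=2.0, mz_tolerance=0.02):
    """Score one set of co-eluting DIA fragment ions against one library entry.

    dia_fragments: dict mapping fragment m/z -> intensity extracted from the DIA data
    dia_rt:        retention time (minutes) at which those fragments co-elute
    library_entry: dict with 'fragments' (m/z -> relative intensity) and 'rt'
    """
    # Discard candidates whose library retention time is too far from the observation.
    if abs(dia_rt - library_entry["rt"]) > rt_tolerance:
        return 0.0

    # Pair each library fragment with the closest observed fragment within the
    # m/z tolerance, using zero intensity when nothing matches.
    lib_int, obs_int = [], []
    for lib_mz, rel_intensity in library_entry["fragments"].items():
        closest = min(dia_fragments, key=lambda mz: abs(mz - lib_mz), default=None)
        matched = closest is not None and abs(closest - lib_mz) <= mz_tolerance
        lib_int.append(rel_intensity)
        obs_int.append(dia_fragments[closest] if matched else 0.0)

    return cosine_similarity(lib_int, obs_int)

# Toy example: a three-fragment library entry and a closely matching DIA observation.
library_entry = {"rt": 35.2, "fragments": {204.13: 1.0, 361.21: 0.6, 532.29: 0.3}}
dia_fragments = {204.14: 8500.0, 361.20: 5100.0, 532.30: 2400.0}
print(round(score_against_library(dia_fragments, 35.8, library_entry), 3))
```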
Transferring retention times and fragmentation patterns from one platform to another is challenging, however, meaning that researchers often need to generate new spectral libraries for each experiment. Additionally, DIA analyses are limited to whatever peptides were detected in the DDA run used to build the library.
The MacCoss team's approach uses DIA data to build ion chromatogram libraries composed of peak shape and retention time that can be used to calibrate spectral libraries to particular mass spec and chromatography systems, allowing for effective sharing of spectral libraries. In an analysis of human and yeast cell lysates, they found the approach resulted in a 20 percent to 25 percent increase in identified peptides compared to traditional spectral library-based DIA methods.
MacCoss himself highlighted several other publications that he said were key to the development of DIA informatics throughout the decade, starting with a 2012 paper in Molecular & Cellular Proteomics by ETH Zurich professor Ruedi Aebersold and colleagues that initially laid out the SWATH DIA strategy, the first DIA approach to gain widespread adoption.
Also influential, he said, was a 2016 study in which researchers at the University of Mainz in Germany presented a tool called LFQBench for benchmarking the performance of different DIA software packages.
That paper, MacCoss said, represented the first effort by the community to benchmark the various DIA approaches circulating at the time.
He also cited a pair of papers from his lab, one published in 2013 and the other in 2019, that presented software using either multiplexed or overlapping isolation windows to improve the precursor selectivity, and thus the sensitivity, of DIA experiments.
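The intuition behind the overlapping-window idea can be shown with a toy calculation; this is a simplified sketch of the general principle, not the actual demultiplexing algorithm used in the lab's software. A fragment signal that appears in two isolation windows offset by half their width can only come from a precursor in their overlap, effectively narrowing the isolation width:

```python
def overlap_region(window_a, window_b):
    """Return the m/z range shared by two isolation windows, or None if they don't overlap."""
    low = max(window_a[0], window_b[0])
    high = min(window_a[1], window_b[1])
    return (low, high) if low < high else None

# Two 20 m/z isolation windows from consecutive cycles, offset by half a window width.
cycle_1_window = (500.0, 520.0)
cycle_2_window = (510.0, 530.0)

# A fragment seen in both windows must come from a precursor in the 10 m/z overlap,
# so precursor selectivity is doubled relative to a single 20 m/z window.
print(overlap_region(cycle_1_window, cycle_2_window))  # (510.0, 520.0)
```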
Deep learning and machine learning
Phinney also noted the importance of deep learning for data analysis, echoing comments by several other respondents who cited the emergence of either deep learning or machine learning more generally as key developments in proteomics over the last decade.
"The one thing that struck me as revolutionary in the past 10 years is the in-earnest application of machine learning to proteomics data analysis," said Lennart Martens, group leader of the computational omics and systems biology group in the VIB-UGent Center for Medical Biotechnology.
He gave three reasons for his enthusiasm for this technology, the first of which he characterized as "unashamed bias," based on the fact that his lab steadily worked on such methods throughout the decade.
"The second reason is the adaptability of these approaches to the experimental conditions and instruments," he said. "This is important because proteomics, as a technology-driven field, has always been focused on getting the most out of the data that the instrument can acquire, which requires an understanding and prediction of the behavior of our analytes in these instruments. And that is precisely where machine learning methods shine."
"There's currently a bit of a hidden battle between old-school, handcrafted identification scoring functions and new, machine learning scoring functions, but it is already clear that the machine learning systems will win this contest," he added.
Finally, Martens noted that "proteomics might change dramatically over the next ten years," and might even "ditch the mass spectrometer as the instrument of choice to a lesser or greater extent."
"This will change just about everything about how we do proteomics, but it will not change the need for adaptive, machine learning-based algorithms to process the data generated from such new protein-sequencing approaches," he said. "If anything, these new approaches will need to rely on machine learning methods much more."
Indeed, emerging protein analysis approaches like nanopore-based protein sequencing commonly use machine learning to identify analytes. Last year, for instance, researchers at the Israel Institute of Technology published a simulation of nanopore-based protein sensing that indicated that nanopore measurements combined with deep learning data analysis could enable proteome-scale studies.
This echoed 2017 work by researchers at the University of California, San Diego and the University of Notre Dame that likewise found that machine learning analysis of nanopore protein data could enable large-scale proteomic studies.
Bernhard Küster, professor of proteomics and bioanalytics at the Technical University of Munich (TUM), also highlighted machine learning as a key development, noting that he believes that such technology "will change the face of proteomics informatics in a profound way, and pretty soon."
In April, Küster and his colleagues presented a software package called Prosit that uses deep learning to improve the matching of experimentally generated peptide spectra to the theoretical spectra contained in databases used for making protein identifications.
In a typical proteomics experiment, peptides are fragmented to produce a set of fragment ions from which mass spectra are generated. These experimental spectra are then matched to a database of theoretical spectra, allowing researchers to identify the peptides and, ultimately, proteins in a sample.
Interpreting these spectra requires an understanding of how particular peptides fragment, and while researchers have a general understanding and ability to predict this process, predicting what ions will be produced at what levels of intensity remains a challenge. As such, many software tools for matching peptide spectra assume that all possible peptide ions are equally likely to be produced.
Deep learning approaches offer the possibility of improved peptide-spectra matching by allowing researchers to train software to better understand the fragmentation patterns of specific peptides under particular conditions.
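One common way such predictions are put to work is to score a peptide-spectrum match by the similarity between the model's predicted fragment intensities and the observed ones, for example using a normalized spectral contrast angle. The sketch below illustrates that scoring idea with made-up intensity values; it is not the scoring used by Prosit or any other specific tool:

```python
import math

def spectral_angle(predicted, observed):
    """Normalized spectral contrast angle between predicted and observed
    fragment intensity vectors (1.0 = identical, 0.0 = orthogonal)."""
    norm_p = math.sqrt(sum(x * x for x in predicted))
    norm_o = math.sqrt(sum(x * x for x in observed))
    if norm_p == 0 or norm_o == 0:
        return 0.0
    cos = sum(p * o for p, o in zip(predicted, observed)) / (norm_p * norm_o)
    cos = max(-1.0, min(1.0, cos))  # guard against floating-point drift
    return 1.0 - 2.0 * math.acos(cos) / math.pi

# Illustrative intensities for the fragment ions of one candidate peptide (arbitrary units).
predicted_intensities = [0.05, 0.40, 1.00, 0.30, 0.10, 0.55]  # from a trained model
observed_intensities  = [0.00, 0.35, 0.90, 0.25, 0.15, 0.60]  # from the acquired spectrum

# A high angle supports the peptide-spectrum match; an "all ions equal" assumption
# would score any candidate explaining the same fragment masses identically.
print(round(spectral_angle(predicted_intensities, observed_intensities), 3))
```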
Researchers at the Max Planck Institute of Biochemistry in Munich and Verily released a similar deep learning-based software package, DeepMass:Prism, around the same time; it has been incorporated into the MaxQuant proteomics software package developed by Max Planck researcher Jürgen Cox.
Additionally, in 2017, a team led by researchers from the Chinese Academy of Sciences published a paper in Analytical Chemistry presenting pDeep, a deep neural network-based tool for predicting peptide spectra. And in 2013, Ghent University researchers including Martens developed MS2PIP, a machine learning-based tool for predicting peptide fragment ion intensities that they have continued to enhance, with the latest update published in April.
Data sharing efforts
Robert Moritz, director of proteomics at the Institute for Systems Biology, took a somewhat broader view, saying that in his opinion, the most significant development in proteomics over the last decade wasn't a specific technique or breakthrough but rather "the coming together of the proteomic community to share, debate, and reuse data, and ultimately have these data deposited in large accessible databases."
"Efforts such as the various proteomic Atlases and the ProteomeXchange [Consortium] have propelled the field forward and allowed development of ever increasing ways to analyze data and provide high statistical validation of these data, making the community-sharing the most significant development of the last decade," he said.
Gilbert Omenn, professor of human genetics at the University of Michigan, likewise cited data sharing as a key development, highlighting the ProteomeXchange Consortium along with the PeptideAtlas and neXtProt resources and their work reanalyzing "all publicly available human proteomics mass spectrometry datasets with much-needed guidelines for credible detection and curation."
The dramatic spread of proteomic data sharing over the last ten years is perhaps even more notable given that the decade opened amid worries about the viability and sustainability of large proteomics data repositories. Specifically, one of the primary such resources, the University of Michigan-based Tranche repository, had begun cutting back its activities in late 2010 due to lack of funds.
The situation had become precarious enough that the journal Molecular & Cellular Proteomics had put on hold its mandate that all papers be accompanied by the submission of their raw mass spec data.
By 2015, the proteomics data storage community had become stable enough that MCP reinstated its raw data requirements as resources like the European Bioinformatics Institute's Proteomics Identifications Database, PRIDE, and the University of California, San Diego's MassIVE repository stepped in to fill the gaps. Additionally, the ProteomeXchange Consortium officially launched in 2011, providing a single framework and infrastructure through which researchers could access data from major repositories, improving coordination across these databases.
For Northwestern University professor Neil Kelleher, the decade's key advance was perhaps an even more foundational one: improvements in the estimation of the false discovery rates used to assess the validity of peptide assignments made in mass spec experiments.
"Everyone wants to race ahead, but setting a great baseline for operations is critical for people outside the field to embrace and value the output of the field we love," he said. "Not the most exciting of issues, but there were problems coming out [of] the 2000s, and in the 2010s we largely addressed them."