
TUM Team Builds 200,000-Plus Peptide Library for Benchmarking, Optimizing Proteomic Research Methods


A team led by Technical University Munich researcher Bernhard Kuster has built a reference library containing more than 200,000 synthetic peptides and phosphopeptides that they said will enable the development, improvement, and evaluation of a wide variety of proteomic research approaches.

The first resource of its kind at this scale, the library will let proteomics researchers test their workflows and algorithms on a large scale against actual synthetic standards rather than computational and statistical models, Kuster told ProteoMonitor.

Comparison of experimental findings against synthesized molecules has traditionally been considered the gold standard in analytical chemistry, Kuster noted.

"For the identification of a substance, you synthesize it and compare physical properties [of the synthetic version] to what you have found in your discovery experiment," he said. If the two samples are the same across all parameters, then it can be concluded that they are in fact the same substance.

However, Kuster said, "this very basic principle has never really been applied in proteomics. We have all gone by the assumption that the experimental mass spectra will resemble that of a theoretical [spectra] we put together by calculating the b and y ions of the typical peptide fragmentation."

The synthetic peptide library generated by Kuster and his colleagues allows researchers to evaluate this assumption and the techniques and algorithms the field has developed to employ it – a "benchmarking," he said, "that we felt was highly overdue."
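The theoretical spectrum Kuster describes can be sketched in a few lines. The example below is a minimal illustration, not the method of any particular search engine: it assumes singly charged fragments, unmodified residues, and standard monoisotopic residue masses, and the function name fragment_ions is hypothetical.

```python
# Minimal sketch: singly charged b and y ions for an unmodified peptide.
# Assumptions (not from the article): monoisotopic masses, charge 1+,
# no modifications. fragment_ions is a hypothetical helper name.

RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton, Da
WATER = 18.010565   # monoisotopic mass of H2O, Da


def fragment_ions(peptide):
    """Return m/z lists for b1..b(n-1) and y1..y(n-1) at charge 1+."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]

    # b ions: N-terminal prefixes plus one proton.
    b_ions = []
    running = 0.0
    for m in masses[:-1]:
        running += m
        b_ions.append(running + PROTON)

    # y ions: C-terminal suffixes plus water plus one proton.
    y_ions = []
    running = 0.0
    for m in reversed(masses[1:]):
        running += m
        y_ions.append(running + WATER + PROTON)

    return {"b": b_ions, "y": y_ions}


if __name__ == "__main__":
    ions = fragment_ions("GAK")
    print("b ions:", [round(mz, 4) for mz in ions["b"]])
    print("y ions:", [round(mz, 4) for mz in ions["y"]])
```

A search engine scores an experimental spectrum by how well its peaks match a list like this one. A measured spectrum of the synthesized peptide shows directly how closely that calculated list resembles reality.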

To facilitate access to the library, the researchers have struck an arrangement with Thermo Fisher Scientific under which they will provide the company with the library and the company will distribute it to interested scientists for what Kuster said will be a small handling fee.

In a paper published this week in Nature Biotechnology, Kuster and co-authors including Max Planck Institute researcher Matthias Mann and Utrecht University researcher Albert Heck presented a variety of analyses using the library, evaluating different proteomic search engines, fragmentation methods, and tools for phosphorylation site localization.

Generally speaking, Kuster said, their findings indicated that the field's tools do "quite a good job" but that "there is still substantial room for improvement."

For instance, the researchers used the library to evaluate two commonly used proteomic search engines, Mascot and Andromeda. They found that the decoy database approaches conventionally used to calculate false discovery rates underestimated the true FDR, as measured using the synthetic libraries, by a factor of between 1.5 and three.
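The difference between a decoy-estimated FDR and a true FDR can be made concrete with a short sketch. It assumes the simplest form of the target-decoy estimate (decoy hits divided by target hits above a score threshold), which may differ from what Mascot or Andromeda actually compute. The PSM record and both function names are hypothetical, and the is_correct flag stands in for the ground truth that only a synthetic library can supply.

```python
# Minimal sketch: decoy-estimated FDR versus true FDR at a score cutoff.
# Assumptions (not from the article): the estimate is decoys / targets,
# and each peptide-spectrum match (PSM) carries a ground-truth flag that
# is only available because the peptide was synthesized.
from dataclasses import dataclass


@dataclass
class PSM:
    score: float
    is_decoy: bool     # matched a sequence from the decoy database
    is_correct: bool   # ground truth, known from the synthetic standard


def decoy_fdr(psms, threshold):
    """Estimate FDR as accepted decoy hits over accepted target hits."""
    accepted = [p for p in psms if p.score >= threshold]
    targets = sum(not p.is_decoy for p in accepted)
    decoys = sum(p.is_decoy for p in accepted)
    return decoys / targets if targets else 0.0


def true_fdr(psms, threshold):
    """Measure FDR as wrong target hits over all accepted target hits."""
    targets = [p for p in psms if p.score >= threshold and not p.is_decoy]
    wrong = sum(not p.is_correct for p in targets)
    return wrong / len(targets) if targets else 0.0
```

Running both functions on the same set of matches and dividing true_fdr by decoy_fdr gives the kind of underestimation factor the researchers report.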

The researchers "were actually happy to see that the published [identification] algorithms do an OK job," Kuster said, noting that the algorithms struggled most to conclusively identify peptide sequences that are very similar to other sequences – an issue that he said was already "acknowledged in the field."

More surprising was the researchers' finding that phosphorylated peptides were considerably easier for the algorithms to identify than unphosphorylated peptides – a result that counters the current conventional wisdom.

It turns out, Kuster noted, that the addition of a phosphate group shifts the mass of a peptide by a specific amount, which makes phosphorylated peptides easy to identify given the high mass accuracy of modern mass spectrometers.

"There is a notion in the field that [analysis] of phosphopeptides is difficult," he said. "But maybe the difficulty is more a matter of the natural abundance [of phosphoproteins] than the ability of the mass spectrometer to generate meaningful data."
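The mass-shift argument can be illustrated with a small check. Phosphorylation adds HPO3, a monoisotopic shift of about 79.96633 Da, a standard figure that the article itself does not state. The sketch below tests whether an observed peptide mass matches the unmodified mass plus that shift within a parts-per-million tolerance. The function names and the 5 ppm default are assumptions chosen for illustration.

```python
# Minimal sketch: does an observed mass match a singly phosphorylated
# form of a peptide? Assumptions (not from the article): neutral
# monoisotopic masses, one phosphate, and a 5 ppm tolerance.

PHOSPHO_SHIFT = 79.96633  # monoisotopic mass of HPO3, Da


def ppm_error(observed, expected):
    """Signed mass error in parts per million."""
    return (observed - expected) / expected * 1e6


def matches_phospho(unmodified_mass, observed_mass, tol_ppm=5.0):
    """True if observed_mass equals unmodified_mass plus one phosphate."""
    expected = unmodified_mass + PHOSPHO_SHIFT
    return abs(ppm_error(observed_mass, expected)) <= tol_ppm
```

At this tolerance, a 1,000 Da peptide carrying a phosphate is distinguishable from the same peptide carrying a sulfate (SO3, about 79.95682 Da), because the two modified masses differ by roughly 9 ppm.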

The researchers also used the library to compare collision-induced dissociation (CID) and electron transfer dissociation (ETD), determining that CID allowed for the identification of more peptides and phosphopeptides.

ETD did provide an advantage, though, in terms of phosphosite localization, Kuster noted. "Although ETD gets fewer [IDs] of everything, for the phosphopeptides that it does get, it gets the localization very right, so there is a reason for using ETD when it comes to phosphopeptides."

In general, Kuster and his co-authors found that the phosphopeptide localization tools MD Score, PTM Score, and phosphoRS all underestimated the true false localization rates, suggesting, he said, room for improvement. Absolute errors, however, "were small for the vast majority of the data," they wrote.

The peptide library "is an incredibly useful resource," Paul Rudnick, a researcher at the National Institute of Standards and Technology, told ProteoMonitor. "I think this will be the perfect dataset for algorithm comparison or tuning-type exercises that will really allow us to do better at providing confidence values that are really meaningful."

Rudnick was not involved in the Nature Biotechnology study but has obtained the study's dataset and plans to use it for building reference spectral libraries.

"This type of dataset is perfect for building mass spectral libraries," he said. "We always prefer things to be in as purified [a] form as possible so we have a handle on any impurities, so this is much better than looking at cell lysates or something."

Rudnick added that his confidence in the dataset's phosphosite localization data would enable him to include phosphopeptides in the reference library for the first time.

He also plans to use the data as a training set for selecting or improving phosphosite localization algorithms for use within the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium, he said.

The actual synthesis of the peptide libraries was a fairly straightforward process taking roughly a week, Kuster noted. More complicated was the design work required beforehand to make sure the libraries were suited to answering the questions the researchers wanted to ask.

For instance, he said, they wanted the libraries to be representative of the sequence and phosphorylation diversity of an actual human proteome while avoiding the bias toward certain kinds of phosphorylation observed in experimental studies.

Looking ahead, Kuster said he and his colleagues may make new libraries to help answer additional questions. For instance, he said, they are currently considering generating collections of peptides featuring more than one phosphorylation, as well as a library of positional isomers of phosphopeptides.
