Researchers Build Complete Synthetic Human Proteome


NEW YORK (GenomeWeb) – A team led by researchers from the Technical University of Munich has completed a mass spec analysis of a synthetic peptide collection covering tryptic peptides from essentially all canonical human proteins.

The researchers have compiled the data from this analysis, which covers more than 330,000 tryptic peptides, in an online resource called ProteomeTools that could aid proteomic studies in a variety of ways, said TUM researcher Bernhard Kuster, one of the leaders of the effort and senior author on a Nature Methods paper published this week detailing the work.

Broadly speaking, the resource provides the proteomics community with a set of established standards against which it can compare experimental data. As Kuster noted, such an approach is typical in analytical chemistry, but has not been in shotgun proteomics, mostly due to the large number of molecules measured in a typical experiment.

Instead, experimental data — typically mass spectra — is matched to predicted spectra generated using in silico analyses based on the underlying genetic sequences of the samples being studied.

"In proteomics today we are doing everything by inference," Kuster said. "We have a tandem mass spectrum and we use a computer algorithm to match it to a peptide sequence that [is generated] in silico to simulate what their spectrum might look like without us actually knowing what it looks like. That is a very fundamental problem."

He noted that the high analytical performance of mass spectrometry and error correction approaches like false discovery rates have made proteomics quite effective at identifying and quantifying proteins, but suggested that a collection of peptide standards like ProteomeTools could further improve the field.

As he and his co-authors noted in the Nature Methods paper, the statistical approaches used for conventional peptide identification "invariably represent compromises in terms of the sensitivity and specificity with which proteins are identified from complex mixtures."

Kuster suggested several ways researchers might use the resource. For instance, it could be useful for confirming peptide identifications in borderline cases, he said. "Because the spectra for these synthetic peptides are available to everyone, you could look up a protein or peptide ID that you find exciting but where the [experimental] data might not totally convince you as to whether it is true or not."

Kuster added that the resource could allow the field to move away from conventional database searching methods towards a spectral matching approach. He noted that this shift is already ongoing in proteomics as data-independent acquisition methods, which use spectral matching, grow in popularity.

However, he said, because the spectral libraries used in DIA experiments are typically generated using an initial data-dependent mass spec run, these spectral library approaches confront the same question as a conventional discovery experiment using a database.

"You're suffering from the same problem of what you do with the marginal data from your discovery experiments, and how do you make sure these [low-quality spectra] don't populate your spectral libraries, which could then give rise to false matches," he said.

Kuster noted that in the course of their work, he and his colleagues found peptides in spectral libraries generated this way that appear to have been incorrectly identified. "In one of the supplements to the paper, we have an example of a spectrum in a spectral library where it really doesn't look like it is the peptide that these guys thought it was," he said.
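As a rough illustration of the spectral matching idea discussed above, the sketch below scores an experimental spectrum against a reference spectrum using cosine similarity over binned peaks. This is one common way to compare spectra, not necessarily the scoring used by ProteomeTools or by any particular DIA software; the function names, the 0.02 m/z bin width, and the peak values are all assumptions made for the example.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=0.02):
    """Sum peak intensities into fixed-width m/z bins.

    `peaks` is a list of (mz, intensity) pairs. The bin width is an
    illustrative assumption, not a recommended setting.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return dict(binned)


def cosine_similarity(peaks_a, peaks_b, bin_width=0.02):
    """Return the cosine similarity (0.0 to 1.0) of two binned spectra."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    # Only bins present in both spectra contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_a * norm_b)


# Made-up peak lists, for illustration only.
reference = [(175.119, 100.0), (288.203, 50.0), (401.287, 25.0)]
observed = [(175.119, 80.0), (288.203, 45.0), (512.400, 10.0)]
score = cosine_similarity(reference, observed)
```

A score near 1.0 means the observed fragment pattern closely resembles the reference, while a score near 0.0 means the two spectra share few peaks. In practice a marginal identification would be judged on retention time and other evidence as well.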

Andreas Huhmer, global marketing director for mass spectrometry solutions at Thermo Fisher Scientific, which is supporting the ProteomeTools project, suggested the resource could be particularly useful for functional proteomics studies where researchers are trying to observe proteins of interest in particular tissues or cells.

"The obvious thing to do is to actually generate synthetic standards and generate them on a large scale so you have them not only [to make identifications] but so that you could actually use it to develop tools, either targeted assays or as spectral libraries for validation and confirmation," he said.

"Now that you actually have synthesized the peptides, you know exactly what its retention time is on a column, you know exactly what its spectrum looks like in an Orbitrap with different fragmentation energies, you essentially have an address and a relative retention time for every protein in the proteome," added Huhmer, who was a co-author on the Nature Methods paper. "This particular resource will drive proteomics in a very targeted direction and gives us a tool for a lot of functional studies."

Kuster and his colleagues used a two-tier system to determine which peptides to synthesize as part of the project, first selecting around 150,000 peptides from some 15,000 proteins that are consistently and repeatedly seen. For the rest of the peptides, they used in silico digestion of the underlying gene sequences to determine the peptides they should synthesize.
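For readers unfamiliar with in silico digestion, the sketch below shows the general idea in Python. It assumes the commonly used trypsin rule (cleave after lysine or arginine unless the next residue is proline); the article does not say which rules, length limits, or missed-cleavage settings the ProteomeTools team applied, so the function names, parameters, and example sequence here are hypothetical.

```python
def cleavage_fragments(sequence):
    """Split a protein sequence at every assumed tryptic cleavage site.

    Assumed rule: cut after K or R, except when the next residue is P.
    """
    fragments = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])  # C-terminal remainder
    return fragments


def tryptic_digest(sequence, missed_cleavages=0, min_length=7):
    """List peptides with up to `missed_cleavages` skipped cut sites.

    The default minimum length is an illustrative assumption.
    """
    fragments = cleavage_fragments(sequence)
    peptides = []
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


# Made-up sequence, for illustration only.
peptides = tryptic_digest("MAGKTLLRPEEDKSV", missed_cleavages=1, min_length=5)
```

Each peptide returned this way is a candidate for synthesis. A real selection would also have to weigh factors such as peptide uniqueness and how well a peptide can be detected by mass spec.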

The resulting 330,000 peptides represent all the human gene products in the UniProt database. The ProteomeTools resource contains spectra for each peptide, generated on a Thermo Fisher Orbitrap Fusion Lumos using five different fragmentation methods and, in the case of higher-energy collisional dissociation, six different collision energies.

The project aims to ultimately generate around 1.4 million synthetic peptides, including peptides covering various post-translational modifications and genetic variants as well as non-tryptic peptides, Kuster said. He suggested that these additional peptides, particularly the genetic variants, could prove useful for the growing proteogenomics field, which in many cases uses proteomic data to complement and enhance genomic analyses.

"The proteogenomic community often faces the problem of detecting genomic variants at the protein level," he said. "It's actually very hard to do, but if we had synthetic reference spectra for them, then it becomes much easier to validate or invalidate such variants."

The resource could also be useful for nailing down the status of so-called "missing" peptides, molecules predicted by the genetic code but for which no experimental evidence exists.

For instance, one question regarding such molecules is whether they have gone undetected because they are not actually produced at the protein level or because they have physical characteristics that make them difficult to detect via mass spec.

"If we make synthetic peptides that we never see experimentally and we are able to see them, then it suggests that they are not produced [endogenously] in the first place," Kuster said.

If, on the other hand, researchers are unable to detect the synthetic peptides using mass spec, it leaves open the possibility that these peptides are, in fact, being produced endogenously, but that they cannot be detected via mass spec.

Kuster said he expected that generating the full set of 1.4 million synthetic peptides and characterizing their spectra would cost between €3 million and €4 million ($3.2 million to $4.3 million). JPT Peptide Technologies is synthesizing the peptides for the project.

The ProteomeTools effort is funded by several private and public sources, including Thermo Fisher, SAP, and the German government.