A team led by researchers at the University of California, San Diego has developed a method for the large-scale identification of complex post-translational modifications.
The technique uses synthetic peptide libraries as training sets for peptide identification algorithms, allowing the algorithms to learn PTM-specific fragmentation patterns and thereby improving their subsequent performance on actual biological samples.
In a paper published last month in the Journal of Proteome Research, the researchers applied the method, which they termed Specialize (for Spectra of complex PTModified peptides identification tool), to SUMOylation. They found that, compared to current approaches, it improved identification of SUMOylated peptides by 80 percent to 300 percent, depending on the complexity of the dataset being analyzed.
The technique is aimed at what UCSD researcher and study author Nuno Bandeira termed "complex" PTMs, modifications that can contribute their own fragments with their own fragmentation patterns to an MS/MS spectrum.
Most software for looking at PTMs "expects that the PTM is just inducing a change in the mass of the [modified] amino acid, either by, in most cases, increasing [the mass], or sometimes, decreasing it," Bandeira told ProteoMonitor.
What these programs typically don't consider, however, is "the possibility that the modification creates new fragments in the spectrum," he said.
Both SUMOylation and glycosylation, for instance, can create fragments that are large enough and reactive enough that they can take "a very prominent place in the spectrum, which can then throw off the traditional [peptide] identification tools," Bandeira said.
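The gap Bandeira describes can be illustrated with a small Python sketch. It is not the Specialize algorithm; it only contrasts a conventional "mass shift only" model of a modified peptide with one that also expects fragments from the modification itself. The peptide, the 100 Da mass shift, and the three modification-derived peak positions are invented placeholders, not values for SUMOylation or any real PTM.

```python
# Monoisotopic residue masses (Da) for the amino acids used below.
RESIDUE_MASS = {
    "P": 97.05276, "E": 129.04259, "T": 101.04768,
    "I": 113.08406, "D": 115.02694, "K": 128.09496,
}
WATER = 18.01056
PROTON = 1.00728


def by_ions(peptide, mod_pos=None, mod_delta=0.0):
    """Singly charged b- and y-ion m/z values; the PTM is only a mass shift."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    if mod_pos is not None:
        masses[mod_pos] += mod_delta
    ions = []
    for i in range(1, len(masses)):
        ions.append(sum(masses[:i]) + PROTON)          # b ion (N-terminal piece)
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y ion (C-terminal piece)
    return ions


def explained_fraction(observed, theoretical, tol=0.02):
    """Fraction of observed peaks within `tol` Da of some theoretical peak."""
    hits = sum(
        1 for mz in observed
        if any(abs(mz - t) <= tol for t in theoretical)
    )
    return hits / len(observed)


# ASSUMPTIONS: all four values below are hypothetical illustration values.
PEPTIDE = "PEPTIDEK"
MOD_POS = 7                            # modified lysine (0-based index)
MOD_DELTA = 100.0                      # made-up mass shift in Da
TAG_FRAGMENTS = [150.0, 300.0, 450.0]  # made-up peaks from the PTM itself

backbone = by_ions(PEPTIDE, MOD_POS, MOD_DELTA)
observed = backbone + TAG_FRAGMENTS    # idealized spectrum of a "complex" PTM

mass_shift_only = explained_fraction(observed, backbone)
with_tag_model = explained_fraction(observed, backbone + TAG_FRAGMENTS)
print(f"mass-shift model explains {mass_shift_only:.2f} of peaks")
print(f"model with PTM fragments explains {with_tag_model:.2f} of peaks")
```

In this toy case the mass-shift model leaves the three modification-derived peaks unexplained. In a real spectrum such peaks can be among the most intense, which is the situation Bandeira said can throw off traditional identification tools.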
This, he noted, has presented something of a chicken-and-egg problem for the field. Researchers studying such modifications would like better mass spec software tools with which to identify them. However, in order to train better algorithms, software developers need large sets of identified modifications to work with.
"So you end up with this problem where you can't have the [software] tool without the spectra, and you can't have the spectra without the tool," Bandeira said.
To get around this, he and his colleagues turned to combinatorial peptide synthesis, which allowed them to generate libraries of modified peptides – SUMOylated peptides in the case of the JPR study – on which they could train their algorithm.
"With a very inexpensive synthesis – between a couple hundred to just under a thousand dollars depending on the modification of interest – we can generate a high diversity of peptides containing the modification of interest," Bandeira said. And this, it turns out, "is enough [for the algorithm] to learn the [modification's] fragmentation properties, and then use it to look for the PTM in larger datasets from a biological source."
The researchers are able to make the initial putative identifications of these synthesized PTMs due to the small size of the search space, Bandeira noted.
"We know what we are putting in, so it's not like we have to search a very large sequence space to know what these peptides are," he said. "There are artifacts from the synthesis process, so we still run a database search against only the synthetic sequences, but there are so few possible candidates against any spectra that it's not likely there will be many false positives in that identification."
After two passes through the synthetic library, the algorithm was ready for analysis of actual experimental datasets, Bandeira said. He said that an additional training run through biological data with higher sequence diversity would have been desirable but that in the case of SUMOylation there was not enough publicly available data for such a run.
"Now that we have a tool to do the initial annotations, though, it's just a matter of accumulating enough until we're able to do that final step of training," he said.
Bandeira said that he and his colleagues chose SUMOylation as the first modification to study using Specialize because its strong pattern of fragmentation makes it relatively easy to model. Their likely next target, glycosylation, could prove more difficult, however, due to interference from the glycan fragments.
"The way we fragment glycopeptides, we typically get either peptide fragments or glycan fragments. We don't tend to get much of a mixture of that," he said. "So it may not be possible to identify peptides [in cases where] we are only seeing fragments from the glycan."
Provided they are able to obtain the necessary spectra from their mass spec analyses, though, Bandeira said he expects they would be able to train their algorithm to recognize essentially any type of modification.
"What this pilot project taught us is that ... even though the sequence diversity [of the synthetic libraries] is not as large as that of biology, obviously, it is diverse enough that we can change a generic algorithm to do identification on large datasets," he said. "So I can't imagine why it would be any different for any other modifications."
In addition to glycosylation, Bandeira said he and his team are interested in using the method to identify phosphopantetheinylation, as well as SUMOylation in organisms other than humans.
They also aim to integrate the approach into the UCSD-based Mass Spectrometry Interactive Virtual Environment (MassIVE) mass spec data repository.
"That's something that we would really like to do and are putting together," Bandeira said. "Any dataset that comes to the repository, and older datasets to the extent that they are recoverable, we would like to run through a pipeline that applies all these tools to it – spectral library search, database search, modification search, spectral alignment."
"Instead of a dataset just going [into the repository] and sitting waiting for someone to become interested and download it, we want it to come alive, so that as new tools are developed we can go back and reanalyze this data to keep getting more out of it," he said.