NEW YORK (GenomeWeb) – Stanford University researchers have developed software for unrestricted searching of protein post-translational modifications.
Called TagGraph, the software tool uses string-based searching of de novo protein sequence data combined with a probabilistic peptide validation approach to allow for more effective unrestricted searching of mass spec data, said Joshua Elias, an assistant professor of chemical and systems biology at Stanford University and senior author on a Nature Biotechnology study published this week describing the approach.
In conventional mass spec-based proteomic workflows, experimentally acquired mass spectra are searched against a reference database of predicted mass spectra that are derived from the gene sequence of the organisms being investigated. However, a wide variety of post-translational modifications, mutations, and splice forms are not included in typical reference databases, making them difficult to identify.
Identifying the full diversity of protein forms present in a sample would require exploring a vast search space, which is both computationally intensive and challenging in terms of identifying correct peptide-spectrum matches amidst a sea of plausible but ultimately incorrect matches.
The field has been moving forward on this problem in recent years, however, with several new open searching tools becoming available, including the MSFragger software released in 2017 by University of Michigan researchers and the Open-pFind tool released last year by researchers from the University of the Chinese Academy of Sciences.
TagGraph is the latest entrant to this growing field. The software is based on the idea that high quality mass spec data allows for accurate de novo sequencing of relatively long stretches of peptides and that these sequences can be rapidly searched against an indexed sequence database in an unrestricted manner to identify a subset of potential matches that can then be optimized using a graph-based alignment algorithm.
The notion is that "if we can just find the core of a particular de novo sequence that exists in the proteome, we should be able to reconcile the de novo sequence against the proteome sequence in a canonical database of proteins," Elias said. "That allows us to figure out [things like] isobaric substitutions and also post-translational modifications or mutations or even inserts and deletions."
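The seed-finding idea Elias describes can be illustrated with a minimal sketch. This is not TagGraph's implementation: the function names, the toy protein sequences, and the fixed seed length `k` are all assumptions made for illustration, and the graph-based alignment step that reconciles the mismatches is left out entirely.

```python
from collections import defaultdict

def build_kmer_index(proteins, k):
    """Map every length-k substring to the (protein, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index

def find_seed_matches(de_novo, index, k):
    """Return (protein, offset) candidates that share at least one exact k-mer
    with the de novo sequence. The offset is where the de novo peptide would
    start in the protein if the two were aligned without gaps."""
    hits = set()
    for j in range(len(de_novo) - k + 1):
        for name, pos in index.get(de_novo[j:j + k], ()):
            hits.add((name, pos - j))
    return sorted(hits)

# Hypothetical toy database and de novo read (illustrative sequences only).
proteins = {"P1": "MKTAYIAKQRQISFVK", "P2": "GGSLEDKPAYW"}
index = build_kmer_index(proteins, k=4)

# The de novo read has L where the protein has I (an isobaric substitution),
# but the exact core "AKQR" still anchors it to the right place in P1.
candidates = find_seed_matches("AYLAKQR", index, k=4)
print(candidates)  # [('P1', 3)]
```

In this toy case the candidate region `proteins["P1"][3:10]` is `AYIAKQR`, which differs from the de novo read at a single isobaric position. A subsequent alignment step would be responsible for explaining that difference.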
The next step, he noted, is scoring the matches made by the tool, which is a particular challenge for unrestricted searches, where the large search space means researchers may be faced with a number of good but incorrect matches for a given spectrum.
"You can get a result where the peptide matches the spectrum pretty well, but it could still be wrong in all sorts of different ways," Elias said.
In traditional database searching, peptide matches are scored by matching them against the reference database and then a decoy database of sequences known not to be present in the sample, which allows researchers to benchmark the false discovery rate or likelihood that the peptide matches they are making are false positives.
This approach works well in restricted searches limited to sets of expected peptides and modifications. However, it is less effective for unrestricted searching.
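As a rough illustration of the target-decoy idea, the sketch below estimates a false discovery rate at a score cutoff as the ratio of decoy hits to target hits. The scores, the cutoff, and the simple decoys/targets formula are assumptions for illustration; individual search tools use their own scoring schemes and variations on this estimate.

```python
def estimate_fdr(psms, threshold):
    """Estimate the FDR among matches scoring at or above the threshold.

    psms is a list of (score, is_decoy) pairs. Decoy hits above the cutoff
    stand in for the number of false target hits expected at that cutoff.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets

# Hypothetical peptide-spectrum match scores; True marks a decoy database hit.
psms = [
    (9.1, False), (8.7, False), (8.2, True), (7.9, False),
    (7.5, False), (6.0, True), (5.5, False),
]

print(estimate_fdr(psms, threshold=7.0))  # 0.25
```

With a cutoff of 7.0, four target matches and one decoy match pass, giving an estimated FDR of 25 percent. Raising the cutoff to 8.5 leaves two targets and no decoys.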
In TagGraph, Elias and his colleagues replaced the target-decoy approach with a hierarchical Bayes model that uses 14 "quantitative and categorical attributes," half of which "relate specifically to modified peptides," to score the likelihood that a given peptide-spectrum match is correct.
The researchers evaluated the approach on a dataset consisting of 25 million mass spectra generated from the analysis of tissue samples from 30 subjects, finding that in six days of computing time on a desktop computer they were able to identify 1.1 million unique peptides, triple the number initially identified by conventional analyses. They also identified a wide range of less abundant PTMs like N-terminal myristoylation, lysine hydroxylation, and arginine demethylation.
Elias said that unrestricted searching tools like TagGraph will allow researchers to "really view the long tail of all the different species that exist in a particular [proteome]," particularly as improved instrumentation allows researchers to look deeper into the proteome.
"Ten years ago, you could generate, say, 10,000 spectra per data set," he said. "That would get you a pretty good [amount of] protein IDs, and we may not actually get very many more [protein IDs] with the faster scanning instruments [of today]. But now if we can generate 10 times more spectra in the same amount of time, what are those spectra? They're going to be more forms of those proteins. So, we can really look for sub-stoichiometric modifications in a much larger set without really going through the pain of enriching them and looking at them all at once and seeing how they co-occur."
Lennart Martens, group leader of the computational omics and systems biology group in the VIB-UGent Center for Medical Biotechnology, said he believed proteomics was on the cusp of a major advance as unrestricted search tools improve to where they become feasible alternatives to traditional approaches.
"We are looking at a big and extremely interesting shift in the way we do proteomics," he said, adding that he predicted the move into unrestricted searching will make the field "much more interesting to a lot of people … because if there is one piece of information that is only accessible at the protein level, it is post-translational modification."
"What you are effectively seeing is a dramatic expansion in the capabilities of the entire field without having to change anything on the instrumentation side," he said, adding that he expected that unrestricted searching would become standard in proteomics experiments in the next several years.
In the meantime, Martens, who was not involved in the TagGraph work, said the field was working out what the optimal approaches for such searching would be.
He said he found the TagGraph tool to be an interesting option, noting that it combined existing elements of proteomic analysis such as de novo sequence searching with a Bayesian scoring model similar to that originally put forth by the PeptideProphet tool (which Elias likewise noted as the originator of this approach) to address the challenges of unrestricted searching.
Martens is one of the developers of an unrestricted search tool called Ionbot, which uses machine learning to better score peptide-spectrum matches. He and his colleagues have yet to publish on the method but have made it available to outside researchers and hope to have a study on its performance out this year.
Martens said that his team used Ionbot to analyze the same 30-sample dataset Elias and his co-authors analyzed using TagGraph, which provided an opportunity to look at where their analyses overlapped and where they differed.
For instance, Martens said the two tools found similar amounts of modifications including phosphorylation and formylation, which he noted is a highly abundant artifact in proteomics experiments. In the case of other modifications, the two tools produced highly divergent results. The Ionbot analysis found far more acetylation than TagGraph, Martens noted. On the other hand, TagGraph found more than twice as much methionine oxidation as Ionbot.
He said that such divergence between search tools was once common among conventional restricted search software packages, as well, but that over time these tools have converged to produce highly overlapping datasets.
Ten to 15 years ago, "overlap was on the order [of] 70 percent or so," he said. "But, over time everyone started tweaking their search algorithms … and nowadays the differences for normal shotgun proteomic datasets are on the order of a few percent. For every 4,000 identified peptides, there will [be] 50 or so that are unique to a particular [search] engine."
The differences between the open search results "are very interesting patterns to see, and I have no idea of who is right or wrong about them," Martens said. "But it is indicative of the fact that we aren't there yet."