NEW YORK (GenomeWeb) – A team led by researchers from the University of the Chinese Academy of Sciences has developed a proteomic search tool that enables open searching of mass spec data.
The search tool, named Open-pFind, allows users to search for unexpected modifications and other phenomena that are difficult to identify using conventional mass spec search tools.
In a study published this week in Nature Biotechnology, the developers used the software to analyze four large-scale mass spec proteomic datasets and found that it identified between 70 and 85 percent of the spectra in these datasets, for a total of 14,064 proteins, while outperforming seven existing search engines in terms of precision and speed.
In conventional mass spec-based proteomics, experimentally acquired mass spectra are searched against a reference database of predicted mass spectra derived from the gene sequences of the organisms being investigated. However, a wide variety of post-translational modifications, mutations, and splice forms are not included in typical reference databases, making them difficult to identify.
One of the challenges of identifying the diversity of protein forms present in biological samples is the enormous search space required to capture them all. However, as researchers have become increasingly interested in measuring not just the different proteins in a sample but also the different protein isoforms, demand has grown for tools capable of exploring a wide range of protein forms.
This has led a number of groups to pursue open search tools. Last year, for example, University of Michigan researchers released an open search platform called MSFragger. Also, at last week's annual meeting of the Human Proteome Organization, researchers from VIB-UGent in Belgium presented a new open search tool called Ionbot.
Open-pFind uses a tag-indexing approach to allow for efficient open searching, said Hao Chi, an associate professor at the Chinese Academy of Sciences and the first author on the paper. In this approach, short peptide tags, typically consisting of around five amino acids, are extracted from each spectrum and then searched against a protein database indexed by those tags.
"After finding the matched positions in the database, peptide candidates are generated by extending each of the matched tags to a full-length peptide sequence," Chi said. "Peptides that fit at least one flanking mass of the tag are considered, and the mass shift on the other side is considered a potential modification if the mass appears in the given modification list. All modified peptides with different modification site localizations are generated and then scored for a given spectrum."
The tool then selects the peptide with the best score for each spectrum.
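The tag-lookup and extension steps Chi describes can be illustrated with a small Python sketch. This is not Open-pFind's code: the function names, the toy protein, the two-entry modification list, and the mass tolerance are all assumptions made for illustration. The sketch also ignores water and proton masses, enzyme specificity, and scoring, and treats each flanking mass as a plain sum of residue masses.

```python
from collections import defaultdict

# Monoisotopic residue masses in daltons for a handful of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "R": 156.10111,
}
TAG_LEN = 5  # tag length cited in the article


def mass(seq):
    """Sum of residue masses (simplification: no water or proton mass)."""
    return sum(RESIDUE_MASS[aa] for aa in seq)


def build_tag_index(proteins):
    """Map every TAG_LEN-mer to the (protein, offset) positions where it occurs."""
    index = defaultdict(list)
    for name, seq in proteins.items():
        for i in range(len(seq) - TAG_LEN + 1):
            index[seq[i:i + TAG_LEN]].append((name, i))
    return index


def extend(seq, start, step, target, tol):
    """Walk away from the tag; yield (residues used, unexplained flanking mass)."""
    yield 0, target
    total, count, pos = 0.0, 0, start
    while 0 <= pos < len(seq):
        total += RESIDUE_MASS[seq[pos]]
        if total > target + tol:  # overshot the flanking mass
            break
        count += 1
        pos += step
        yield count, target - total


def match_modification(shift, mods, tol):
    """Return the name of a listed modification matching the mass shift, if any."""
    for mod_name, mod_mass in mods.items():
        if abs(shift - mod_mass) <= tol:
            return mod_name
    return None


def find_candidates(tag, left_mass, right_mass, proteins, index, mods, tol=0.01):
    """Keep peptides that fit one flank exactly; explain the other by a listed mod."""
    found = []
    for name, i in index.get(tag, []):
        seq = proteins[name]
        lefts = list(extend(seq, i - 1, -1, left_mass, tol))
        rights = list(extend(seq, i + TAG_LEN, 1, right_mass, tol))
        for n_left, d_left in lefts:
            for n_right, d_right in rights:
                fits_left, fits_right = abs(d_left) <= tol, abs(d_right) <= tol
                if fits_left and fits_right:
                    mod = None  # unmodified peptide
                elif fits_left or fits_right:
                    shift = d_right if fits_left else d_left
                    mod = match_modification(shift, mods, tol)
                    if mod is None:
                        continue
                else:
                    continue
                found.append((name, seq[i - n_left:i + TAG_LEN + n_right], mod))
    return found


# Toy example: one protein, one tag, a phosphorylation-sized shift on the left flank.
proteins = {"P1": "GASPVTLNDKER"}
index = build_tag_index(proteins)
mods = {"Phospho": 79.96633, "Oxidation": 15.99491}
hits = find_candidates("PVTLN", mass("GAS") + mods["Phospho"], mass("DK"),
                       proteins, index, mods)
print(hits)
```

In this toy case the right flank fits exactly and the left flank carries an unexplained mass that matches the listed phosphorylation, so a single modified candidate survives. A real engine would go on to enumerate modification sites and score each candidate against the spectrum, as Chi describes.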
By first matching peptide tags from experimental spectra to peptides in the database containing those tags, the researchers are able to narrow down their search space prior to scoring potential matches. They can then perform a wider-ranging search in terms of potential modifications or other alterations in this restricted search space.
Chi said that one key to the technique is the use of relatively long (for indexing purposes) tags of five amino acids.
"A longer tag is more specific to one peptide, [which means] the search engine can spend a short amount of time extracting these longer tags and more efficiently search the database," he said.
Lennart Martens, group leader of the computational omics and systems biology group in the VIB-UGent Center for Medical Biotechnology, compared the approach to an hourglass.
"You take the tag as a means to limit the number of possible peptide hits, and then you expand that again by allowing open modifications and mutations, and then you have a scoring engine that works like a normal scoring engine on this new database," he said. "So you get a reduction step in the database size and then an expansion step."
Martens, who was not involved in the Open-pFind work and is one of the developers of the VIB-UGent Ionbot tool, noted that the peptide tag search approach was initially developed in the lab of Max Planck researcher Matthias Mann. It was originally implemented as a method for reducing protein database search spaces by Stellenbosch University professor David Tabb, who was then a researcher in the lab of Scripps Research Institute professor John Yates III.
In open searching, "there is a huge combinatorial mess that makes your input database very big, which is very hard to handle algorithmically — there are just too many things you have to calculate," Martens said. "So using these tags as a prefilter is a great way of making that space manageable."
Martens said he thought the Open-pFind tool was a nice effort that he believed would draw interest from researchers looking to do open searching, but he noted that it did not appear to address what he considered a major outstanding problem in the field: the lack of a good scoring function for evaluating peptide-spectrum matches in open searching.
While the tag-based filtering approach reduced the initial search space, opening that restricted space up to a wide range of peptide alterations means researchers now "have a lot of possible open modification peptides as targets," he said. "And the trick is to separate those targets from each other, and that requires a very sensitive scoring function."
"The scoring function [in Open-pFind] doesn't really go conceptually beyond what current [conventional] search tools use," Martens said, noting that in his opinion, this critique also applies to MSFragger.
"So in that respect, the jury is still out on how effective it will be in actually calling, 'Is it this modification on this peptide rather than that modification on that peptide?'" he said.
Martens added that his team's Ionbot tool takes a fundamentally new approach to scoring peptide-spectrum pairs, using machine learning to evaluate matches.
"We define what the spectrum should look like, given the peptide that is a potential identification, and then we compare the spectra both in terms of m/z and intensity, and we give those features to a machine-learning algorithm and the outcome of that is highly reliable," he said, noting that he believes this approach will improve the reliability of peptide-spectrum scoring in open searching.
He and his colleagues have made the tool available to outside researchers, several of whom are currently testing its performance. They haven't yet published a paper on the tool, but Martens said he hoped to have a study on it out in the next three to four months.
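Since Ionbot's method had not been published at the time of writing, the following Python sketch only illustrates the general idea Martens outlines: compare a predicted spectrum with the observed one in both m/z and intensity, and turn the comparison into features a machine-learning model could consume. The function name, the three features, and the tolerance are hypothetical and are not taken from Ionbot.

```python
import math


def match_features(predicted, observed, tol=0.02):
    """Compare two peak lists of (mz, intensity) and return illustrative features."""
    pred_int, obs_int, mz_errors = [], [], []
    for mz, intensity in predicted:
        # Nearest observed peak to this predicted peak, if any peaks exist.
        best = min(observed, key=lambda peak: abs(peak[0] - mz), default=None)
        pred_int.append(intensity)
        if best is not None and abs(best[0] - mz) <= tol:
            obs_int.append(best[1])
            mz_errors.append(abs(best[0] - mz))
        else:
            obs_int.append(0.0)  # predicted peak was not observed
    matched = len(mz_errors)
    dot = sum(p * o for p, o in zip(pred_int, obs_int))
    norm = (math.sqrt(sum(p * p for p in pred_int))
            * math.sqrt(sum(o * o for o in obs_int)))
    return {
        "matched_fraction": matched / len(predicted) if predicted else 0.0,
        "mean_mz_error": sum(mz_errors) / matched if matched else tol,
        "intensity_cosine": dot / norm if norm else 0.0,
    }


# Toy peak lists: two of three predicted peaks are found in the observed spectrum.
predicted = [(100.0, 1.0), (200.0, 0.5), (300.0, 0.25)]
observed = [(100.01, 2.0), (200.0, 1.0), (450.0, 3.0)]
features = match_features(predicted, observed)
print(features)
```

In a full pipeline, feature vectors like this one would be computed for many candidate matches and passed to a trained classifier. That step is omitted here because the article does not say which model or features Ionbot uses.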
More generally, Martens said he believes that with growing interest in open searching and the release of tools like MSFragger, Open-pFind, and Ionbot, it is "high time" for the proteomics community to launch some third-party evaluations of these tools.
Chi and his colleagues compared Open-pFind to seven other commonly used search tools, including MSFragger, and found that it outperformed all of them, identifying more distinct peptides in less time and identifying between 12.8 and 94.3 percent proteotypic peptides for proteins identified in common by Open-pFind and another search engine.
Martens suggested, though, that more independent evaluations are needed, and added that it might be something an organization like the Association for Biomolecular Research Facilities could tackle as part of one of its regular informatics challenges.
The need is particularly pressing, he said, given that open search algorithms "are very much the future."
"In five years, everybody will be using [open search software]," he said. "We won't be using traditional search engines anymore. That is my heartfelt belief."