SALT LAKE CITY — There is a growing number of protein-identification software tools on the market, but they all generate very different protein lists — an issue that some in the protein-informatics community are struggling to reconcile.
At the Association for Biomolecular Resource Facilities conference held here this week, a number of sessions focused on the wide range of results that mass-spectrometry experiments can yield depending on the software used.
A survey conducted by ABRF’s Proteome Informatics Research group, or iPRG, highlighted this challenge. Study participants were given a common MS data set and database and asked to provide the protein list they would submit to a journal. While the survey only included 18 participants plus six iPRG members — a number deemed too low to draw any conclusions — it did lead to some interesting findings.
For example, Sean Seymour, senior staff scientist at Applied Biosystems and outgoing chair of iPRG, said that even when respondents used the exact same peptide identification and protein-inference software, they still reported very different protein lists. Overall, he said, there was a “significant difference in the number of proteins” reported by the different groups, but he stressed that the real significance of this is still unclear due to the low number of submissions.
One good sign was that there appeared to be very few protein-inference errors and “no gross inflation” of the number of proteins reported. However, he noted that this may not be representative of the broader proteomics community, and conceded that the results may be slightly biased toward those members of the community who are very comfortable with protein-inference tools.
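The divergence among groups is easier to picture with the inference step in mind: protein inference typically reduces a list of identified peptides to a minimal set of proteins that explains them, and small differences in the heuristic change the reported list. A minimal sketch of one common greedy parsimony heuristic, with hypothetical peptide and protein names:

```python
def parsimony_inference(peptides_to_proteins):
    """Greedy set-cover heuristic: repeatedly report the protein that
    explains the most still-unexplained peptides."""
    # Invert the mapping: protein -> set of peptides it could explain
    protein_peptides = {}
    for pep, prots in peptides_to_proteins.items():
        for prot in prots:
            protein_peptides.setdefault(prot, set()).add(pep)

    unexplained = set(peptides_to_proteins)
    reported = []
    while unexplained:
        # Pick the protein covering the most unexplained peptides;
        # ties broken alphabetically for reproducibility
        best = max(sorted(protein_peptides),
                   key=lambda p: len(protein_peptides[p] & unexplained))
        covered = protein_peptides[best] & unexplained
        if not covered:
            break
        reported.append(best)
        unexplained -= covered
    return reported

# Hypothetical data: peptide -> proteins that contain it
ids = {
    "PEPTIDEA": {"P1", "P2"},
    "PEPTIDEB": {"P1"},
    "PEPTIDEC": {"P2", "P3"},
}
print(parsimony_inference(ids))  # P1 explains A and B; P2 then covers C
```

Even at this toy scale, the tie-breaking rule alone determines whether P2 or P3 appears in the final list — one illustration of why identical peptide identifications can yield different protein reports.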
Seymour said that while one goal of the study was to determine the current state of the field with regard to reporting protein lists, a secondary goal was establishing a “benchmark” for assessing the quality of protein inference for a reference data set — a goal that will not be possible without more study data, he said.
As a result, he said that the group has decided to keep the study open in hopes of getting more submissions. Further information is available here.
Apples vs. Oranges
David Tabb of the Mass Spectrometry Research Center at Vanderbilt University noted that there is a great deal of interest in comparing different database search algorithms, but said that it’s extremely difficult to perform a true apples-to-apples comparison due to the way these programs are designed.
As an example, he described his experience in comparing the performance of his MyriMatch algorithm with that of Sequest and X!Tandem for a paper that appeared in the Journal of Proteome Research last year.
The algorithms are fundamentally different in a number of ways, he said, including the way they perform spectral pre-processing, their criteria for candidate peptide selection, their methods for fragment ion prediction, and their scoring methods. Some of these differences make it impossible to compare algorithms head-to-head, even in the hands of experienced developers, he said. For example, he noted, “there is no way to configure Sequest and X!Tandem to use the same candidate peptides.”
In addition, he noted that these algorithms all use different file formats, and converting from one format to another may introduce errors, but there is no way to determine whether that has occurred. In his case, MyriMatch was designed to read mzData, Sequest to read .dta files, and X!Tandem to read mzXML, so he had to convert the files in order to perform the comparison.
“What changes did we introduce? We’re not sure,” he said.
Tabb cautioned that comparisons of protein search algorithms should not be viewed as a “horse race” in which one algorithm is determined to be the best for all situations. “It may be the case that different subsets of peptides are handled better than others in different search algorithms,” he said.
Some developers are using these differences in protein identification results to their benefit. Protagen, for example, presented a poster at ABRF indicating that by combining four algorithms — Mascot, Sequest, Phenyx, and ProteinSolver — its ability to correctly identify proteins was 44 percent better than that of any individual algorithm at the same false positive rate.
Martin Blüggel, director of Bio-IT at Protagen, told BioInform that the company uses the combined approach for its internal proteomics research and for its protein-services business. He noted that the process is computationally intensive, however, and requires a 156-CPU cluster to run.
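Protagen’s exact combination scheme is not detailed in the poster, but one common way to merge multiple engines is a simple voting scheme that keeps only peptides reported by some minimum number of them. A sketch of that generic approach (engine names from the article; peptide data hypothetical):

```python
from collections import Counter

def consensus_peptides(results_by_engine, min_votes=2):
    """Keep peptide IDs reported by at least `min_votes` engines.
    A generic consensus heuristic, not Protagen's actual method."""
    votes = Counter(pep
                    for peptides in results_by_engine.values()
                    for pep in set(peptides))  # one vote per engine
    return {pep for pep, n in votes.items() if n >= min_votes}

# Hypothetical per-engine peptide lists
results = {
    "Mascot":  ["AAA", "BBB", "CCC"],
    "Sequest": ["AAA", "CCC", "DDD"],
    "Phenyx":  ["AAA", "EEE"],
}
print(sorted(consensus_peptides(results)))  # only AAA and CCC have 2+ votes
```

Requiring agreement tends to trade sensitivity for specificity; real combined approaches typically weight each engine’s scores rather than counting raw votes.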
Other firms are commercializing tools that combine results from multiple search engines. Reifycs, a proteome informatics startup based in Tokyo, presented a poster detailing its ProteinSuite software, which combines results from its own protein-identification algorithm with those of Mascot and X!Tandem. Mitsuhiro Kanazawa, director of Reifycs, told BioInform that the software lists all possible peptides generated by all the algorithms, and then allows users to set their own filtering parameters to determine the final protein list.
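Kanazawa’s description — a merged candidate list that users then filter with their own parameters — might be sketched as follows. The field names and thresholds are illustrative, not ProteinSuite’s actual interface:

```python
def filter_peptides(candidates, min_score=20.0, min_engines=2):
    """Apply user-chosen thresholds to a merged candidate table.
    Fields and defaults are illustrative placeholders."""
    return [c for c in candidates
            if c["score"] >= min_score and c["engines"] >= min_engines]

# Hypothetical merged output from several search engines
merged = [
    {"peptide": "AAA", "score": 35.0, "engines": 3},
    {"peptide": "BBB", "score": 18.0, "engines": 2},  # score too low
    {"peptide": "CCC", "score": 25.0, "engines": 1},  # too few engines
]
print([c["peptide"] for c in filter_peptides(merged)])  # ['AAA']
```

The design point is that the software defers the accept/reject decision to the user: the same merged table can yield different final protein lists depending on the thresholds chosen.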
Proteome Software, meanwhile, was demonstrating a new version of its Scaffold software, which combines the results of Mascot, Sequest, X!Tandem, and Phenyx searches and delivers probability scores for each identification.
Scaffold 2.0, released at the conference, includes new quantitation capabilities and “more robust statistics,” according to Mark Pitman, sales and marketing director for Proteome Software.
Pitman said that there are currently around 180 labs using Scaffold, with about 250 licenses sold. He estimated that the available market for the software is on the order of 600 labs.
Searching Spectra, Not Sequence
Other developers are trying to avoid the inconsistency of sequence searching by circumventing it altogether. Paul Rudnick of the National Institute of Standards and Technology described an effort underway at NIST to compile a library of MS/MS spectra of peptide ions generated by the tryptic digestion of proteins.
The goal of the project, he said, is to help developers build algorithms that will match newly acquired spectra directly to the spectral library rather than against a “theoretical” spectrum derived from sequence information, which is how current protein search algorithms operate.
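Spectral library search typically scores a query spectrum against a library entry with a normalized dot product rather than matching against predicted fragments. A minimal sketch, assuming peaks have already been binned to common m/z values (real implementations match peaks within an m/z tolerance):

```python
import math

def dot_product_match(query, library_spectrum):
    """Normalized dot product (cosine similarity) between two peak
    lists given as {m/z bin: intensity} dicts. Returns 0.0 to 1.0."""
    shared = set(query) & set(library_spectrum)
    num = sum(query[mz] * library_spectrum[mz] for mz in shared)
    norm_q = math.sqrt(sum(i * i for i in query.values()))
    norm_l = math.sqrt(sum(i * i for i in library_spectrum.values()))
    return num / (norm_q * norm_l) if norm_q and norm_l else 0.0

# Hypothetical binned spectra
observed = {114.1: 80.0, 229.2: 45.0, 358.3: 20.0}
library = {114.1: 75.0, 229.2: 50.0, 472.4: 5.0}
print(round(dot_product_match(observed, library), 3))
```

Because the comparison is spectrum-to-spectrum, there is no candidate-peptide selection or fragment-ion prediction step, which sidesteps two of the algorithm-specific design choices Tabb described.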
So far, he said, the library contains around 187,000 consensus human spectra, which is still too few to support an approach that relies on spectral searching alone. Therefore, NIST is currently combining its MS Search 2.0 spectral library search software with the Open Mass Spectrometry Search Algorithm from the National Center for Biotechnology Information in order to develop a hybrid method that draws from the two approaches.
Rudnick said that the combined tool, which first uses spectral searching and then relies on OMSSA for all the unknown peptides, should be released “in the next several months” and will be hosted by NCBI.
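The cascade Rudnick describes — spectral library search first, with a sequence search standing in for anything the library cannot explain — can be sketched as follows. The function names, score threshold, and stub engines are placeholders, not the NIST/NCBI tool’s actual interface:

```python
def hybrid_search(spectra, spectral_search, sequence_search, threshold=0.7):
    """Try the spectral library first; spectra scoring below the
    threshold fall through to a sequence-database search."""
    ids = {}
    leftovers = []
    for name, spec in spectra.items():
        pep, score = spectral_search(spec)
        if pep is not None and score >= threshold:
            ids[name] = pep           # confident library match
        else:
            leftovers.append((name, spec))
    for name, spec in leftovers:
        ids[name] = sequence_search(spec)  # slower, library-free fallback
    return ids

# Hypothetical stubs standing in for the two engines
lib = {"scan1": ("PEPA", 0.95)}
spectral = lambda s: lib.get(s, (None, 0.0))
seqdb = lambda s: "FALLBACK"
out = hybrid_search({"scan1": "scan1", "scan2": "scan2"}, spectral, seqdb)
print(out)  # scan1 resolved by the library, scan2 by sequence search
```

The appeal of the cascade is that the fast library lookup handles previously observed peptides, while the sequence search preserves the ability to identify peptides the library has not yet accumulated.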
Further information about the NIST library, as well as several other similar initiatives to develop spectral searching tools, is available here.