Michigan Team Develops Open Searching Proteomics Software


NEW YORK (GenomeWeb) – University of Michigan researchers have developed peptide identification software that could improve open database searching in mass spec-based proteomics.

Named MSFragger, the software, which was described in a paper published this week in Nature Methods, allows researchers to more rapidly perform searches for unknown and unexpected protein modifications and variants.

In conventional mass spec-based proteomics, experimentally acquired mass spectra are searched against a reference database of predicted mass spectra derived from the gene sequence of the organisms being investigated. However, a wide variety of post-translational modifications, mutations, and splice forms are not included in typical reference databases, making them difficult to identify.

One challenge in identifying the diverse protein forms present in biological samples is the enormous search space required to capture them.

"When you think about conventional database searching, you basically take an experimental spectrum, you find all the candidate-[matching] peptides in a database, and you sort of score them one by one," said Alexey Nesvizhskii, associate professor of computational medicine and bioinformatics at the University of Michigan and senior author on the study. "So, if you have a big database and you have a lot of candidate peptides, it takes a lot of time to score each theoretical spectrum against the experimental spectrum."

This has meant that, in practice, proteomic analyses typically explore one or a small number of modifications at a time, such as phosphorylation. Such commonly studied modifications represent only a fraction of existing proteoforms, however.

The picture becomes even more complex, Nesvizhskii notes, when you take into account the various chemical modifications that may be introduced during sample preparation and other steps.

"There are modifications that are sample protocol-specific and experiment-specific," he said. "Many of them we can explain, but many we cannot."

And then there are unexpected splice forms and mutations, which have grown increasingly relevant as interest has risen in proteogenomic analyses combining protein- and gene-level information.

The MSFragger software, developed by Nesvizhskii and colleagues including Andy Kong, a graduate student in Nesvizhskii's lab and first author on the paper, uses an indexing scheme that organizes both precursor and fragment peptides in such a way that "we essentially score a [given] experimental spectrum against all theoretical peptides at the same time," Nesvizhskii said.

That, he noted, allows the software to match experimental and reference spectra 100 to 150 times faster than conventional approaches, which makes it more feasible to query the massive search space used in open searching.
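The idea of scoring one spectrum against all peptides at once can be illustrated with a toy fragment index. The following Python sketch is not MSFragger's implementation: the function names, the restriction to singly charged b-ions, the small residue mass table, and the 0.02 Da tolerance are all assumptions made for illustration. It sorts every theoretical fragment mass from every peptide into one list, so each experimental peak needs only a binary search to find all the peptides it supports.

```python
from bisect import bisect_left, bisect_right
from collections import Counter

# Monoisotopic residue masses in daltons for a handful of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "K": 128.09496, "E": 129.04259,
    "R": 156.10111,
}
PROTON = 1.00728


def b_ions(peptide):
    """Singly charged b-ion m/z values for a peptide (toy model)."""
    masses, total = [], 0.0
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        masses.append(total + PROTON)
    return masses


def build_fragment_index(peptides):
    """Sort all (fragment m/z, peptide id) pairs once, up front."""
    entries = sorted(
        (mz, pid) for pid, pep in enumerate(peptides) for mz in b_ions(pep)
    )
    return [mz for mz, _ in entries], [pid for _, pid in entries]


def count_matches(peaks, index, tol=0.02):
    """Count matched peaks per peptide using one binary search per peak."""
    mzs, pids = index
    counts = Counter()
    for peak in peaks:
        lo = bisect_left(mzs, peak - tol)
        hi = bisect_right(mzs, peak + tol)
        # A peak supports each peptide at most once.
        counts.update(set(pids[lo:hi]))
    return counts


peptides = ["PEAK", "GASP", "LEVER"]
index = build_fragment_index(peptides)
counts = count_matches(b_ions("GASP"), index)
best_id, best_hits = counts.most_common(1)[0]
print(peptides[best_id], best_hits)
```

In this sketch the cost per spectrum depends on the number of peaks rather than on the number of candidate peptides, which is the property that makes a very large search space tractable. A real search engine would also index precursor masses and use a more elaborate score.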

In the Nature Methods study, the researchers used the software for several analyses, including analyses of datasets from HEK293, HeLa, and triple-negative breast cancer cells. They found a wide range of modifications, including some on amino acids that are not typically considered targets for modification in conventional workflows. They also applied the method to a large protein interaction study, finding that MSFragger increased the number of peptide matches they were able to make by 32 percent compared to a traditional narrow search approach. They applied the method to an RNA-protein cross-linking study as well, finding that the software identified fewer cross-linked species: 163, compared to 189 identified by traditional analysis techniques. However, the method did identify 29 previously unidentified species.

The study also highlighted what the authors said was a potential issue with the calculation of false discovery rates in conventional narrow searches. In traditional database searching, spectra are matched against the reference database and also against a decoy database of sequences known not to be present in the sample, which allows researchers to estimate the likelihood that the peptide matches they are making are false positives.
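A minimal Python sketch of the target-decoy idea follows. It assumes a decoy database of the same size as the target database and uses the simple ratio of decoy hits to target hits above a score threshold as the estimate. The function name and the example scores are hypothetical, and real pipelines use a range of more refined estimators.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoy hits / target hits at or above a score threshold.

    psms is a list of (score, is_decoy) pairs. The estimate assumes that
    incorrect matches land on targets and decoys equally often.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


# Hypothetical peptide-spectrum matches: (score, matched a decoy sequence?)
psms = [(10, False), (9, False), (8, True), (7, False), (6, False), (5, True)]
print(fdr_at_threshold(psms, 6))  # 1 decoy / 4 targets -> 0.25
```

The estimate is only as good as the equal-matching assumption noted in the docstring. If incorrect matches favor target sequences, the decoy count understates the true number of false positives.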

Looking at the 3,773 peptides identified via narrow searches but not by MSFragger, the researchers found that 1,139 of these peptides were likely false positives that were assigned to unmodified peptides in a narrow search. This number of false positives, they noted, was significantly higher than the corresponding decoy set (554 peptides), indicating that the decoys are not providing correct estimates of the false positive levels in the narrow search.
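The size of that gap can be worked out directly from the reported counts. The short Python calculation below uses only the three figures given above. The variable names are illustrative, and the article does not spell out how the paper defines the corresponding decoy set, so the ratios are a rough comparison only.

```python
narrow_only = 3773      # peptides found by narrow search but not by MSFragger
likely_false = 1139     # of those, peptides judged likely false positives
decoy_count = 554       # peptides in the corresponding decoy set

observed_fraction = likely_false / narrow_only    # share judged likely false
decoy_fraction = decoy_count / narrow_only        # share the decoys would suggest
underestimate = likely_false / decoy_count        # how far the decoys fall short

print(f"{observed_fraction:.1%} vs {decoy_fraction:.1%}, ratio {underestimate:.2f}")
```

On these numbers, the likely false positives outnumber the decoy set roughly two to one.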

"What we discover with this open searching is that there are a lot of chemical and biological modifications that are unaccounted for in a regular search, in a closed search," Nesvizhskii said. "And those spectra have to match somewhere, and we think that they predominantly, or with a much higher frequency, match to target sequences. Which sort of violates this assumption of equal matching to targets and decoys."

"We have to understand that and how it affects our false discovery rates and see if we need to correct some of our strategies that have been developed for FDR estimation," he said.

This observation is not new, said Ghent University Professor Lennart Martens, who was not involved in the Nature Methods study. He noted that he and his colleagues published a paper discussing this issue in the Journal of Proteome Research in 2011. As they then wrote, decoy database strategies are effective at distinguishing between correct and random hits, but they "do not model the issue of distinguishing between correct and close [but incorrect]" matches.

And, Martens said, this problem remains largely unresolved. "The bottom line is that any sufficiently large, but more appropriately, diverse, search space can and will confound any search engine," he said. "There's no way around this with current search engines as far as we can tell."

Narrow searching, limiting a search space to a set of known, expected peptides and modified peptides, has been the field's solution to this problem, Martens said. "And it actually works surprisingly well, though it blinds us to what is truly novel."

As the findings from Nesvizhskii and his colleagues suggest, applications like open-ended modification searches abandon the constraints that make narrow searching work, Martens said.

"The important thing is then simply this: can the identification engine distinguish between a good [but incorrect] hit, and the correct hit?" he said. "And in 2011, what we said, was that there aren't any such engines around right now, provided the good hits become 'good enough'. And good hits have a higher chance of becoming 'good enough' when the search space is sufficiently large and diverse."

Martens suggested that the MSFragger software and other open searching approaches developed since the 2011 JPR paper did not fully address this issue. "If there is to be a solution, it will require a dramatic change in the way we identify spectra," he said.

Nesvizhskii also noted that questions remained around how to assess false discovery rates in open searching.

"We did a really careful analysis in our treatment of false discovery rates for open searching," he said. "But like with [narrow] searching, we still have to work more in dealing with some of the artifacts of open searching."

One potential issue is that of chimeric spectra, in which spectra are produced by the co-fragmentation of co-eluting peptides. In such cases, "open searching and narrow searching may disagree, because they can find different peptides in the same spectrum," he said. "So, we need to better understand how frequently that happens and what effect it has on open searching."

Nesvizhskii said his lab is also still working to determine the best way to make protein inferences (determining which proteins are present based on the identified peptides) from the open search results.

"Do we allow all modified peptides to be used as support for protein identifications, or only the most common modifications, and so on," he said. "But we see that even if we take the most conservative approach to protein inference with open search results, only allowing unmodified peptides or peptides you would find in a typical closed search, even with that sort of conservative approach we only lose one or two percent of proteins while we gain 30 or 40 percent of peptide spectrum matches."

"I think there's additional work to sort of improve false discovery rate estimates for open searching and address some other issues, but I think in the future this strategy has the potential to become really useful," he said.