UCSD Team Develops Software to Streamline Metaproteomic Database Searching


NEW YORK (GenomeWeb) – A team led by researchers at the University of California, San Diego has developed a database search tool for metaproteomics work.

Described in a paper published this week in Cell Systems, the tool aims to streamline the search process to allow for more rapid analysis of mass spec proteomics data, which will in turn allow researchers to more effectively search against the large pan-microbial databases used in metaproteomic research.

While proteomic experiments typically focus on a single organism, metaproteomic experiments aim to characterize the proteins present in samples that contain a large array of organisms, many of which may be unknown. This presents a variety of challenges, not the least of which is the time and computational resources required to search reference databases containing proteins from not just one organism but from millions of organisms.

"If you are just doing a search against [the human proteome], it is a relatively easy problem," said Vineet Bafna, a professor of computer science at UCSD and senior author on the study. More difficult, however, "is to go into a new environment [populated by unknown organisms] and do a search."

Bafna and his colleagues set out to address this problem by streamlining the search process in a way that would allow them to quickly discard unlikely peptide-spectrum matches and focus their analysis time on strong candidates.

"Generally speaking, if you want a filtering tool it must have the following properties," he said. "It must be efficient in that it must discard a lot of searches without having to go deep into them. It must be sensitive in the sense that it should not remove the ones that you truly want. And it should be fast."

The tool, which the researchers have named ProteoStorm, uses three modules to speed up the search process. In the first module, the tool takes in silico trypsin-digested peptides from a pan-microbial database and splits them into bins based on their mass. The experimental spectra are likewise organized into bins based on mass. The tool then searches the spectra in a given bin only against the in silico peptides in the corresponding bin, which significantly reduces the search space.
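The binning idea can be illustrated with a minimal Python sketch. It is not ProteoStorm's code: the function names, the 100 Da bin width, the minimum peptide length, and the no-missed-cleavage digest rule are all assumptions chosen for illustration, and a real tool would also have to handle precursor mass tolerance at bin edges.

```python
from collections import defaultdict

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565  # mass of H2O added to the residue sum


def tryptic_digest(protein, min_len=6):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return [p for p in peptides if len(p) >= min_len]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def bin_index(mass, bin_width=100.0):
    """Map a mass to an integer bin (hypothetical fixed-width scheme)."""
    return int(mass // bin_width)


def bin_peptides(peptides, bin_width=100.0):
    """Group peptides by the bin their mass falls into."""
    bins = defaultdict(list)
    for pep in peptides:
        bins[bin_index(peptide_mass(pep), bin_width)].append(pep)
    return bins


def candidates_for(precursor_mass, peptide_bins, bin_width=100.0):
    """Return only the peptides that share a bin with the spectrum's precursor mass."""
    return peptide_bins.get(bin_index(precursor_mass, bin_width), [])
```

Under these assumptions, a spectrum whose precursor mass falls in one bin is compared only with the peptides in that bin, and every other peptide in the database is skipped without being scored.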

"If you have 1,000 spectra and they each need to be searched against 1,000 different peptides, then that is a million searches," Bafna noted. Grouping those 1,000 spectra and 1,000 peptides into 10 bins of 100 each, and searching only corresponding bins against each other, leaves 10 sets of 100-by-100 comparisons, or 100,000 searches, a tenfold reduction.
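The arithmetic in Bafna's example can be checked with a few lines of Python. The function names are hypothetical, and the sketch assumes spectra and peptides are spread evenly across bins, which real data would not guarantee.

```python
def comparisons_without_bins(n_spectra, n_peptides):
    """Every spectrum is scored against every peptide."""
    return n_spectra * n_peptides


def comparisons_with_equal_bins(n_spectra, n_peptides, n_bins):
    """Spectra are scored only against peptides in the matching bin.

    Assumes both sets divide evenly across the bins.
    """
    spectra_per_bin = n_spectra // n_bins
    peptides_per_bin = n_peptides // n_bins
    return n_bins * spectra_per_bin * peptides_per_bin


full = comparisons_without_bins(1000, 1000)
binned = comparisons_with_equal_bins(1000, 1000, 10)
print(full, binned, full // binned)
```

With evenly filled bins the saving equals the number of bins, so finer binning helps until bins become narrower than the instrument's mass accuracy allows.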

In the second module, the tool speeds up the search process by looking at just the b- and y-ions, which allows the researchers to quickly weed out peptides that are not a good match.

"For each database partition, we create all the theoretical b- and y-ions," said Miin Lin, a graduate student in Bafna's lab and co-author on the study. The researchers can then search the experimental spectra against the full set of these ions and discard peptide-spectrum pairs where the b/y ions are a poor match, she said, which allows the researchers to focus the more extensive analysis required to make a confident match only on good candidate peptide-spectrum pairs.
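A rough Python sketch of this kind of b/y-ion prefilter follows. It is an illustration under stated assumptions rather than ProteoStorm's implementation: it considers only singly charged ions of unmodified peptides, and the function names, the 0.02 Da tolerance, and the threshold of four matched ions are hypothetical choices.

```python
from bisect import bisect_left

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def by_ions(peptide):
    """Theoretical m/z values of singly charged b- and y-ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    prefix = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1): N-terminal fragments
        prefix += m
        ions.append(prefix + PROTON)
    suffix = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1): C-terminal fragments
        suffix += m
        ions.append(suffix + WATER + PROTON)
    return ions


def matched_ion_count(peptide, peaks, tol=0.02):
    """Count theoretical ions that have an observed peak within +/- tol."""
    peaks = sorted(peaks)
    count = 0
    for ion in by_ions(peptide):
        i = bisect_left(peaks, ion - tol)
        if i < len(peaks) and peaks[i] <= ion + tol:
            count += 1
    return count


def filter_candidates(peptides, peaks, min_matches=4, tol=0.02):
    """Keep only peptides with enough matched b/y ions for full scoring."""
    return [p for p in peptides
            if matched_ion_count(p, peaks, tol) >= min_matches]
```

The point of such a filter is that counting matched ions is cheap, so the more expensive scoring step runs only on the peptide-spectrum pairs that survive it.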

The third module draws from the MS-GF+ mass spec search tool developed by Bafna's UCSD colleague Pavel Pevzner. One feature of that software package is the ability to quickly calculate p values estimating the confidence of a peptide-spectrum match. Lin said the researchers extracted that code from the larger MS-GF+ package and incorporated it into ProteoStorm, which allows the tool to quickly assess the quality of proposed matches.

Certain elements of the ProteoStorm package have been used previously in proteomic database search software. For instance, Bafna said, the b/y ion indexing used in the second module is also a part of the MSFragger software developed by University of Michigan researchers for open database searching, though he noted that the program is not optimized for metaproteomic work.

In the Cell Systems study, the researchers compared ProteoStorm to several existing search tools, analyzing 7.5 million LC-MS/MS spectra generated from an analysis of urinary pellets from 110 patients with suspected urinary tract infections and five healthy controls. They searched these spectra against a database of 18.8 million microbial sequences, finding that ProteoStorm completed the analysis in 1.79 CPU-days compared to 123.8 CPU-weeks for the MS-GF+ tool.

Searching a subset of 900,000 spectra against the pan-microbial database, they found that at a 1 percent false discovery rate, ProteoStorm identified 13,550 peptides in 9.7 CPU-hours. MS-GF+ identified 12,139 peptides in 22 CPU-weeks, the Comet search tool identified 9,341 peptides in 10.7 CPU-weeks, and MSFragger identified 11,530 peptides in 2.4 CPU-weeks.
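To put the subset figures on a common footing, the short Python sketch below converts the reported CPU-weeks to CPU-hours and divides by ProteoStorm's 9.7 CPU-hours. The input numbers are the ones reported above; the variable and function names are illustrative, and the ratios are simple arithmetic rather than results stated in the study.

```python
HOURS_PER_WEEK = 7 * 24

PROTEOSTORM_PEPTIDES = 13550
PROTEOSTORM_CPU_HOURS = 9.7

# tool -> (peptides identified at 1 percent FDR, CPU-weeks), as reported
REPORTED = {
    "MS-GF+": (12139, 22.0),
    "Comet": (9341, 10.7),
    "MSFragger": (11530, 2.4),
}


def speedup(tool):
    """Ratio of the tool's CPU time to ProteoStorm's CPU time."""
    _, weeks = REPORTED[tool]
    return weeks * HOURS_PER_WEEK / PROTEOSTORM_CPU_HOURS


def peptide_gain(tool):
    """Ratio of ProteoStorm's peptide count to the tool's peptide count."""
    peptides, _ = REPORTED[tool]
    return PROTEOSTORM_PEPTIDES / peptides


for name in REPORTED:
    print(f"{name}: {speedup(name):.0f}x CPU time, "
          f"{peptide_gain(name):.2f}x peptides")
```

On these numbers ProteoStorm used roughly 40 to 380 times less CPU time than the other three tools while reporting more peptides than each of them.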

The observed speed advantage of the ProteoStorm tool suggests it could allow for much broader metaproteomic searches than are currently practical, which, Lin noted, is potentially valuable given the undefined nature of many metaproteomic samples.

"It's important when you don't know what the taxonomic composition of your sample is," she said. "That was one of the major motivations for developing this tool."

Dennis Wolan, an associate professor of molecular medicine at the Scripps Research Institute whose work focuses on microbial environments and metaproteomics, questioned, however, why the UCSD team didn't compare ProteoStorm to existing software packages developed specifically for metaproteomics research.

He suggested that more apt comparisons would have been packages like the MetaProteomeAnalyzer (MPA) tool originally developed by researchers at the Max Planck Institute for Dynamics of Complex Technical Systems in 2015 or the ComPIL tool Wolan developed in collaboration with the lab of his Scripps colleague John Yates.

Wolan, who was not involved in the ProteoStorm work, said that based on the Cell Systems study, he believed the software might represent a small improvement over existing packages, but that he didn't think "it is a vast improvement over what is out there."

He added that "a lot of [metaproteomic software tools] are coming out, and it would be great if someone did a head-to-head comparison to see if there is anything that all the programs are missing that we could work on together."

Regarding a potential comparison to MPA, Bafna said that while that tool integrates the output of several existing mass spec search engines, it is not as focused as ProteoStorm on reducing the complexity of the initial search itself. He suggested that ProteoStorm might work well in combination with MPA, using the former to filter and simplify large sequence databases and the latter to integrate the results.

An analysis of UTI samples in the study demonstrates the potential value of broader search capabilities, Lin said. In an initial UTI study, researchers limited their search database to a list of 20 genera they considered the most likely to be present in infected individuals and identified organisms from 15 of these genera.

When the UCSD researchers used ProteoStorm to expand their search to a pan-microbial database containing organisms from 2,259 genera, they identified organisms from the 15 genera detected in the original experiment as well as organisms from an additional 49 genera.

Bafna noted that while the tool is primarily research-focused, it could potentially inform the development of clinical microbiology testing by helping researchers identify pathogenic organisms that should be included in clinical databases.

He said that he and his colleagues are now working on a proposal to use the software for metaproteomic analyses of breast milk.

"It all comes back to the same idea," Bafna said. "If you don't know what you are looking for, you want a tool that is unbiased."