NEW YORK (GenomeWeb) – A team led by researchers from Ghent University and the Max Planck Institute for Dynamics of Complex Technical Systems has developed a new software suite for metaproteomics research.
Named the MetaProteomeAnalyzer, the software package aims to improve peptide and protein identification in metaproteomics studies as well as downstream analyses such as linking identified proteins to specific organisms or biological functions present within a sample, Lennart Martens, a Ghent researcher and one of the leaders of the project, told GenomeWeb.
Metaproteomics concerns the study of proteins in environmental samples. These samples typically contain a variety of different organisms – including unknown organisms – which can make analyses considerably more complicated than looking at a single known species in isolation.
For instance, Martens noted, even in cases where researchers do have a solid handle on which organisms are present in a sample, good peptide identifications can be difficult to make due to high levels of similarity between organisms present.
Additionally, even if researchers are able to make good peptide matches, matching them to a specific protein expressed by a specific species can be challenging because many samples contain closely related organisms expressing very similar proteins.
With the MPA package, which was detailed in a study published this week in the Journal of Proteome Research, Martens and his colleagues hope to provide better informatics tools for addressing these challenges, with the aim of facilitating a field that, he noted, is still in its very early stages.
"We saw an unmet need," he said of their decision to tackle metaproteomic analysis. "The sophistication of the data processing is inadequate and we saw that there was a lack of interest from the bioinformatics community in supplying strong new solutions, especially for downstream interpretation."
With regard to the first challenge of metaproteomics analysis – matching mass spectra to peptides – the MPA software allows researchers to use four different search engines to obtain the best possible matches.
Because organisms in the same environment often evolve together, many of their protein sequences can be quite similar, Martens said. "If you digest [a metaproteomic sample] in silico with trypsin you get a bunch of peptides that look a lot alike because there are single amino acid substitutions and things like that."
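As a rough illustration of the point Martens describes, the Python sketch below digests two toy protein sequences in silico and compares the resulting peptides. It assumes a simplified trypsin rule (cleave after K or R unless the next residue is P, with no missed cleavages). The two sequences are invented for this example and differ by a single amino acid substitution; none of this is taken from the MPA software itself.

```python
def tryptic_digest(sequence):
    """Split a protein sequence with a simplified trypsin rule.

    Assumption: cleave after K or R unless the next residue is P,
    and allow no missed cleavages.
    """
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep any C-terminal remainder that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


# Hypothetical toy sequences from two "closely related organisms";
# they differ by one substitution (A -> S at the fourth residue).
protein_a = "MKTAYIAKQRQISFVK"
protein_b = "MKTSYIAKQRQISFVK"

peptides_a = set(tryptic_digest(protein_a))
peptides_b = set(tryptic_digest(protein_b))

shared = peptides_a & peptides_b
only_a = peptides_a - peptides_b
only_b = peptides_b - peptides_a

print("shared:", sorted(shared))
print("only in A:", sorted(only_a))
print("only in B:", sorted(only_b))
```

In this toy case three of the four peptides are identical between the two proteins, and the remaining pair differs at a single position, which is the kind of near-duplicate a search engine has to sort out.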
This, he said, means that search engines must identify not just solid hits, but also hits that are less solid but still potentially informative. For instance, a spectrum might be a low-quality hit for a particular peptide but might, in fact, be a good-quality hit to a closely related peptide not represented in the database being searched.
To this end, they included the search engine InsPect, which Martens said is better at coping with partial matches than other search tools.
"The idea is that if you have a bacteria in the sample that you did not expect but also have a close cousin of that bacteria in the sample, normal search engines will [say that spectra from the unexpected bacteria] are not good enough hits to allow an identification," he said. "InsPect can allow a little bit of change in the sequence, so that you can still identify the actual sequence in the sample despite the fact that it is not represented in the database."
Similarly, the X!Tandem algorithm, also included in the MPA suite, has a "second pass" option that allows researchers to do follow-up searches of only proteins identified by at least one peptide in the initial search, which, Martens said, helps with identifying peptides featuring certain amino acid substitutions.
More important than the tool's multiple search engine functionality, however, is how it lets users sort out the downstream functional meaning of their peptide identifications, Martens noted.
"The ultimate goal [of metaproteomics] is not so much what peptides are there, but what species are represented and what functions are represented," he said. Given the variety of related species in a typical metaproteomic sample, however, this can be quite challenging.
"Imagine that you find a peptide and it matches to a particular protein in a particular bacteria," Martens said. "However, because [the sample contains] closely related organisms, this peptide also matches to a similar or the same protein in the bacteria's [close relatives]. So essentially you get evidence for a bunch of proteins."
To tackle this problem, the researchers turned to a graph database approach that allows them to link the various levels of information regarding identified peptides to better sort out how peptides map to proteins to organisms to function.
"We create on the fly a graph database, which is a network between all the bits of information that we have," Martens said. "We create a bunch of nodes – one node can be a peptide that is identified, another can be a protein, another can be a species, another can be, say, an enzyme classification."
"So you can imagine links from a peptide to a protein, from a protein to a species, from a protein to an enzyme classification," he added. "And then, of course, one peptide can link to multiple proteins, and one protein can have more than one enzyme classification, and one protein can have more than a single species, so you get this very complex picture of what this peptide could mean."
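The kind of network Martens describes can be sketched in a few lines of Python. This is only a minimal in-memory illustration using plain dictionaries, not the graph database MPA actually builds. Every peptide, protein and species name here is a made-up placeholder, and the enzyme classification numbers are used purely as example labels.

```python
from collections import defaultdict

# Hypothetical identifications: (peptide, protein, species, enzyme class).
# None of these identifiers come from the MPA study.
records = [
    ("PEPTIDEA", "prot1", "species_X", "EC 1.1.1.1"),
    ("PEPTIDEA", "prot2", "species_Y", "EC 1.1.1.1"),
    ("PEPTIDEB", "prot3", "species_X", "EC 2.7.1.1"),
]

# Undirected graph: each node is a (kind, name) pair.
edges = defaultdict(set)


def link(source, target):
    """Connect two nodes in both directions."""
    edges[source].add(target)
    edges[target].add(source)


for peptide, protein, species, enzyme_class in records:
    link(("peptide", peptide), ("protein", protein))
    link(("protein", protein), ("species", species))
    link(("protein", protein), ("ec", enzyme_class))


def neighbors(node, kind):
    """Names of adjacent nodes of the requested kind."""
    return {name for node_kind, name in edges[node] if node_kind == kind}


def species_for_peptide(peptide):
    """Walk peptide -> protein -> species."""
    found = set()
    for protein in neighbors(("peptide", peptide), "protein"):
        found |= neighbors(("protein", protein), "species")
    return found


print(sorted(species_for_peptide("PEPTIDEA")))
print(sorted(species_for_peptide("PEPTIDEB")))
```

Here the first peptide traces back to two species because it maps to two proteins, while the second traces back to only one.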
The MPA software's graph database function allows researchers to use graph theory to run complicated queries of this multilayered data more simply.
For instance, Martens said, "for a given experiment you could say, this particular function is only performed by this set of proteins that are all linked to the same enzyme classification, and we can trace these proteins back to this peptide and also [this] species. And if you are lucky, you will find that, for instance, a particular species is solely responsible for that function in that microbial environment."
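A query along the lines of Martens' example might look something like the following Python sketch. It assumes the peptide-to-protein-to-species-to-function links have already been flattened into simple tuples. The labels are hypothetical, and the function name `sole_contributors` is invented for this illustration rather than drawn from MPA.

```python
from collections import defaultdict

# Hypothetical links: (peptide, protein, species, enzyme class).
links = [
    ("pep1", "protA", "species_X", "EC_alpha"),
    ("pep1", "protB", "species_Y", "EC_alpha"),
    ("pep2", "protC", "species_Z", "EC_beta"),
    ("pep3", "protD", "species_Z", "EC_beta"),
]


def sole_contributors(links):
    """Map each function seen in exactly one species to that species
    and the peptides supporting it."""
    species_by_function = defaultdict(set)
    peptides_by_function = defaultdict(set)
    for peptide, _protein, species, function in links:
        species_by_function[function].add(species)
        peptides_by_function[function].add(peptide)
    return {
        function: (next(iter(species)), sorted(peptides_by_function[function]))
        for function, species in species_by_function.items()
        if len(species) == 1
    }


result = sole_contributors(links)
print(result)
```

In this toy data set only `EC_beta` is attributed to a single species, so it is the only function reported. `EC_alpha` is supported by proteins from two species and is left out.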
This ability to identify specific biological functions in specific environments and trace them back to their source is one of the key promises of metaproteomics. Metagenomics, which is considerably more developed than its proteomic equivalent, "can only get you so far," Martens said.
"If you find a particular sequence is in the metagenome, that doesn't mean that particular function is available in that eco-system," he said. "It's only when the protein is expressed in large enough numbers and in the right way that this function is active in the eco-system. So [metaproteomics] can really show you a map of what is functionally ticking in these organisms at the moment when you do the sampling."