French Team Using Proteogenomic Technique to Curate Databases for Metaproteomic Research


NEW YORK (GenomeWeb) – The laboratory of French Institute of Environmental Biology and Biotechnology (CEA) researcher Jean Armengaud is developing a mass spec-based proteogenomic approach for assessing the quality of genome assemblies.

The effort is focused primarily on improving the reference genome sequences used in metagenomic and metaproteomic analyses, Armengaud told GenomeWeb this week.

He added that his laboratory plans to use the technique to develop curated reference genome databases that it will sell to commercial users, a service that he said he hopes to begin offering towards the end of the year.

Driven by the rise of next-generation sequencing and improvements in mass spectrometry instrumentation and methods, proteogenomics, which aims to integrate genomic and proteomic data, has become a fast-growing research area.

A primary area of interest for the field has been using proteomic data to improve the annotation of genome sequences: identifying, for instance, portions of the genome annotated as non-coding that do, in fact, produce proteins, or pinning down start codons more precisely.
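As a rough illustration of that idea, the sketch below translates a DNA sequence in all six reading frames, locates exact matches for peptides identified by mass spec, and flags those that do not fall within any annotated coding region. It is a toy example under stated assumptions, not the CEA group's method: the function names, the `(strand, start, end)` interval format, and the exact-match lookup are all hypothetical, and a real pipeline would also check reading frame and handle isobaric residues such as leucine and isoleucine.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_hits(dna, peptide):
    """List (strand, start, end) for each exact peptide match.

    Coordinates are 0-based, half-open, and always on the forward strand.
    """
    hits = []
    n = len(dna)
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                start = frame + 3 * pos
                end = start + 3 * len(peptide)
                if strand == "-":
                    # Convert reverse-strand offsets to forward coordinates.
                    start, end = n - end, n - start
                hits.append((strand, start, end))
                pos = protein.find(peptide, pos + 1)
    return hits


def unannotated_peptides(dna, peptides, annotated_cds):
    """Return peptides that map to the genome but to no annotated CDS.

    annotated_cds is a hypothetical list of (strand, start, end) intervals.
    Simplification: reading frame within the CDS is not checked.
    """
    novel = []
    for pep in peptides:
        hits = six_frame_hits(dna, pep)
        covered = any(
            s == strand and cs <= start and end <= ce
            for strand, start, end in hits
            for s, cs, ce in annotated_cds
        )
        if hits and not covered:
            novel.append(pep)
    return novel


if __name__ == "__main__":
    genome = "CCATGAAAGTTTAAGG"  # made-up sequence for illustration
    print(six_frame_hits(genome, "MKV"))
    print(unannotated_peptides(genome, ["MKV"], annotated_cds=[]))
```

A peptide reported by `unannotated_peptides` would be a candidate for a missed gene or a misplaced start codon, to be confirmed by further evidence rather than taken as proof on its own.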

Such approaches could prove particularly useful in metagenomics and metaproteomics given the complexity of the samples involved and the fact that many of the organisms under investigation are not especially well characterized at the molecular level.

Metagenomics and metaproteomics concern the study of genes and proteins in naturally occurring environmental or medical samples. Such samples typically contain a variety of organisms, including unknown ones, which can make analyses much more challenging than those looking at a single known species in isolation.

In a commentary published last month in Proteomics, Armengaud and his CEA colleague Olivier Pible discussed the challenges of meta-omics analyses, highlighting in particular the need to improve the genomic and proteomic databases used in such work.

In the paper, they note the issue of incomplete or inaccurately constructed genome sequences and the challenges this presents to metaproteomics work, which must rely on these sequences as the basis of the reference databases used for searching mass spec data. While reference genome quality is to some extent an issue in all proteomic work, it is a particular concern in metaproteomics given the large number of organisms being studied and the fact that many of their genomes have not been as thoroughly studied as those of humans and major model organisms.

For instance, in the Proteomics paper, the authors cite studies of the sea bacterium Roseobacter denitrificans that found that almost 10 percent of its annotated start codons were erroneous. Likewise, they noted, proteogenomic analyses found that the genome of the bacterium Ruegeria pomeroyi contained a number of sequencing errors.

And while such errors can be corrected when identified, "most genomes unfortunately remain in their original forms, contributing to the propagation of annotation errors to other closely related, subsequently established genomes," they wrote.

Additionally, contamination introduced during sample handling, for instance, can contribute to errors in reference databases, Armengaud said.

With these issues in mind, he and his colleagues are using their proteogenomic approach to clean up these databases.

"When we learn that some of these genomes are bad in terms of sequencing errors or assembly errors or contamination, we systematically remove them from the databases [we use]," Armengaud said. He added that they are in the process of preparing a paper for publication detailing the method.

Armengaud and his colleagues are not alone in developing improved metaproteomic methods. For instance, in February, a team led by Ghent University researchers published a paper describing a new software suite for metaproteomics research.

Named the MetaProteomeAnalyzer, the software package aims to improve peptide and protein identification in metaproteomics studies as well as downstream analyses such as linking identified proteins to specific organisms or biological functions present within a sample.

The software makes use of multiple search engines to improve peptide matching. It also allows researchers to run complex searches linking proteins to specific functions in specific environments. This helps sort out how peptides map to proteins, and in turn to organisms and functions, which could help with the identification of proteins and organisms in a sample.