NEW YORK (GenomeWeb) – The Human Proteome Organization's Chromosome-Centric Human Proteome Project (C-HPP) continues towards its goal of identifying every protein in the human proteome, with a recent update placing the group's overall coverage at roughly 85 percent.
That coverage figure will likely rise with the publication this summer of a special issue on the project in the Journal of Proteome Research, said Gilbert Omenn, chair of the HPP and director of the Center for Computational Medicine and Bioinformatics at the University of Michigan.
Based on the NextProt version 2014-09-19 and PeptideAtlas 2014-08 databases, the C-HPP consortium has yet to identify 2,948 proteins out of a total of 20,055 protein-coding genes in the human proteome, Omenn told GenomeWeb this week. He added that a number of papers to be published in the forthcoming JPR special issue report protein findings from that set of 2,948 and that these findings will be incorporated in the next versions of NextProt and PeptideAtlas.
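For readers who want to check the arithmetic, the roughly 85 percent coverage figure follows directly from the two counts Omenn cited. The short Python sketch below is purely illustrative: the function and variable names are hypothetical and are not part of any C-HPP, NextProt, or PeptideAtlas tool, and it assumes coverage is simply the share of protein-coding genes with an identified protein.

```python
def coverage_percent(total_genes: int, missing: int) -> float:
    """Return the share of protein-coding genes with an identified protein, in percent.

    Assumes coverage = (total - missing) / total; this definition is an
    illustration, not the consortium's published method.
    """
    if total_genes <= 0 or not 0 <= missing <= total_genes:
        raise ValueError("counts must satisfy 0 <= missing <= total_genes and total_genes > 0")
    return 100.0 * (total_genes - missing) / total_genes


# Counts quoted in the article (NextProt 2014-09-19, PeptideAtlas 2014-08).
TOTAL_GENES = 20_055
MISSING = 2_948

identified = TOTAL_GENES - MISSING
print(identified)                                        # 17107
print(f"{coverage_percent(TOTAL_GENES, MISSING):.1f}%")  # 85.3%
```

Under that assumption, 17,107 of the 20,055 genes have an identified protein, or about 85.3 percent.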
Several of the papers also provide estimates of the number of predicted proteins that won't be feasible to detect via mass spectrometry, Omenn said, though he did not have an exact number yet.
In a paper published this week in JPR, the C-HPP group provided an update on its recent progress, including a discussion of issues hindering discovery of the remaining proteins and potential ways forward.
First proposed at the 2010 HUPO meeting, the C-HPP formally launched at the same meeting two years later. The project calls for participating countries to take one of the human chromosomes and characterize one representative protein for each gene located on the chromosome with the ultimate aim of characterizing the entire human proteome.
The effort consists of two phases. The first, which is ongoing and slated to run through 2018, is focused primarily on mapping and characterizing the remaining uncharacterized proteins as well as their post-translational modifications, alternative splicing transcripts, and non-synonymous SNPs.
The second phase, which is planned to run from 2018 to 2022, will focus primarily on validating data from the first phase along with functional studies and developing drug targets and biomarker candidates.
The project has made steady gains since its launch, bringing the number of uncharacterized proteins down from an estimated 6,000 at the 2012 HUPO meeting, to roughly 3,500 to 4,000 in late 2013, to the 2,948 that remain outstanding today. Progress has lagged behind some predictions, however. In a 2013 interview, Northeastern University researcher William Hancock, one of the co-chairs of the C-HPP, told GenomeWeb he anticipated that the group would complete its initial goal of characterizing the roughly 20,000 proteins in the human proteome by the end of 2014.
Speaking to GenomeWeb this week, Hancock noted the variety of challenges facing researchers as they work to winnow down the list of outstanding proteins. Among them are proteins, such as olfactory receptors, that are thought to be expressed only in specific, rare tissues; proteins, such as membrane-associated proteins, that are not amenable to sample prep methods commonly used in mass spec workflows; and proteins expressed at too low an abundance to be readily detected by mass spec.
Another challenge, he noted, is annotation of the human genome itself. As the JPR authors wrote, some of the unidentified proteins may be the result of "erroneous annotation of the genome, which results in incorrectly predicted protein sequences."
The study published this week highlighted several approaches the chromosome teams are using to track down the remaining proteins. For instance, in the case of proteins expressed only in specific tissues or cell types, proteogenomic methods combining RNA analysis with targeted mass spec could enable researchers to use mRNA data to identify the specific sample type in which the corresponding proteins are most likely expressed. This could then be followed by targeted mass spec analysis using multiple-reaction monitoring or antibody enrichment to detect such proteins.
Another way forward is to use different workflows to "expand the chemical space" in which researchers are able to search effectively. The JPR study cited work by the C-HPP's Chinese team, which found in an analysis of multiple tissues and cell lines that hydrophobicity and low molecular mass are key properties for predicting unsuccessful detection of a protein.
The authors called for the use of more specific enrichment methods for targeting such difficult proteins, citing in particular work using a two-dimensional chromatography combining high-pH reversed phase strong anion exchange and low-pH RP stationary phases that increased identification of missing membrane proteins.
Beyond working on ways to expand proteome coverage, Hancock cited as a major achievement of the C-HPP the stimulation of development of the ProteomeXchange resource, which allows researchers to submit a variety of proteomics data that can then be moved and shared between various repositories including the European Bioinformatics Institute's PRIDE repository and the Institute for Systems Biology's PeptideAtlas and PASSEL repositories.
Data sharing is of high importance to the C-HPP, which mandates submission of all raw mass spec data generated by the project. This allows for reanalysis by other researchers as well as resolution of disputes between researchers regarding the quality of data.
And, as the project continues to delve deeper into the proteome, questions of data quality will likely take on even more weight. As Swiss Federal Institute of Technology Zurich researcher Ruedi Aebersold observed at the 2014 HUPO annual meeting, the field has reached "the phase where there is quasi saturation of shotgun [mass spec] discovered proteins."
When dealing with such large, saturated datasets, confidently claiming new identifications is statistically difficult, he said, because these identifications will typically be within the margins of error.
With this in mind, "there is quite a bit of effort [within the C-HPP] going into improving the quality of identifications in terms of limiting false positives," Hancock said, "such as the length of a peptide – how many residues do you need [for a confident ID]? Seven or nine?"
On the other hand, he noted, "if you are after alternative splice variants, it may well be that there is one small difference in one small region [of a peptide], and so that may be all that you have."
"How can you do a better job there?" Hancock asked. "Once you identify the peptide with a small difference and you have good quality MS/MS spectra, do you then have to get a synthetic peptide and match its fragmentation spectra?"
"So there is all this work going on," he said.