NEW YORK (GenomeWeb) – The Global Proteome Machine database (GPMDB) has released the latest version of its Guide to the Human Proteome, which contains strong evidence for 15,616 of the roughly 20,000 proteins in the human proteome.
The count is similar to that of other human proteome mapping projects like the Human Proteome Organization's Chromosome-Centric Human Proteome Project, which has identified around 16,000 human proteins, and the Human Protein Atlas database, which in its latest version contains data on proteins from 16,975 genes. This suggests a rough consensus in the field that between 15,000 and 17,000 human proteins have been detected thus far.
The 15,616 number represents proteins that have been identified with the highest level of confidence, Ron Beavis, a researcher at the University of Manitoba and head of the GPMDB, told GenomeWeb. Including proteins for which there is any evidence of their existence, the resource has identifications for roughly 97 percent of the predicted protein-coding genes in humans, he said.
The GPMDB collects data from other public repositories like PeptideAtlas, reprocessing the raw files and uploading them to the resource. According to Beavis, the database currently contains roughly 280,000 LC-MS/MS experiments and around 2.1 billion peptide IDs. Since its launch in 2004, the amount of data in the GPMDB has doubled each year, he said.
In terms of filling out the missing portions of the proteome, several classes of proteins stand out as having evaded identification to date, Beavis noted, giving as an example olfactory receptor proteins, which are commonly cited as a troublesome set of proteins in discussions of human proteome mapping.
"They only exist in very small amounts in a very specific tissue, so we don't really see those," he said. "And there are quite a few of those – maybe 700 or 800 genes."
He also cited oocytes, the thymus, and various portions of the brain, like the pituitary gland, as cells or tissues containing a number of specific proteins not yet confidently identified.
Beyond that, Beavis said, roughly three to four percent of the human proteome is likely not amenable to detection via mass spec.
For instance, he said, "The mitochondrial chromosome has 13 proteins on it, and 12 of those have been seen quite nicely, but one of those has never been seen and it just doesn't have a tryptic peptide on it."
Even if researchers went after it using a different enzyme for digestion, it would still likely elude them due to how firmly embedded in the membrane it is, he added. "That particular protein is quite short and has three or four membrane spanning domains with small beta turns between the domains – so this is just something that is stuck in the membrane, and chances are it's just impossible to get the thing out."
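To illustrate what lacking a usable tryptic peptide means, the following is a minimal sketch of an in silico trypsin digest, not the GPMDB's own pipeline. It assumes the common rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P), and it assumes a detectable length window of 7 to 30 residues, which is an illustrative choice rather than a figure from Beavis. The function name and both example sequences are hypothetical and do not represent the mitochondrial protein he described.

```python
def tryptic_peptides(sequence, min_len=7, max_len=30):
    """Return fully cleaved tryptic peptides within an assumed detectable length window.

    Cleavage rule (a common simplification): cut after K or R,
    except when the following residue is P.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        cleaves = residue in "KR" and not is_last and sequence[i + 1] != "P"
        if cleaves or is_last:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Keep only peptides inside the assumed length window.
    return [p for p in peptides if min_len <= len(p) <= max_len]


# Hypothetical soluble-like sequence: one cleavage site yields a usable peptide.
soluble_like = "AAAAAAAKPEPTIDEKAAR"
# Hypothetical hydrophobic sequence with no K or R: nothing for trypsin to cut.
membrane_like = "M" + "L" * 40

print(tryptic_peptides(soluble_like))
print(tryptic_peptides(membrane_like))
```

In this sketch the second sequence yields no peptides in the window, because the only product is the intact 41-residue chain. A real search would also allow missed cleavages and would weigh hydrophobicity, which this example ignores.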
Beavis noted that beyond the basic research aims of the proteome mapping projects like the C-HPP, he saw the main benefit of such efforts as being increased collaboration between the proteomics field and nucleic acid researchers.
Traditionally, he said, "proteomics [researchers] have very much set themselves aside from genomics [researchers]. The way proteomics sequence collections are organized are all very protein-centric rather than looking at it in terms" of the relationship between proteins and their corresponding RNAs and genes.
"People in proteomics are becoming far more aware of setting their results in the context of what is going on in the genome," he said. "So they are becoming far more interested in things like single nucleotide variants and how they affect amino acid variants – things that before the C-HPP project weren't really talked about very much in proteomics even though they are very important."
Beavis noted that while the different fields are working more closely together of late, some outside the proteomics field remain suspicious of its data, particularly on the mass spec side.
"There are still people who insist people do western blots to prove proteins are there, even though western blots are so outdated now it would be like using a northern blot to confirm that a gene is on a chromosome," he said.
He added, though, that data quality is still an issue for the field, noting that he and his colleagues "reject about 25 to 30 percent of datasets that come out of journal papers because of one sort of quality issue or another."