SIENA, Italy — UniProt this week released a “complete” set of annotated human proteins containing 20,325 human protein entries originating from about 20,400 protein-coding genes.
The release, which comes eight years after the project was first unveiled at the bi-annual Siena Meeting here, represents only a first draft of all the proteins in humans. Much work remains in deciphering their roles in biology, but the set may provide an important point of reference for further analysis.
The announcement was made during the Eighth Siena Meeting last week by Amos Bairoch, a professor of bioinformatics at the University of Geneva and group leader at the Swiss Institute of Bioinformatics.
While Bairoch presented the database as a “complete” set of annotated human proteins, he also said that the current entries in the database almost certainly will change with time. He said that he and his colleagues at UniProt are uncertain about the existence of about 400 of the proteins and expect them to be replaced eventually.
“We believe they don't exist, but we are not certain enough to delete them” from the list yet, he said.
In a similar vein, he said the list of proteins was drawn from “about” 20,400 protein-coding genes because the mapping between genes and proteins is not always clear: some proteins are encoded by more than one gene, and a single gene can encode two or more proteins that have nothing in common in terms of their sequence.
To arrive at the 20,325 figure, the UniProt researchers combined several strategies. They included all protein-coding genes with a HUGO Gene Nomenclature Committee name, as well as all the predicted Ensembl genes that have been validated by studies.
They then took all the proteins from the Consensus CDS database and all proteins referenced in the Online Mendelian Inheritance in Man database together with validated proteins “from a set of cDNA projects where we thought that there were interesting things.”
In total, Bairoch and his colleagues pored through 45,000 papers to get at their results, he said.
Of the 20,325 proteins they identified, more than 11,000, or 56 percent, were included because they “had some information at the protein level that they exist,” Bairoch said. Another 8,000 proteins, or almost 40 percent, showed evidence of their existence at the transcript level. Slightly fewer than 300 proteins, or 1.4 percent, were inferred by homology.
He and his co-researchers included 140 proteins, or 0.7 percent, from Ensembl where there was no information about the gene, but “it looked good — basically, they looked like real genes,” he said.
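For readers who want to check the breakdown, here is a minimal sketch of the arithmetic in Python. It assumes the rounded counts quoted above (11,000, 8,000, 300, and 140), which are approximations and not exact UniProt tallies; the category labels and the helper name `share_percent` are illustrative only.

```python
# Total number of human entries reported in the release.
TOTAL_ENTRIES = 20_325

# Rounded counts as quoted in the article (assumed, not exact tallies).
approx_counts = {
    "protein-level evidence": 11_000,
    "transcript-level evidence": 8_000,
    "inferred by homology": 300,
    "Ensembl predictions": 140,
}


def share_percent(count, total=TOTAL_ENTRIES):
    """Return count as a percentage of total."""
    return 100.0 * count / total


shares = {label: share_percent(n) for label, n in approx_counts.items()}

# Entries not covered by the rounded counts above.
unaccounted = TOTAL_ENTRIES - sum(approx_counts.values())

for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
print(f"not covered by the rounded counts: {unaccounted}")
```

Because the article gives "more than 11,000" and "a little less than 300" only as rounded figures, the computed shares will not match the quoted percentages exactly, and the rounded counts do not sum to the full 20,325 entries.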
Along with sequences, the database includes sequence variants, including information about 46,000 single-amino-acid polymorphisms, of which 21,000 are linked with diseases. It also includes 13,500 protein isoforms covering 7,300 protein entries in the database, “which means that currently we already have 35 percent of the protein-coding genes [in UniProt] which code for at least two different protein sequences,” Bairoch said.
On post-translational modifications, UniProt has 60,000 experimentally confirmed or predicted PTMs. “If you want, there is not only sequence, but [also] information around it,” Bairoch added. “It's basically quite a lot of data.”
Bairoch previewed his findings in Amsterdam last month during the Human Proteome Organization's annual conference. There, Mathias Uhlén, director of the Human Protein Atlas, said that because of the funding headwinds his group faces, a resource such as the UniProt database could be especially helpful in creating a comprehensive map of the human proteome.
Still, to Bairoch the results reaffirm his belief that knowledge about the function of proteins is “appallingly” low. “We have almost no information on the exact role of most of the actors in the complex play of the human proteins,” even with well-known and well-studied protein families, he said.
Citing as an example G-protein-linked receptors, which pharmaceutical companies have studied extensively, he said that almost 100 such receptors have no published information about their ligands.
“Even for the 56 percent [of proteins in the UniProt database] where we have protein evidence, there's still a lot to be done in proteomics,” Bairoch said. “Proteomic study of the human proteome is still not half-done; far from it. As a protein, we only know that they exist, but we don't know their location, with which proteins they interact, the tissue specificity, and so on.”
The database also remains a work in progress. Along with the approximately 400 entries that will likely be deleted from it, just as many, if not more, will have to be added as they are discovered, Bairoch said. In particular, he cited small protein-coding genes as an area where work still needs to be done. A stable, truly complete set of human proteins will take years to develop, he said.
The existing 20,325 proteins will also have to be continuously annotated. Functional information about the protein and correct sequences will need to be continually added, as will polymorphisms, splice variants, domain information, and other data. And the field is only in the beginning stages of discovering the full extent of protein modifications, he added. As more data is published about PTMs, the challenge will be to “keep up with this massive influx of highly important data,” Bairoch said.
“There is no way genomic information is going to give us that information about PTM, no way that bioinformatics is going to help us completely predict post-translational modifications,” he said. “So basically proteomics and those groups that are building up their teams [looking at this area] are going to be crucial.”