Informatics scientists usually sum up the identity of a protein by giving it an accession number. Biologists sum up the identity of a protein or gene by an abbreviation for an often cryptic phrase describing the molecule. Connecting these two seemingly innocent things is one of the most challenging problems facing system designers for large-scale proteomics projects.
A quote from the 1996 Newsletter of the IUPAC and IUBMB Committees charged with solving the underlying problem of naming proteins sums up the situation: “Protein nomenclature, an outstanding example of a problem that is in need of solution but which has seen little or no progress ... during the many years of existence of the successive nomenclature committees of IUBMB.”
Let us imagine, for a moment, that we could look into the proceedings of a fictitious committee, evenly divided between biology and bioinformatics panelists, trying to solve this problem.
A bioinformatics panelist starts off, “From our point of view, the problem is so simple that it is almost a tautology. A protein is defined by its amino acid sequence, as translated from its mRNA sequence, as spliced from its primary transcript RNA, as transcribed from its genomic DNA. The simplest way to refer to the protein sequence is its database-specific identifier. If you could assign a name to every one of these identifiers we’d be done.”
One of the senior biologists answers, “Naming these things is more an art than a science, for reasons that range from the practical to the historical. The word ‘protein’ itself is actually much older than the concept of sequence. The idea that there are different types of proteins came from the observation that consistent fractions could be extracted from a natural source — such as blood or wheat flour — by changing the extraction conditions. These fractions had different properties, but they were all approximately the same in their chemical composition, consisting of nitrogen, carbon, oxygen, hydrogen, and usually sulfur.”
A protein chemist joins in, “The name of a protein and its fractionation behavior became synonymous and that use survives to the present day. ‘Prealbumin’ was the name given to a protein that runs slightly faster than serum albumin on an acidic starch gel. The original prion protein, PrP 27-30, was named 27-30 because it ran on an SDS-PAGE gel in a broad fraction between 27 and 30 kDa.”
Another biologist adds, “But it’s more complicated than that. When it was realized that some of these named fractions had specific functions, such as enzymatic activities, it became much more useful to name proteins by their function, if one could be found. Because of this change in concept, the original prealbumin fraction has been dissected into α-1-antitrypsin, orosomucoid, and thyroxine-binding prealbumin. Unfortunately, thyroxine-binding prealbumin is often referred to clinically as simply prealbumin, rather than its proper name, transthyretin.”
A bioinformatics panelist interjects, “OK, but what about the sequence? Surely the unique chemical structure of a molecule can be used to name the protein?”
A structural biologist colleague replies, “Understanding the molecular nature of proteins transformed the idea of what was contained in these isolates, but it was found that there may actually be two or more different fractions associated with a particular activity, even though neither fraction individually has the activity. This led to the notion that a functional protein may be composed of different subunits that combine together to produce the function. Each of these subunits was itself a peptide chain, each originating from a separate gene. This concept, and better isolation methods, led to a profusion of functional subtype-naming systems, e.g., glycophorin A and glycophorin B (aka major surface sialoglycoprotein α and major surface sialoglycoprotein δ). And you must remember, even single protein chains are not unique molecular species: glycophorin A is actually thousands of different molecules, because of variable O-linked glycosylation.”
From the bioinformatics standpoint, this is all well and good, but there are still a bunch of amino acid sequences in a database that need names. Isn’t there a way to cut the knotted tangle of names associated with biological proteins and apply them to simple gene translation proteins?
At this point in our envisioned conversation, there is some name-calling and general decorum breaks down.
Fortunately for everybody, the answer to the final question actually is “yes, there is a way,” but it has required more than a simple nomenclature committee to do it, and it is being driven by a truly multidisciplinary approach. The Human Genome Organization (HUGO) has by fiat created the HUGO Gene Nomenclature Committee (HGNC), which has been quietly naming all of the genes in the human genome. HGNC has set up a simple set of rules that produce names that can easily be abbreviated into reasonably short acronyms, slicing the knot in a way that would have done Alexander proud.
The names are a snapshot of our current understanding of the function of a gene, even if “secreted frizzled-related protein 1” sounds a little arbitrary to the non-specialist. Henceforth, says the HGNC, this gene and the peptide sequences translated from it will be referred to as SFRP1 (or the number 10776). Another effort, Online Mendelian Inheritance in Man (OMIM), based at Johns Hopkins, has taken up the challenge of placing these symbols in the context of the biomedical literature. OMIM gives alternative names, symbols, and descriptive histories of how the names came about, what is known about how gene products are assembled into a functional protein, and much more.
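For system designers, the practical upshot might look something like the sketch below: a small lookup that ties a database accession number to an approved symbol and its descriptive name. This is only an illustration under stated assumptions. The SFRP1 symbol, the number 10776, and the descriptive name come from the text above; the accession string `EXAMPLE_ACC_0001`, the `GeneRecord` class, and both function names are hypothetical and do not correspond to any real database or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    """One approved gene name (hypothetical structure, for illustration)."""
    hgnc_id: int
    symbol: str
    name: str
    aliases: tuple = ()  # older or alternative names, if any are known


# The only real entry: the example given in the text.
RECORDS = [
    GeneRecord(10776, "SFRP1", "secreted frizzled-related protein 1"),
]

# Hypothetical placeholder accession; a real system would load this
# mapping from whichever sequence database it uses.
ACCESSION_TO_HGNC = {"EXAMPLE_ACC_0001": 10776}


def build_index(records):
    """Index records by numeric ID and by every known label (case-insensitive)."""
    by_id = {r.hgnc_id: r for r in records}
    by_label = {}
    for r in records:
        for label in (r.symbol, r.name, *r.aliases):
            by_label[label.lower()] = r
    return by_id, by_label


def symbol_for_accession(accession, by_id, mapping):
    """Return the approved symbol for an accession, or None if unmapped."""
    hgnc_id = mapping.get(accession)
    if hgnc_id is None:
        return None
    record = by_id.get(hgnc_id)
    return record.symbol if record else None


by_id, by_label = build_index(RECORDS)
print(symbol_for_accession("EXAMPLE_ACC_0001", by_id, ACCESSION_TO_HGNC))
```

Keying everything on the stable number rather than on the symbol is a deliberate choice in this sketch: symbols and descriptive names can be revised as understanding of a gene changes, while the number stays put.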
It remains to be seen how the general community will react to this solution, but all indications are that this mixture of organizations, committees, and curators has finally cracked the sociological problem of protein nomenclature for Homo sapiens. That’s one species down, a few hundred thousand to go …
Ron Beavis has developed instrumentation and informatics for protein analysis since joining Brian Chait’s group at Rockefeller University in 1989. He currently runs his own bioinformatics design and consulting company, Beavis Informatics, based in Winnipeg, Canada.