Protein Analysis is a new frontier calling out for Pioneers
by Nat Goodman
Proteomics is biology’s next wild frontier. Lots of great ideas are competing to stake out the territory, but none has emerged yet to lead the charge. No question, this stuff will be important; proteins are the workhorses of the cell. But where, exactly, the riches lie remains a mystery. If you’re getting into proteomic informatics, plan on blazing your own trail. I suggest you bring plenty of food and water.
The central problem in proteomics is to identify the proteins present in a biological sample and to quantify their abundance. From 35,000 feet up, this looks just like the problem of analyzing gene expression profiles that I wrote about last month, the obvious difference being that now we’re concerned with proteins instead of mRNA transcripts. At ground level, however, major differences in biology and biotechnology make these problems quite dissimilar. This means you’ll need different databases and software to support a proteomics effort.
I’ll organize my exploration of this territory around the challenge of creating a general-purpose proteomics database. This is not a terribly realistic destination, as it will probably be several years before anyone builds such a database. But it’s a good way to uncover the range of issues that will arise. Near term, most practicing computational biologists will deal with subsets of this whole.
From a bioinformatics perspective, the key difference between proteomics and transcriptomics is that proteins have many more attributes that affect their biological function. The data models we create for proteins will have to be a lot more complicated than those we have traditionally used for transcripts.
For an mRNA transcript, the sequence is basically all you need to know. There are a few cases where the two-dimensional structure of an mRNA molecule affects its function, but these situations are apparently rare. A beautiful example, summarized in the September 28 issue of Nature, is a downstream stem-loop structure that causes a stop codon to be translated into the atypical amino acid selenocysteine.
This simple view of mRNA function may be illusory, and further biological research may reveal that structure plays a key role in regulating splicing or who-knows-what. But for now, as bioinformaticists, we are quite happy to coast along with our simple, sequence-based models of transcripts.
For proteins, on the other hand, sequence is just the starting point. Scientists travel great distances to characterize other attributes of these molecules, which are often the essence of the biological phenomenon under study. There is no way we can squeak by in proteomics with the simple data models that have served us so well before.
Protein structure is one important attribute that must be represented in a proteomics database. In many cases, proteins can adopt multiple different shapes, or conformations. Often, a change in conformation drives a change in function. Take prions: In one conformation, these proteins play a benign role in the cell. But in another, they become heritable disease factors associated with devastating illnesses such as mad cow disease.
Less dramatically, transmembrane receptors — whose job is to transmit signals across the cell membrane — often operate by changing the shape of their “output” end when a signal arrives at their “input” end.
I don’t imagine that a proteomics database would store structures per se, since our friends in structural informatics do this job quite well. But the database must represent the fact that proteins can have multiple conformations and keep track of them.
Another important class of attributes is the enormous variety of reversible chemical modifications that proteins can undergo, such as phosphorylation, glycosylation, acetylation, and on and on (generically called post-translational modifications). The ExPASy (Expert Protein Analysis System) Web site of the Swiss Institute of Bioinformatics lists 27 post-translational modifications that are handled by the software tools available at their site. Many biological processes proceed by applying or removing these types of modifications to or from particular amino acids in the proteins involved in the process. For example, signal transduction (the process of conveying signals from one part of the cell to another) generally works by successive phosphorylation of the proteins in the signaling pathway. Kinases are the proteins that add phosphate groups, and phosphatases are the ones that remove them. A proteomics database has to represent these changing modifications in order to faithfully reflect these sorts of biological processes.
There are a few Web sites devoted to post-translational protein modifications. Amos Bairoch’s page of life sciences Web links (a must-visit!) lists four such sites. Two of the four sites are devoted to glycosylation, one covers phosphorylation, and the fourth, operated by the National Biomedical Research Foundation, covers a broader range of modifications.
I checked out two of the sites whose databases were easy to download to get a quick sense of completeness. One, O-GlycBase, contains information on 198 glycosylated proteins. The other, PhosphoBase, contains information on 413 phosphorylated proteins. These numbers are so low that it’s clear these sites are just scratching the surface. There’s a lot more work to be done here.
Another important attribute is the cellular location of the protein. Proteins have to be in the right place to do their jobs. Obviously, a transmembrane receptor has to find its way to the cell membrane to function correctly, while a transcription factor, which regulates the expression of genes, has to make its way to the nucleus. The movement of proteins from one part of the cell to another, or even from one cell to another, is a key aspect of many biological processes.
Errors in localization can lead to disease. For example, one model of how Huntington’s Disease works is that a fragment of the mutant protein moves into the nucleus where it doesn’t belong; once there, it wreaks havoc with the activities of various transcription factors.
Further complications arise from the fact that even the sequence of a protein can change over its lifetime. Many proteins have inactive precursor and active mature forms.
A common mechanism for changing between these forms is for a piece of the inactive protein to be removed by a protein-cutting enzyme, or protease. This results in a mature protein whose sequence is shorter than that of its precursor, and whose structure and chemical properties are generally quite different. For example, apoptosis, the process of programmed cell death, is turned on through successive activation of caspases and other proteins in this manner. The reverse process, in which a protein is constructed by connecting shorter pieces, is also possible, though apparently rare. Gramicidin, a potent, naturally occurring antibiotic, is an example of a protein constructed by joining together smaller precursor elements.
Proteins circle their wagons
A final issue is that many proteins are “social creatures” and do their work by forming complexes with other molecules. The properties of these complexes can be quite different from those of the individual proteins — in fact, the proteins by themselves may do nothing of interest.
For example, all the major events in the life cycle of gene and protein expression are governed by such complexes. These include transcription initiation, which is driven by a complex that includes RNA polymerase II and numerous transcription factors; splicing, which requires the spliceosome; translation, which occurs at the ribosome; and protein destruction, which is the province of the proteasome. A general-purpose proteomics database has to represent protein complexes as well as individual proteins in order to capture the essential biology.
The biotechnology for identifying proteins is also quite different from the methods used with transcripts. The method of choice these days is to use a mass spectrometer to measure the mass of the protein or, more typically, the masses of fragments of the protein. It turns out that most amino acids have distinct masses (leucine and isoleucine are the exception, with identical masses), ranging from a low of 57.0519 mass units for glycine to a high of 186.2132 mass units for tryptophan.
It also turns out that each post-translational modification changes the mass of the protein by a well-defined amount; for example, phosphorylation increases the mass by 79.9799 units. If you know the sequence of a protein, and if there are no post-translational modifications (a big if, but see below), you can calculate the mass of the protein by simply adding up the masses of its amino acid residues, plus one water molecule (about 18.02 units) to account for the two free ends of the chain.
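To make the arithmetic concrete, here is a minimal Python sketch of that calculation. The residue masses are the standard average values (the glycine figure matches the one quoted above), and the phosphorylation shift is the one given in the text. The names `protein_mass` and `n_phospho` are illustrative, not part of any published tool, and the single water added for the chain's free ends is standard chemistry rather than something spelled out in this column.

```python
# Standard average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153    # one water for the free N- and C-terminal ends (assumption noted above)
PHOSPHO = 79.9799  # mass added by one phosphorylation, as quoted in the text


def protein_mass(sequence, n_phospho=0):
    """Sum the residue masses, add one water, then add any phosphate groups."""
    residues = sum(RESIDUE_MASS[aa] for aa in sequence)
    return residues + WATER + n_phospho * PHOSPHO


# Hypothetical usage: an unmodified and a singly phosphorylated toy peptide.
print(round(protein_mass("GASK"), 4))
print(round(protein_mass("GASK", n_phospho=1), 4))
```

A real tool would also handle nonstandard residues and the choice between average and monoisotopic masses; this sketch ignores both.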
A simple protein identification method is to use a mass spec to measure the mass of the protein of interest, compare the measured mass to a list of the masses of all known proteins, and find the best match. This won’t work in a big genome, because too many proteins have similar masses.
The solution is to cut the protein into fragments with an enzyme that cuts at specific amino acids, measure the masses of the fragments, and compare these to a list of computed masses for computationally cut protein fragments. This method, called mass fingerprinting, works a whole lot better, because even in large genomes, few proteins have similar masses for very many fragments.
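A rough sketch of the fingerprinting idea in Python follows. It assumes a trypsin-like cutting rule (cut after K or R unless the next residue is P), which is a common choice but not one this column specifies. The function names, the 0.5-unit tolerance, and the two toy sequences are all made up for illustration. The tools at the sites mentioned below use considerably more refined scoring.

```python
# Standard average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153


def peptide_mass(peptide):
    """Mass of an unmodified peptide: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def tryptic_digest(sequence):
    """Cut after K or R, except when the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def fingerprint_score(measured, sequence, tol=0.5):
    """Count measured masses that match some computed fragment within tol."""
    computed = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(any(abs(m - c) <= tol for c in computed) for m in measured)


def best_match(measured, database, tol=0.5):
    """Return the name of the database entry with the most matching fragments."""
    return max(database, key=lambda name: fingerprint_score(measured, database[name], tol))


# Toy database with invented sequences; the "measured" masses are simulated.
toy_db = {"protA": "GASKLLVRWYK", "protB": "MKTAYRGGHK"}
measured = [peptide_mass(p) for p in tryptic_digest(toy_db["protA"])]
print(best_match(measured, toy_db))
```

Counting matching fragments is the simplest possible score. It is enough to show why several fragment masses discriminate better than one whole-protein mass.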
Post-translational modifications complicate the process considerably. When comparing the measured fragment masses to the list of computed masses, the software has to take into account the possibility of various post-translational modifications.
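One simple way to take modifications into account, sketched here for phosphorylation only, is to compare each measured mass against the unmodified fragment mass plus zero, one, two, or more phosphate groups. The sketch assumes that phosphates attach only to serine, threonine, and tyrosine, which covers the common cases but is a simplification. The function name `phospho_count` and the tolerance are hypothetical.

```python
# Standard average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153
PHOSPHO = 79.9799  # mass shift per phosphate group, as quoted in the text


def peptide_mass(peptide):
    """Mass of an unmodified peptide: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def phospho_count(measured, peptide, tol=0.5):
    """Return how many phosphates explain the measured mass, or None.

    Tries 0..n phosphates, where n is the number of S, T, and Y residues
    (assumed here to be the only phosphorylation sites).
    """
    sites = sum(peptide.count(aa) for aa in "STY")
    base = peptide_mass(peptide)
    for n in range(sites + 1):
        if abs(measured - (base + n * PHOSPHO)) <= tol:
            return n
    return None


# Hypothetical usage: a fragment measured about 80 units heavier than expected.
observed = peptide_mass("GASK") + PHOSPHO
print(phospho_count(observed, "GASK"))
```

Every additional modification type multiplies the number of candidate masses per fragment, which is where the extra complexity in the matching comes from.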
You can find software that implements these ideas and numerous variations thereof at several Web sites. These include the ExPASy site, the PROWL site at Rockefeller University, and the PeptideSearch site at the European Molecular Biology Laboratory in Heidelberg.
All of this adds up to a lot of work when it comes to creating a proteomics database. We need to represent, as first-class data objects, the multiple states, shapes, forms, and locations that each protein can take. We also need to represent complexes with the same rigor as individual proteins. In most cases, the available data will only specify a few of these attributes, and the database will have to cope gracefully with missing information.
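As a thought experiment, here is one way those requirements might look as a data model in Python. It is a sketch under my own assumptions, not a proposed schema. Every class and field name is hypothetical, and `None` stands in for "not known" so that a record can carry only the attributes the data actually supports.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass(frozen=True)
class Modification:
    kind: str       # e.g. "phosphorylation"
    position: int   # 1-based residue index


@dataclass
class ProteinForm:
    """One observed state of a protein; None means the attribute is unknown."""
    protein_id: str
    sequence: Optional[str] = None       # precursor and mature forms may differ
    conformation: Optional[str] = None   # a label or pointer into a structure database
    location: Optional[str] = None       # e.g. "nucleus", "cell membrane"
    # None = not known; an empty list = known to be unmodified.
    modifications: Optional[List[Modification]] = None

    def known_attributes(self):
        """Names of the attributes that actually carry information."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


@dataclass
class Complex:
    """A complex is a first-class object made up of protein forms."""
    name: str
    members: List[ProteinForm] = field(default_factory=list)


# Hypothetical usage: a record where only the location is known.
tf = ProteinForm("TF-1", location="nucleus")
print(tf.known_attributes())
complex_ = Complex("transcription initiation complex", [tf])
print(len(complex_.members))
```

The design choice worth noticing is the distinction between "unknown" and "known to be empty" for modifications, since most real records will leave several attributes unspecified.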
Missing information is an issue throughout bioinformatics (I touched on this in October), but it will be a more pervasive problem in proteomics for the simple reason that there’s a lot more information we want to know about proteins. The simple sequence data models that served us so well in the staid worlds of genomics and transcriptomics just won’t cut it in the frontier of proteomics.
In the near term, most proteomics projects will only deal with a subset of these complications. This will give folks some time to figure out which aspects of the problem are most important, and to develop good ways to represent the relevant information.
Proteomic informatics is going to be a hard problem. For now, you’re going to have to blaze your own trail. The good news is that if you do something great, it will have a big impact on those who follow in your footsteps.
Picks and Shovels for Proteomics Pioneers
ExPASy (Swiss Institute of Bioinformatics)
PROWL (Rockefeller University)
PeptideSearch (European Molecular Biology Laboratory)
Amos Bairoch’s Links
O-GlycBase (The Technical University of Denmark)
PhosphoBase (The Technical University of Denmark)