Nat Goodman spends a philosophical day at the beach
Sitting at the water’s edge on a hot summer day, refreshing drink in hand, my thoughts drift lazily to the wonder of life and the molecular pathways that make it possible. The children frolicking in the waves bring to mind the proteins and other molecules whose carefree play in our bodies gives rise to the pathways of life. No one tells the children or the molecules what to do. They just do it and life unfolds.
The scientific challenge is to find the patterns in this haphazard fun, to learn which molecules play nicely and which ones fight, and ultimately provide step-by-step explanations of diseases and other phenomena.
It seems obvious that informatics will take a leading role in all of this, but it hasn’t happened yet. It’s not for want of trying. Pathway-related websites are swarming like annoying biting bugs on the beach, but most are pretty weak.
I thought this month we could take a look at what pathways are about and see what those websites do. Grab a beach chair, slather on the sunscreen, and arm yourself with bug spray. Here we go.
A pathway is a step-by-step, mechanistic description of a molecular process. Let’s look at an example from Huntington’s Disease, a fatal neurologic disease caused by an expanded CAG repeat in the huntingtin gene. This mutation gives rise to an expanded poly-glutamine repeat in the Htt protein. No one really knows how the HD mutation leads to disease and death, but one hypothesized pathway goes like this:
Mutant Htt is cut by caspase-3, and possibly other proteases, generating a short fragment that contains the expanded poly-glutamine repeat. This reaction occurs in the main body of the cell, called the cytoplasm. The fragment travels to the nucleus of the cell either by diffusion or through some unknown active transport mechanism.
Once in the nucleus, the Htt fragment binds to CREB binding protein (CBP), an important transcription factor, and may also bind to other such factors. The bound proteins form clumps, called aggregates, similar to those seen in Alzheimer’s, Lou Gehrig’s, mad cow, and many other neurologic diseases. Transcription factors stuck in these clumps cannot regulate their normal target genes, and the cell is unable to turn the correct genes on or off at the correct times.
This mistake causes the cell to malfunction in unknown ways and ultimately triggers programmed cell death through mechanisms that are also unknown. Though not illustrated in the example, pathways can include branching, merging, parallel paths, and cycles.
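For readers who think in code, the hypothesized HD pathway can be written down as an ordered list of steps. This is only a minimal sketch: the `Step` record, its field names, and the step labels are made up for illustration and do not come from any pathway database. Unknown mechanisms are recorded as unknown, not filled in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One mechanistic step: inputs become outputs somewhere in the cell."""
    inputs: tuple
    outputs: tuple
    location: str
    note: str = ""


# Hypothetical encoding of the hypothesized HD pathway described above.
hd_pathway = [
    Step(("mutant Htt", "caspase-3"), ("Htt fragment", "caspase-3"),
         "cytoplasm", "cleavage; other proteases may also cut Htt"),
    Step(("Htt fragment",), ("Htt fragment",),
         "cytoplasm to nucleus", "diffusion or unknown active transport"),
    Step(("Htt fragment", "CBP"), ("Htt-CBP aggregate",),
         "nucleus", "other transcription factors may also bind"),
    Step(("Htt-CBP aggregate",), ("misregulated genes",),
         "nucleus", "trapped factors cannot regulate their targets"),
    Step(("misregulated genes",), ("programmed cell death",),
         "cell", "mechanism unknown"),
]


def is_linear_chain(steps):
    """True if every step consumes something the previous step produced."""
    return all(set(a.outputs) & set(b.inputs)
               for a, b in zip(steps, steps[1:]))


def describe(steps):
    """Render each step as a numbered, human-readable line."""
    return [f"{i}. {' + '.join(s.inputs)} -> {' + '.join(s.outputs)}"
            f" [{s.location}]"
            for i, s in enumerate(steps, 1)]


if __name__ == "__main__":
    print("\n".join(describe(hd_pathway)))
```

A flat list is enough for this example because the pathway is a simple chain. Branching, merging, parallel paths, and cycles would call for a graph of steps instead.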
What’s really going on in nature is more complicated than the pathway description would suggest. Proteins are running and jumping, throwing balls and kicking sand, holding hands and letting go in the chemical ocean inside our bodies. A pathway is a frail human effort to ascribe some order to this chaos. “Charlie hit Davie ’cuz Bob and Barb splashed Charlie ’cuz Alice wouldn’t throw the ball to Bob ’cuz Davie teased Alice ’cuz …”
Combing the protein shore
Biologists categorize pathways like beachcombers catalogue seashells. There are a number of special ones.
Metabolic pathways are one important, and well-studied, case. These pathways convert the hot dogs and ice cream we eat at the beach into the “stuff” our cells need — starches, fats, amino acids, energy carriers, and other simple compounds. Unlike our HD example, metabolic pathways use proteins only as catalysts, called enzymes, and do not modify or consume them.
Signaling pathways are another important case. These pathways transfer and process information, telling us when it’s time to get that hot dog. A common scenario involves the transfer of information from a cell’s external environment to the nucleus where the information is used to turn genes on or off. There is a nice discussion of common themes in such pathways in a recent Nature paper by Julian Downward.
Information moves from receptors on the cell surface to transcription factors in the nucleus through a series of reactions that change the states of proteins along the path. These reactions typically involve the attachment of chemical groups, called phosphates, to specific sites on the proteins. Along the way, the pathway can “compute a function” on the signal, for example by requiring that signals from two receptors be present in order to activate the next protein along the path.
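The "compute a function" idea can be sketched in a few lines. The toy model below assumes a two-receptor AND gate feeding a short phosphorylation relay; the function names and the relay length are invented for illustration, and real signaling is of course neither Boolean nor this tidy.

```python
def gate_active(receptor_a_on: bool, receptor_b_on: bool) -> bool:
    """The first protein on the path is activated only if BOTH signals are present."""
    return receptor_a_on and receptor_b_on


def run_relay(receptor_a_on: bool, receptor_b_on: bool, relay_length: int = 3):
    """Return the phosphorylation state of each protein along the relay.

    Each protein is phosphorylated only if the one before it is, so the
    gate's decision is carried unchanged toward the nucleus.
    """
    states = [gate_active(receptor_a_on, receptor_b_on)]
    for _ in range(relay_length - 1):
        states.append(states[-1])
    return states


if __name__ == "__main__":
    for a in (False, True):
        for b in (False, True):
            print(a, b, run_relay(a, b))
```

Only the case with both receptors switched on sends a signal down the relay; the other three combinations leave every protein unphosphorylated.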
A genetic network is a pathway in which a signaling pathway turns on or off some genes that then regulate other genes or affect some other pathway of interest. A commonly studied situation is a signaling pathway that controls a metabolic pathway by regulating the genes that make the proteins that catalyze the metabolic reactions.
Oceans of data, deep-sea problems
Each step in a pathway represents a conclusion drawn from one or more experiments. A pathway as a whole may summarize years of research from dozens or hundreds of scientists.
The articulation of a pathway is itself a creative scientific act. Our example pathway comes from an excellent survey paper by Jang-Ho Cha augmented with recent results on the role of CBP from Christopher Ross.
A pathway represents biological knowledge. It’s not mere data. A pathway is more like an entry in OMIM than it is like a sequence in GenBank.
Some pathways, especially those from classic metabolism, are so well established as to be fixtures. But most pathways of active interest are highly speculative and will certainly change.
People often mention protein-protein interaction data in the same breath as pathways. Such data can be generated in copious quantities using remarkable high-throughput technologies. Reminiscent of Incyte’s business model during the heyday of ESTs, several companies are using these technologies to produce large, proprietary databases for resale. Examples include AxCell, Hybrigenics, and Myriad.
A big problem with this type of data is that “interactions” are not the same as “reactions.” Current high-throughput technologies only measure pair-wise interactions, and totally miss any reactions that require more than two participants, such as the formation of molecular complexes. In addition, these methods usually work on short protein fragments, such as individual domains, and totally lose the important effects of protein context.
A really glaring problem is that interaction data don’t tell you what happens when the proteins interact: for example, which proteins change state. Indeed, the whole issue of protein state is largely ignored in interaction datasets, with most of the data collected on protein fragments in their native states.
Finally, current high-throughput techniques have high error rates, both false positive and false negative.
For all these reasons, you have to take interaction data with a large dose of saltwater. Just because you see an interaction does not mean the proteins participate in a real reaction.
If you see a linked series of interactions — protein Alice plays with Bob who plays with Charlie who plays with Davie — it may suggest that these guys are in the same playgroup, er, I mean pathway, but it does not mean there’s a pathway connecting Alice and Davie. If Bob and Charlie truly react, the effect will likely be to change the state of one or the other, say Bob. But the data connecting Alice to Bob was generated from Bob’s native state, not his changed one.
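The playgroup argument can be made concrete with a toy dataset. In the sketch below each interaction record is tagged with the state each partner was in when it was measured. The records and the "native" and "modified" labels are hypothetical, as is the assumption that the Bob and Charlie reaction changes Bob.

```python
# Hypothetical interaction records. Each partner carries the state it was
# in when the interaction was measured; here everything is native.
OBSERVED = {
    frozenset({("Alice", "native"), ("Bob", "native")}),
    frozenset({("Bob", "native"), ("Charlie", "native")}),
    frozenset({("Charlie", "native"), ("Davie", "native")}),
}


def observed_interaction(p, p_state, q, q_state):
    """True if this pair was measured with the partners in exactly these states."""
    return frozenset({(p, p_state), (q, q_state)}) in OBSERVED


def chain_supported(chain, states):
    """True only if every adjacent pair in the chain was observed
    with both partners in their current states."""
    return all(observed_interaction(a, states[a], b, states[b])
               for a, b in zip(chain, chain[1:]))


chain = ["Alice", "Bob", "Charlie", "Davie"]
all_native = {name: "native" for name in chain}
# Assumption for illustration: reacting with Charlie modifies Bob.
after_reaction = dict(all_native, Bob="modified")

if __name__ == "__main__":
    print(chain_supported(chain, all_native))      # every link measured
    print(chain_supported(chain, after_reaction))  # modified Bob never measured
```

With everyone native the chain looks complete. Once Bob is modified, nothing in the dataset says whether he still plays with Alice, which is exactly the gap in the data.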
Interaction data seem destined to become the ESTs of proteomics: a voluminous source of crummy data! I’m sure that high-throughput interaction data (like ESTs before them) will be extremely valuable as starting points for laboratory studies. But I bet that computational analyses of these datasets will end up producing nothing more lasting than an elaborate sandcastle.
Pearl in the Website sand
Most pathways sites are essentially repositories of pathway diagrams and descriptions. These sites let you browse a list of pathways, search for a pathway by name, or search for pathways containing specific proteins or small molecules. What you get back is a pathway diagram.
In some sites, the pathway diagrams are hand-drawn cartoons like the ones that decorate journal articles, while in other sites they are computer generated. The diagrams are usually clickable. For metabolic pathways, you can click on compounds or reactions. For signaling and other protein-oriented pathways, you can only click on proteins, not reactions. Clicking on a protein takes you to the usual sequence-oriented sites. There seems to be no way to learn about the reactions.
Most sites focus on well-known textbook pathways. None of the major sites support pathways from active research areas, such as our HD example, in which incomplete information is a major concern.
Support for metabolic pathways is far more advanced than for other sorts. There are several new sites focused on signaling and other non-metabolic pathways, but they don’t contain much data.
KEGG is the gold standard of pathway websites. It was one of the first to be established, and is one of the few to have been actively maintained over the years. Many other sites grab most of their data from KEGG.
There is a growing number of protein-protein interaction sites, both public and proprietary.
The public sites contain rather small datasets culled from the literature. These sites generally let you browse or search for a protein of interest, then present a list of interacting partners.
Many sites let you visualize the interactions graphically using a mesmerizing “dancing squares” display. Everyone uses the same display — there must be some free Java code that implements it. It’s very cool to watch, but actually contains little useful information. A text list of interactions is probably more useful.
A few sites provide software tools to search for paths in the interaction data; for example, to discover whether two proteins of interest are connected by a sequence of interactions. I question the utility of such tools, as the existence of a path does not come close to implying the existence of a pathway.
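A path search of this kind is easy to sketch, which is part of why it is so tempting. The example below uses a plain breadth-first search over a hypothetical list of pairwise interactions; it is not the algorithm of any particular site, and the protein names are the beach kids from earlier.

```python
from collections import deque


def find_path(interactions, start, goal):
    """Return the shortest chain of interactions linking start to goal,
    or None if the two proteins are not connected."""
    # Build an undirected graph from the pairwise interactions.
    graph = {}
    for a, b in interactions:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if start not in graph or goal not in graph:
        return None

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph[path[-1]]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical pairwise interaction data.
pairs = [("Alice", "Bob"), ("Bob", "Charlie"), ("Charlie", "Davie")]

if __name__ == "__main__":
    print(find_path(pairs, "Alice", "Davie"))
```

The search happily reports a path from Alice to Davie, but it knows nothing about protein state, reaction outcomes, or error rates. A path in the graph is a hint worth a laboratory look, not a pathway.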
Watching the tide
Today’s pathway websites are passive repositories that record only the most stable pathways. Useful, I suppose, but about as interesting as watching the tide going out. More exciting would be databases that could chronicle the competition among alternative pathways, and track the changing consensus.
Software for simulating pathways would also be quite handy. I hope to review this area — including the products from Physiome and Entelos — in a future article. It would also be great to have software that can infer pathways from data. This is an active research area that looks really difficult.
In the meantime, we might as well enjoy the beach and describe the kids and pathways the old-fashioned way, with thoughtful words instead of software.
Boehringer Mannheim Biochemical Pathways
Cell Signaling Networks Database (CSNDB)
Encyclopedia of E. coli Genes and Metabolism (EcoCyc)
Kyoto Encyclopedia of Genes and Genomes (KEGG)
Metabolic Pathways of Biochemistry
Roche Apoptosis Pathway
Signaling Pathway Database (SPAD)
Biomolecular Interaction Network Database (BIND)
Biomolecular Relations in Information Transmission and Expression (BRITE)
Database of Interacting Proteins (DIP)
CHA PAPER: Transcriptional dysregulation in Huntington’s disease. Jang-Ho Cha. Trends in Neurosciences Vol. 23, Number 9, Sep 2000, pp. 387-392
DOWNWARD PAPER: The ins and outs of signaling. Julian Downward. Nature Vol. 411, 14 Jun 2001, pp. 759-762
ROSS PAPER: Interference by Huntingtin and Atrophin-1 with CBP-Mediated Transcription Leading to Cellular Toxicity. Frederick C. Nucifora Jr., Masayuki Sasaki, Matthew F. Peters, Hui Huang, Jillian K. Cooper, Mitsunori Yamada, Hitoshi Takahashi, Shoji Tsuji, Juan Troncoso, Valina L. Dawson, Ted M. Dawson, Christopher A. Ross. Science Vol. 291, Number 5512, 23 Mar 2001, pp. 2423-2428
Protein Pathway Players