IT Guy Nat Goodman takes a hike on the major pathway databases and, in his wanderings, has some time to reflect on what a database should do.
I’m lost. I’m standing at the cell membrane and want to travel along the famous MAPK cascade to the nucleus. I had the good sense to bring along a map — actually, several — from pathway databases around the Web, but I’m still lost. One map says to get on the highway at GRB2, go through MAPK, and get off at Elk1 or p85rsk. A second shows the road splitting just after Ras, with one fork passing through ERK before getting to Elk-1, and the other heading through the scenic town of JNK before hitting c-Jun. Another shows GRB2 bound to SHC at the entrance ramp, ERK1 and ERK2 just before the nucleus, and exits at RSK2 and MAPKAPK1C among others; on this map Elk-1 is just an alternate exit reached by cross talk with other pathways. Yet another has a split after Ras but then shows the forks passing through ERK1 and ERK2 before ending up at places like MNK1, MSK1, and p90-Rsk. My guidebook of genes and proteins, LocusLink, doesn’t offer much help, because many of these molecules aren’t even listed, at least not by the names used on the maps.
I’m sure it would all make sense if I already knew where I was going, but then why would I need a map? I should probably ask that nice biologist over there for directions, but I suffer from Y chromosome disease, so I’ll keep looking at my maps and wandering around until I figure it out. Or not.
There are many pathway databases and websites to choose from, and they come in all shapes and sizes.
Some are big, organized public repositories. The premier example is Kyoto University’s Encyclopedia of Genes and Genomes (KEGG), which has served the community since the mid-’90s and has more than 10,000 pathways. Another well-known example is Peter Karp’s EcoCyc, which seeks to represent all known metabolic and regulatory pathways of E. coli. EcoCyc has expanded into a more comprehensive database, BioCyc, which has 477 pathways from a variety of species, mostly bacterial.
Others are run by reagent vendors and showcase products as well as pathways: kind of like those Internet driving direction sites that tell you which motels are along your route. BioCarta is the leader here with 285 pathways. Another example is the Apoptosis Special Interest Site operated by Roche Diagnostics.
Others are run by individual laboratories and present information on a few pathways studied by the investigator. An excellent example is James Woodgett’s website of mammalian MAPK signaling pathways.
A few are essentially electronic journals, or supplements to textbooks. Examples include Science magazine’s Signal Transduction Knowledge Environment (STKE), and Donald Nicholson’s Minimaps. Nature’s Alliance for Cellular Signaling Gateway sounds like it should belong in this group, but it has no pathway content at present.
Some are collections of diagrams from published papers, including Kevin Becker’s Biological Biochemical Image Database. One, the online version of the Boehringer Mannheim wall chart, is an icon in its own right.
Two are repositories for models expressed in particular mathematical formalisms, namely CellML and SBML. The CellML repository is much larger, with about 125 models compared to SBML’s 18.
I also found a lot of dead or moribund sites on people’s lists of interesting pathway links. I suggest taking a look at the “last modified” dates on sites you come across.
People sometimes mention protein interaction databases in the same breath as pathways. The leading academic databases here are the Biomolecular Interaction Network Database, Biomolecular Relations in Information Transmission and Expression, Database of Interacting Proteins, and Molecular INTeraction database.
Three groups are attempting to establish standards for pathway data. CellML and SBML are mathematically oriented standards driven by the needs of simulation and analysis. BioPAX is concerned with knowledge representation and strives to capture the biological nuance of pathways. I’m told these groups are talking to each other and hope to converge on a common standard someday. Yeah, right! That’ll happen about as soon as I break down and ask for directions. There is also an interest group, the BioPathways Consortium, that organizes meetings and such.
Commercial software efforts are largely focused on mathematical modeling and simulation. The major vendors are Entelos and Physiome, with Gene Network Sciences nipping at their heels. A few new companies have entered the market recently: Cellnomica, Genomatica, and Kenna Technologies. For the simpler task of drawing pathway diagrams, BioCarta provides free Freehand templates, while Stratagene sells a product for doing more elaborate graphical layout and editing of pathways.
I looked at one pathway — the well-known MAPK cascade — in several databases to get a concrete sense of how the sites differ. I wanted to use a signaling pathway, since that’s where the action is nowadays. I chose the MAPK cascade because it’s complex, a lot is known about it, and it was present in several databases.
I compared this pathway across four sites: the Woodgett laboratory website of MAPK pathways, KEGG, BioCarta, and Science STKE. The pathway is also present in the CellML and SBML repositories, but I had no good tools for looking at them in this form.
The MAPK cascade is actually a family of pathways that share a common structure. Signals, called mitogens, arrive at the cell membrane and are detected by a class of molecules called receptor tyrosine kinases. Some pre-processing occurs, perhaps to reject errant signals or stabilize the input. The signal then propagates through a series of kinases until it reaches the nucleus. The specific kinases vary from one pathway to the next, but they are grouped into classes whose names reflect their position in the series: MAP4K, MAP3K, MAP2K, and MAPK. Once in the nucleus, the signal can activate a variety of transcription factors, thus regulating the transcription of different genes.
The Woodgett site presents a very high-level diagram backed by a detailed, yet lucid, text explanation. I quickly adopted this site as a high-level guide to orient the other maps.
KEGG provides more detailed diagrams separated by organism (human, fly, and yeast). It also presents information in tables, including lists of homologs for the main classes of molecules in the pathway. It offers no text explanations. I doubt that I could have made any sense of the KEGG maps without the guidance of the Woodgett site.
BioCarta’s diagram is even more detailed than KEGG’s, but balances the detail with a measure of abstraction (see p. 46). It organizes the diagram along two dimensions: specific pathways occupy different vertical regions of the diagram and are color-coded, and each stage of the cascade occupies a different horizontal slice. Each class of molecules is shown as a shaded oblong enclosing a list of members, in contrast to the separate tables used by KEGG. There is a brief textual explanation, which is better than nothing, but much less informative than the material on the Woodgett site.
Science STKE presents separate diagrams for the three major types of MAPK cascades. The level of detail is comparable to BioCarta, but there is no attempt at abstraction. The diagram includes regulatory details (negatively acting feedback loops, for example) that are completely ignored in the other sites.
To my surprise, the Science site provides no text description of the pathway. Early on, it seemed that they were aiming to provide high-quality perspectives and reviews of pathways, but presently, the site provides such material for only 14 of its 48 pathways.
The user interface is perversely ill-designed. When you put your mouse over a molecule or reaction, details appear in a text box at the top of the diagram; if the diagram is too big to fit on your computer screen, the text box is not visible and you have no way to see the information. The diagram occupies a fixed amount of space on the screen; enlarging your screen beyond that size doesn’t increase the amount of information that’s visible. These problems make the site really hard to use in its current form.
The sites used different names for the same proteins and classes of proteins. In a great many cases, the names were not the ones currently listed by LocusLink as the official gene symbols. This inconsistency makes it hard to compare content by visually inspecting the sites. And since none of the sites (except KEGG) provides a convenient way to download data, it’s tough to do the usual trick of writing a little Perl script to convert names to a common vocabulary.
To compare the content, I manually translated the protein names into their official symbols where possible, and then related everything back to the Woodgett map. The pathways are broadly consistent, but there are many detailed differences. I have no clue which is right. Perhaps all are right under some circumstances. Given this state of affairs, it’s hard to take any site as a definitive source.
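For readers who do manage to get member lists out of a site, the comparison I did by hand can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions: the alias table holds a handful of well-established synonyms, and the two member lists (`MAP_ONE`, `MAP_TWO`) are invented for illustration and are not taken from any of the sites above.

```python
# Illustrative alias table: names as drawn on a map -> official gene symbols.
# Only a few well-established synonyms are shown; a real table would be far larger.
OFFICIAL_SYMBOL = {
    "GRB2": "GRB2",
    "ERK1": "MAPK3",
    "ERK2": "MAPK1",
    "Elk1": "ELK1",
    "Elk-1": "ELK1",
}

# Hypothetical member lists, made up for this example.
MAP_ONE = ["GRB2", "ERK1", "Elk1"]
MAP_TWO = ["GRB2", "ERK1", "ERK2", "Elk-1", "p90-Rsk"]


def to_symbols(names):
    """Translate names to official symbols; set aside names with no known symbol."""
    translated, unknown = set(), set()
    for name in names:
        symbol = OFFICIAL_SYMBOL.get(name)
        if symbol is None:
            unknown.add(name)
        else:
            translated.add(symbol)
    return translated, unknown


def compare(map_a, map_b):
    """Report which translated symbols the two maps share and which are unique."""
    a, _ = to_symbols(map_a)
    b, _ = to_symbols(map_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}
```

Calling `compare(MAP_ONE, MAP_TWO)` separates the molecules the two maps agree on from the ones only one of them lists, and `to_symbols` reports the names that could not be translated at all, which is where the manual work starts.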
Fork in the Road
I’m still lost. Even after translating the maps to the same language and reading the explanations, they’re still different. Yes, they’re describing the same general route, but the details are so different that it seems unwise to use any of these as a definitive data source for further analysis.
Many new analytical methods are being developed that depend on knowing which genes are in which pathways; for example, methods that cluster microarray gene-expression data and then correlate the clusters with pathways. Such methods only make sense if the pathway databases are dependable. We’re not there yet.
Blazing a Path to the Perfect Database
No pathway site does it all. By adding up their capabilities, we can get a sense of everything we might want a pathway database to do.
Like all databases, the system must provide effective ways to search for information of interest. It should be possible to search by biological function, e.g., to find pathways involved in cell division or apoptosis or whatever, using the Gene Ontology or MeSH to specify the function. It should also be possible to search by molecule or type of molecule, e.g., to find pathways that include protein MAPK3 (mitogen-activated protein kinase 3) or more broadly, any MAP kinase. For this to be really useful, the system needs a good dictionary of biomolecules and classes of biomolecules that can cope with the numerous alternate names, spellings, and abbreviations that pervade the pathway world.
For example, the system should know that MAP2K, MAPKK, MAP kinase kinase, and mitogen-activated protein kinase kinase are all different names for a class of proteins that phosphorylate MAPKs. It also needs to be fluent in translating names between organisms, since pathways often combine data collected across different organisms.
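A minimal sketch of such a dictionary, in Python, might look like the following. The four names come straight from the example above; the normalization rule (ignore case, spaces, hyphens, and underscores) and the function names are my own assumptions, and a real dictionary would need far more entries plus cross-organism mappings.

```python
import re

# Class name -> the alternate names listed in the text above.
CLASS_ALIASES = {
    "MAP2K": [
        "MAP2K",
        "MAPKK",
        "MAP kinase kinase",
        "mitogen-activated protein kinase kinase",
    ],
}


def normalize(name):
    """Lowercase and drop spaces, hyphens, and underscores so spelling variants collide."""
    return re.sub(r"[\s\-_]+", "", name).lower()


# Reverse index: normalized alias -> class name.
LOOKUP = {
    normalize(alias): cls
    for cls, aliases in CLASS_ALIASES.items()
    for alias in aliases
}


def molecule_class(name):
    """Return the class for a name, or None if the dictionary does not know it."""
    return LOOKUP.get(normalize(name))
```

With this in place, `molecule_class("MAP Kinase Kinase")` and `molecule_class("mapkk")` both resolve to the same class, while a name the dictionary has never seen comes back as `None` rather than a guess.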
It would be great to be able to search for pathways in relation to a given one, e.g., to find pathways that are downstream of a given MAPK cascade. Another neat tool would be a BLAST analog that takes a fragment of a pathway and finds ones that match, e.g., to find all pathways that include a kinase cascade.
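To make these two searches concrete, here is a rough Python sketch that treats a pathway as a directed graph. Everything in it is an assumption for illustration: the toy pathway is a textbook simplification rather than data from any of the sites discussed, `downstream` works at the level of molecules rather than whole pathways, and `matches_fragment` does only exact matching on a chain of molecule classes, which is far cruder than anything deserving the BLAST name.

```python
from collections import deque

# Toy pathway for illustration only: who signals to whom.
EDGES = {
    "Ras": ["Raf"],
    "Raf": ["MEK"],
    "MEK": ["ERK"],
    "ERK": ["Elk-1"],
}

# Class of each molecule in the toy pathway.
NODE_CLASS = {
    "Ras": "GTPase",
    "Raf": "MAP3K",
    "MEK": "MAP2K",
    "ERK": "MAPK",
    "Elk-1": "transcription factor",
}


def downstream(edges, start):
    """Return every molecule reachable from start by following edges."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def matches_fragment(edges, node_class, pattern):
    """True if some chain of consecutive steps has exactly the classes in pattern.

    Assumes pattern is a non-empty list of class names.
    """
    def extend(node, index):
        if node_class.get(node) != pattern[index]:
            return False
        if index == len(pattern) - 1:
            return True
        return any(extend(nxt, index + 1) for nxt in edges.get(node, ()))

    return any(extend(start, 0) for start in node_class)
```

A database could answer "what lies downstream of this cascade" by intersecting the result of `downstream` with the members of its other pathways, and could answer "which pathways contain a kinase cascade" by running `matches_fragment` with the pattern `["MAP3K", "MAP2K", "MAPK"]` over each stored pathway.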
There should also be a way to browse the database, so you can find information even if you don’t know what it contains. The alternative is to force new users to hunt for data by trying query after query. “Gee, I wonder if the database has apoptosis… No … How about a MAPK cascade? … I wonder how they spell it…” Painful.
The system should provide output in several forms. Diagrams are de rigueur. Text descriptions are also critical to explain what the diagram means and to get into the biology of the situation. And like all databases, there should be a structured form of output that can be fed into other software.
Diagrams are not a panacea. For simple pathways, they give a good overview, but as pathways get more complex, the pictures become inscrutable bowls of spaghetti. The solution is to provide an abstraction mechanism to selectively hide detail. The idea is to represent a group of reactions as a black box, and to depict a complex pathway as a collection of interconnected black boxes. For example, one might represent a MAPK cascade as three black boxes: an input stage that receives a signal at the cell membrane with processing to assure accurate reception, a signaling cascade that transmits the signal to the nucleus, and an output stage that activates appropriate transcription factors. Naturally, there should also be a way to open up black boxes on demand. This should sound familiar to programmers and computer hardware designers who lived through the era of visual programming and design tools.
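The black-box idea can be sketched in Python as a mapping from molecules to boxes, plus two small functions: one that collapses the detailed diagram and one that opens a box back up. The three box names follow the example above; the step names, the `BOX_OF` mapping, and the function names are hypothetical and chosen only for illustration.

```python
# Detailed diagram as a list of (source, target) steps, using generic stage names.
EDGES = [
    ("receptor", "adaptor"),
    ("adaptor", "MAP3K"),
    ("MAP3K", "MAP2K"),
    ("MAP2K", "MAPK"),
    ("MAPK", "transcription factor"),
]

# Which black box each molecule belongs to.
BOX_OF = {
    "receptor": "input stage",
    "adaptor": "input stage",
    "MAP3K": "signaling cascade",
    "MAP2K": "signaling cascade",
    "MAPK": "signaling cascade",
    "transcription factor": "output stage",
}


def collapse(edges, box_of):
    """Hide detail: keep only the connections that cross from one box to another."""
    collapsed = set()
    for source, target in edges:
        a = box_of.get(source, source)  # unboxed molecules stand for themselves
        b = box_of.get(target, target)
        if a != b:
            collapsed.add((a, b))
    return collapsed


def open_box(edges, box_of, box):
    """Show detail on demand: return the steps that lie entirely inside one box."""
    return [(s, t) for s, t in edges if box_of.get(s) == box and box_of.get(t) == box]
```

Here `collapse(EDGES, BOX_OF)` reduces five steps to two arrows between three boxes, and `open_box(EDGES, BOX_OF, "signaling cascade")` brings back the kinase steps hidden inside the middle one.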
A practical pathway database must be able to cope with the considerable variation in how much is known about different pathways. At one extreme are the classic metabolic pathways for which every step is understood in great detail, even to the point of knowing the quantitative reaction rates. At the other extreme, we may simply know some of the molecules involved in the pathway and have limited information on which ones interact with each other. Well-studied signaling pathways, such as the MAPK cascade, are in the middle: many molecules and reactions are known, but there’s little knowledge of reaction rates, regulation, and other details. The database should be able to present each pathway in as much detail as is known.
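One way to cope with this variation, sketched here in Python, is to make every detail beyond the bare interaction optional. The record layout, the field names, and the three level labels are my own assumptions and are not drawn from any existing database or standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    source: str
    target: str
    kind: Optional[str] = None             # e.g. "phosphorylation"; None if unknown
    rate_constant: Optional[float] = None  # None if no quantitative rate is known


def detail_level(interaction):
    """Classify how much is known about a single interaction."""
    if interaction.rate_constant is not None:
        return "quantitative"
    if interaction.kind is not None:
        return "qualitative"
    return "association only"
```

A classic metabolic step would carry both a kind and a rate, a typical MAPK step only a kind, and a poorly characterized interaction neither, so the same record type serves all three extremes.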
For references and links to tools and papers mentioned in IT Guy columns, visit www.genome-technology.com.
Nat Goodman, PhD, is a senior research scientist at the Institute for Systems Biology and an affiliate professor of bioinformatics at University of Alaska-Fairbanks. Send your comments to Nat at [email protected]