Wouldn’t it be nice if you could load ligand, receptor, signal transduction, gene regulation, and metabolic pathway databases into your local database — and then be able to traverse them from a ligand binding to a cell surface receptor, through a signaling pathway that turns on a gene for an enzyme that catalyzes a metabolic reaction? We’re not quite there yet, but an effort called BioPAX is making strides in turning this into a reality.
BioPAX is a community-based initiative founded to create a formal standard for the exchange and representation of biological pathways and their nuances. When there were only a handful of pathway databases, writing a few parsers was not an issue. But as the number and type of databases began to increase, researchers rallied to head off the impending integration nightmare. What emerged was BioPAX, whose ontology was created to enable integration of the increasing number of pathway databases that began mushrooming about five years ago.
The Problem with Pathways
That biological pathways are central to biomedical research is no surprise to this community; they are the scaffold upon which we build our knowledge about biological mechanisms.
Pathway data is typically divided into metabolic pathways, molecular interactions, gene regulation networks, and signaling pathways. Metabolic pathways are characterized as a series of enzyme-substrate-product reactions. Molecular interactions, such as the protein-protein interactions obtained from yeast two-hybrid experiments and used to identify the interacting components of complexes, are usually reduced to binary interactions. Gene regulation pathways, meanwhile, show interactions between transcription factors and the genes whose transcription they activate or repress. And signaling pathway representations, the most varied, range from vague and general descriptions such as ‘There’s an activation chain in which A activates B activates C’ to specific and detailed ones involving a series of complex binding reactions and protein post-translational modifications.
PathGuide, an online list of pathway resources, contains more than 200 biological pathway resources, and the list continues to grow in number as the databases grow in size. But how much of this pathway data is useful? If you’re a biologist and need to order reagents for your experiment, or you are new to the subject and want a quick learn, then a visual representation of a pathway is most helpful. It is easy to understand, informative, and, after all, a picture is worth a thousand words. On the other hand, if you are a bioinformaticist or a research scientist and want to integrate these pathways into your knowledge base, then you’d rather have the thousand words. Images can’t be read with computer programs, so they’re essentially useless when it comes to computation.
Currently, to consolidate the knowledge required for many research projects, one must extract the relevant pathway data from each database, transform it into a standard data representation, and load it into an integrated repository. If you want to navigate through databases, then your best bet — and, in fact, your only hope at the moment — is to use gene and protein IDs.
But you still have a lot of work to do if you want to integrate data from different databases, because the semantics of those databases are not the same: different terms are used to describe the same thing, or in some cases the same name is used for two things. Without a human to do the mapping, there’s no way for a computer to figure out the connections.
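The identifier-based approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source names, protein names, and identifiers (such as "ID:0001") are invented, and the hand-curated table stands in for the human mapping step. It shows how two sources that name the same protein differently can be joined once both are translated to shared identifiers.

```python
# Hand-curated mapping from each source's local names to a shared identifier.
# Every name and identifier below is hypothetical, chosen only for illustration.
NAME_TO_ID = {
    ("source_a", "kinase X"): "ID:0001",
    ("source_b", "KinX"): "ID:0001",       # same protein under a different name
    ("source_a", "receptor Y"): "ID:0002",
    ("source_b", "enzyme Z"): "ID:0003",
}

def normalize(source, records):
    """Translate (name, name) interaction pairs into unordered pairs of shared IDs."""
    edges = set()
    for a, b in records:
        edges.add(frozenset({NAME_TO_ID[(source, a)], NAME_TO_ID[(source, b)]}))
    return edges

def neighbors(node, edges):
    """Return every identifier that shares an interaction with `node`."""
    return {other for edge in edges if node in edge for other in edge - {node}}

# Each source knows only part of the picture.
source_a = [("receptor Y", "kinase X")]
source_b = [("KinX", "enzyme Z")]

# Aggregation: the union of the normalized interaction sets.
merged = normalize("source_a", source_a) | normalize("source_b", source_b)

# The kinase now links the receptor (from source_a) to the enzyme (from source_b).
linked = neighbors("ID:0001", merged)
```

The join works only because a person filled in NAME_TO_ID; nothing in the code can discover that "kinase X" and "KinX" refer to the same protein.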
And that’s where BioPAX comes in. Its main goal is to create a formal representation whereby multiple types of pathway conceptualizations can be brought together into one framework. So far, only a few databases export their data in BioPAX format, allowing the aggregation of these data and queries across the datasets. Stanford has created the first public resource based entirely on BioPAX data, the Pathway Knowledge Base, which aggregates data from BioCyc, KEGG, and Reactome.
That’s one type of integration, the aggregation of data. It is based on being able to tell whether two things are the same or different by comparing their database identifiers. The mapping of those identifiers, however, is done by humans; integration based on matching identifiers doesn’t embed much in the way of semantics. For example: here’s a BioCyc pathway, here’s a KEGG pathway. Are they the same? By contrast, an example of semantic integration is the ability to infer whether things are the same based on their descriptions. One plus one may equal two, but it’s not identical to two. So is the reaction A + B ↔ C the same as C ↔ B + A?
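To make the reaction question concrete, here is a minimal Python sketch of one way to compare two reversible reactions by their descriptions rather than by their identifiers. It is not how BioPAX or an OWL reasoner works; the function name canonical_reaction and the decision to ignore both participant order and reaction direction are assumptions made for this example.

```python
from collections import Counter

def canonical_reaction(left, right):
    """Build a key for a reversible reaction that ignores order and direction.

    Hypothetical helper: each side becomes a sorted tuple of (participant, count)
    pairs, so stoichiometry is kept, and the two sides are stored in a frozenset
    so that swapping them makes no difference.
    """
    def side(names):
        return tuple(sorted(Counter(names).items()))
    return frozenset({side(left), side(right)})

r1 = canonical_reaction(["A", "B"], ["C"])   # A + B <-> C
r2 = canonical_reaction(["C"], ["B", "A"])   # C <-> B + A
r3 = canonical_reaction(["A", "C"], ["B"])   # A + C <-> B, a different reaction

same_12 = (r1 == r2)   # equal under this definition of sameness
same_13 = (r1 == r3)   # not equal
```

Whether direction should be ignored is itself a modeling decision: two databases may record the same chemistry in opposite directions for good biological reasons, and that is the kind of choice a formal representation has to make explicit.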
We would also like a reasoner to be able to infer and properly map different levels of description for the same entity. This would enable two sources with different levels of detail to be integrated, such as a database of protein-protein interactions with a database of kinases, or a database of chemical compound structures with a database of reactions that involve those compounds.
BioPAX is also wrestling with more subtle issues in biological representation, such as how to handle incomplete knowledge of mechanism, the combinatorial explosion of protein states, and ambiguous representations such as polymerization reactions.
Currently, pathway data can be exchanged and aggregated in BioPAX format, but we’re still working through representational issues of pathways in OWL in order to use an automated reasoner to determine whether two things are the same or different based on their descriptions rather than their identifiers. Furthermore, we want a reasoner to be able to point out when experts don’t agree. BioPAX will never be able to resolve the differences of opinion, of course, but it can highlight them so that someone with the right skill and initiative can undertake the experiments that would be needed to resolve the disagreement. OWL, description logic experts, reasoners, domain experts (such as the database providers), and knowledge engineers and users are all required, and that’s why BioPAX is a community effort — one that hopes to provide researchers with a pathway to pathways.
Joanne Luciano is a lecturer in the Genetics Department at Harvard Medical School, a visiting research fellow at the University of Manchester, and president and founder of Predictive Medicine. She is a co-organizer of the BioPathways Consortium and a member of the BioPAX Workgroup.
Bader, G. D., M. P. Cary, et al. (2006). “Pathguide: a pathway resource list.” Nucleic Acids Res 34(Database issue): D504-6. http://pathguide.org