In my last column I assessed recent technological advances led by W3C and concluded that we are headed toward a global collaboration environment for biology powered by the Semantic Web. This environment would take the form of increasingly integrated, distributed biological knowledge bases.
Despite much recent technological progress, certain challenges associated with the semantic integration of biology are non-technological, or only semi-technological. This article looks at some issues that must be resolved if we are to make semantic integration real in the life sciences.
What kind of knowledge base?
Suggesting that integrating diverse sources of biological findings can result in a knowledge base requires that we reframe the classical AI-based definition of that term. Classical knowledge bases are part of an expert system — capable of reasoning to conclusions comparable to those of a human expert — on the basis of internally consistent domain knowledge in axiomatic form. But the classical knowledge base can only draw such conclusions (based on first-order logic) if: (1) the knowledge engineer has extracted all the relevant knowledge from real experts; and (2) the knowledge base is internally consistent (no self-contradiction). The process of maintaining internal consistency is called “truth maintenance.”
Here is the problem: biology is not deductive. It is inductive. The truth-value of any statement in biology is always, therefore, at least somewhat relative. For example, nothing can be more absolute than a “dogma,” but the “central dogma of molecular biology” itself was relativized — requiring modification — by the discovery of retroviruses. Nature is infinite, and therefore surprising.
The process of arriving at large, integrated truths in biology is typically based on a mosaic of findings from research teams around the world.
Let me give an example from the field of Alzheimer’s research. Many important research results have been achieved in this field in the last 20 years. At this point, while the association of several important genetic mutations with various forms of Alzheimer’s has been demonstrated, we are far from exhausting possibilities. Recent advances come from studies in flies, worms, cell biology, imaging, pharmacology, and more. But there is still no common agreement among all researchers on the causation of late-onset Alzheimer’s.
In active research areas, data "underdetermines" theory: truth has not been fully established. The goal of a knowledge base in biology must be to connect disparate findings from researchers in the same or different fields. Essentially, the goal is to establish a social dialogue, or a virtual collaboration, in which people are able to use each other’s research results without necessarily knowing one another. In this view of a biological knowledge base, truth maintenance needs to be established over the provenance of ideas, and the terms of reference of these ideas, but not over the ideas themselves.
In this respect, PubMed is a very significant knowledge base. It presents a snapshot of “truth in the process of evolution.” There are no axioms, just a body of scientific knowledge and techniques, infinite nature, and the ability to experiment and communicate.
This is to suggest that in search of proper formalisms for integrated biological knowledge bases, we look toward including human interaction, dialogue, and activity in the model. In this way we begin to include a realistic epistemology reflective of actual science.
PubMed is a centrally curated resource. It has its own ontology, consisting of the Medline identifiers on articles, the bibliographic metadata published with each article, and the MeSH terms assigned by NLM indexers.
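The three metadata layers that make up this ontology can be sketched as a simple structure. This is an illustrative sketch only — the field names, the sample PMID, and the helper function are hypothetical, not NLM's actual schema:

```python
# Illustrative sketch of the three metadata layers a PubMed record carries:
# a stable Medline identifier, bibliographic metadata published with the
# article, and MeSH terms assigned by NLM indexers.
# All field names and values are hypothetical.
record = {
    "pmid": "12345678",  # unique Medline identifier (placeholder value)
    "bibliographic": {
        "title": "Example article on amyloid processing",
        "journal": "Example J. Neurosci.",
        "year": 2005,
    },
    "mesh_terms": [  # curated subject headings assigned by indexers
        "Alzheimer Disease",
        "Amyloid beta-Peptides",
    ],
}

def mesh_overlap(a, b):
    """Return the MeSH terms two records share -- one simple hook
    for connecting findings across otherwise unrelated articles."""
    return set(a["mesh_terms"]) & set(b["mesh_terms"])
```

Each layer offers a different integration hook: the identifier for provenance, the bibliographic metadata for attribution, and the curated terms for subject-level linking.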
But can we have non-centralized, i.e., distributed, knowledge bases? That is, for example, can an “alternative” PubMed assign its own ontology to articles, and can we get these knowledge bases to interoperate in a constructive way? This is not an idle question. The 2005 Database Issue of Nucleic Acids Research published information on 719 biological databases, an increase of 171 over the previous year.
So how do we make it possible to integrate the heterogeneous ontologies of these several hundred sources of information? That is what is required to integrate their information deeply, beyond the level of click-through navigation of Web pages. Possible approaches include: a single ontology good for all of them, hierarchical clustered ontologies, or multiple overlapping specialist ontologies. This problem has been studied extensively in commercial information systems.
We must work with what we have, and as a starting point, incomplete ontological mappings in biology may be OK. It may be highly effective merely to map several disjoint ontologies to a document, or to claims made in a document. If the point is to assist people in forming virtual collaborations, in being “surprised” by finding new research in another field relevant to theirs — then disjointness in schemas laid over text is not so bad, for it is resolved in some ways by the document itself.
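This notion of disjoint schemas laid over the same texts can be made concrete with a small sketch. Here two specialist ontologies — the ontology names, document identifiers, and terms are all hypothetical — independently annotate a set of documents, and the overlap in annotations is what surfaces the “surprising” cross-field connections:

```python
# A minimal sketch of disjoint ontologies laid over the same literature.
# Two specialist communities annotate documents independently; shared
# terms across annotation sets suggest cross-field connections.
# All ontology names, document IDs, and terms are hypothetical.

neuro_annotations = {
    "doc1": {"presenilin-1", "amyloid processing"},
    "doc2": {"tau phosphorylation"},
}
cell_bio_annotations = {
    "doc1": {"gamma-secretase complex"},
    "doc3": {"amyloid processing", "membrane trafficking"},
}

def surprising_links(ann_a, ann_b):
    """Pair distinct documents from two annotation sets that share a term."""
    links = []
    for doc_a, terms_a in ann_a.items():
        for doc_b, terms_b in ann_b.items():
            shared = terms_a & terms_b
            if doc_a != doc_b and shared:
                links.append((doc_a, doc_b, shared))
    return links
```

The mappings need not agree, or even overlap completely, for this to be useful: a single shared term is enough to put a neuroscientist's paper in front of a cell biologist who would never have searched for it.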
Other issues include data provenance, reproducibility of research data and computation, trust and security policy, and not least, socialization.
Socialization of any integration approach means getting it adopted and made part of the practice of the community. As we know, it is not always the best technology that is adopted most widely. Often, it is the one that is easiest to socialize — whether through lower price, better distribution channels, or superior marketing. Socialization has a very important implication for large-scale integration, which requires a broad set of ontologies that must be developed and curated — an expensive proposition.
What would be ideal is for all curation to be done by the document’s authors. This would be analogous to GenBank’s transition to direct submission, which was powered by the decision of journal editors publishing sequence data to refuse publication without a GenBank accession number — obtainable only by submitting acceptable data in a particular schema. I believe that day may not be so far off for other kinds of publications concerned with more general knowledge content, particularly if the schema had properties that enabled researchers to more efficiently structure and manage their own data.
Tim Clark is director of the Center for Interdisciplinary Informatics at the Mass General Institute for Neurodegenerative Disease. He was formerly chair of I3C, which developed the LSID specifications. Tim can be reached at [email protected]
W3C (2005) W3C Semantic Web Activity. http://www.w3.org/2001/sw/
Crick FHC (1958) “The Biological Replication of Macromolecules,” Symp. Soc. Exp. Biol. XII: 138
Temin HM, Mizutani S (1970) Nature 226: 1211
Crick FHC (1970) Nature 227: 561-563
Galperin MY (2005) Nucl. Acids Res. 33: D5-D24
Visser PRS, et al. (1997) AAAI 1997 Spring Symposium on Ontological Engineering
Guarino N (1998) Proceedings of FOIS’98, pp. 3-15