It is traditional to begin the New Year by assessing the old year, and by making predictions. I would like to honor this tradition by pointing to what I believe was a very important set of developments for bioinformatics in 2004, and predicting where these developments may lead us in the next few years. Those at the top of my list are related to the evolution of the Web.
In October 2004, the World Wide Web Consortium, known as W3C, held a workshop on the Semantic Web for Life Sciences, attended by 115 people from 77 organizations. Attendance was severely limited by space, the size of the applicant list having surprised the organizers.
Why was there such an interest in this conference?
It is commonplace now to say that biology has become an information science. This is partly meant to suggest that we have acquired a computationally tractable model of biology. Genomes interlinked with large masses of bibliographic and other information are, yes, “computable” beyond operations on unstructured text, in ways that relate to at least one aspect of the actual biology, i.e., by computing on sequence representations. We can search the bibliographic corpus, find processed information related to sequence, and compute on a model of the sequences. All this information is available via the Web, which is what makes it broadly useful.
Bizarrely, however, the Web is also relatively impervious to intelligent navigation by software agents. This means, among other things, that as knowledge on a topic becomes richer, our ability to integrate that knowledge deteriorates — despite our capabilities in free-text retrieval.
Google — and its milestone 2004 beta release of Google Scholar — is great, but it is just a search engine. It cannot integrate. Even if all the texts in the world were digitized and retrievable, biologists would still be winnowing them by hand, and still making huge piles of printouts with Post-its stuck to them.
Enter the Semantic Web. Biologists want to form, in areas of interest, complete, integrated pictures composed from multiple experimental perspectives — essentially a montage of multiple scientific results at many levels of abstraction. What we want is to extend our knowledge by addressing gaps or contradictions in the montage.
What is the Semantic Web for Life Sciences and how can it help? W3C defines it as “a vision for the future of the Web in which information is given explicit meaning, making it easier for machines to automatically process and integrate information available on the Web.” In other words, the Semantic Web is a web of pages that humans and machines can understand equally well, and that software agents can therefore compute upon.
Humans can understand today’s Web pages, but machines cannot. Machines can only understand those parts with defined meaning. Currently the only items on most Web pages with defined meaning are the URLs. If we are to get beyond the search-engine view of Web integration, we must provide metadata concerning what the information on a Web page means. Today this is mostly, as they say, “an exercise left to the reader.” But not for long, in my view. At the Semantic Web conference, it became clear that the technological basis for the Semantic Web for Life Sciences is now mostly in place, much of it having come to fruition during 2004.
There are five essential requirements of the life science Semantic Web. They are support for: identification, interoperability, classification, properties, and relationships. These requirements are supported by three main technologies: LSID, RDF (I’m including RDF Schema with this), and OWL.
LSID is the Life Science Identifier standard, a specialized URN namespace, standardized in 2004. It permits any object to be given a permanent, locally defined, globally accessible identifier.
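To make the identifier idea concrete, here is a minimal sketch of taking an LSID apart. It assumes the colon-separated layout urn:lsid:authority:namespace:object with an optional trailing revision; the function name parse_lsid, the LSID tuple, and the example identifier under example.org are all hypothetical, chosen only for illustration, and the sketch does not attempt full validation or resolution.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str            # who issued the identifier, usually a DNS name
    namespace: str            # the authority's own grouping of objects
    object_id: str            # the locally defined identifier
    revision: Optional[str]   # optional version of the object


def parse_lsid(text: str) -> LSID:
    """Split an LSID string into its components, or raise ValueError."""
    parts = text.split(":")
    # Expect "urn", "lsid", authority, namespace, object, and optionally a revision.
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)


# Hypothetical identifier, made up for this example.
example = parse_lsid("urn:lsid:example.org:proteins:P12345:2")
print(example.authority, example.namespace, example.object_id, example.revision)
```

The point of the layout is that the authority defines the name locally, while the urn:lsid prefix makes it recognizable anywhere.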
RDF is a W3C data model, significantly updated in 2004, for representing Web objects and their relationships, as ‹subject› ‹relationship› ‹object› triples. These objects can range from bibliographic information, to genes, proteins, transcripts, and metabolites. RDF Schema adds class hierarchies to RDF.
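The triple model can be sketched in a few lines of plain Python. This is not an RDF library, only an illustration of the subject, relationship, object shape and of how statements from different sources can be joined; the relationship names (encodes, interactsWith, mentions), the helper functions, and every identifier under example.org are assumptions invented for the example.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, relationship, object)

GENE = "urn:lsid:example.org:genes:G1"
PROTEIN_1 = "urn:lsid:example.org:proteins:P1"
PROTEIN_2 = "urn:lsid:example.org:proteins:P2"
PAPER = "urn:lsid:example.org:papers:A1"

# Hypothetical statements, as they might arrive from separate sources.
TRIPLES: List[Triple] = [
    (GENE, "encodes", PROTEIN_1),
    (PROTEIN_1, "interactsWith", PROTEIN_2),
    (PAPER, "mentions", GENE),
]


def match(
    triples: Iterable[Triple],
    subject: Optional[str] = None,
    relationship: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield the triples that agree with every field that is not None."""
    for s, r, o in triples:
        if subject is not None and s != subject:
            continue
        if relationship is not None and r != relationship:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, r, o)


def papers_about_protein(triples: List[Triple], protein: str) -> List[str]:
    """Join two patterns: papers that mention a gene encoding the protein."""
    genes = {s for s, _, _ in match(triples, relationship="encodes", obj=protein)}
    return sorted(s for s, _, o in match(triples, relationship="mentions") if o in genes)


print(papers_about_protein(TRIPLES, PROTEIN_1))
```

The join in papers_about_protein is the kind of integration a search engine cannot do: it connects a bibliographic statement to a molecular one through a shared identifier.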
OWL is the Web Ontology Language, issued as a W3C Recommendation in February of 2004. It provides facilities for representing an ontology (a specification of the kinds of objects in a domain and their relationships) and making inferences upon it.
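As a rough illustration of what "making inferences" means, the sketch below derives facts that were never stated directly by following subclass links in a tiny class hierarchy. It covers only this one kind of inference; a real OWL reasoner supports far more. The class names and the functions superclasses and inferred_types are hypothetical, invented for the example.

```python
from typing import Dict, Set

# Hypothetical mini-ontology: each class maps to its direct superclasses.
SUBCLASS_OF: Dict[str, Set[str]] = {
    "Kinase": {"Enzyme"},
    "Enzyme": {"Protein"},
    "Protein": {"Macromolecule"},
}


def superclasses(cls: str, subclass_of: Dict[str, Set[str]]) -> Set[str]:
    """Return every class reachable by following subclass links upward."""
    found: Set[str] = set()
    pending = [cls]
    while pending:
        current = pending.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in found:  # also guards against cycles
                found.add(parent)
                pending.append(parent)
    return found


def inferred_types(asserted_type: str, subclass_of: Dict[str, Set[str]]) -> Set[str]:
    """An object asserted to be of one class also belongs to all its superclasses."""
    return {asserted_type} | superclasses(asserted_type, subclass_of)


# Only "Kinase" is asserted; the other memberships are inferred.
print(sorted(inferred_types("Kinase", SUBCLASS_OF)))
```

An agent that knows only that an object is a Kinase can thus answer a query for every Protein, even though no source ever said so explicitly.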
RDF and OWL build on familiar technologies, starting with XML. LSID is a form, as noted, of URN namespace — essentially a location-independent persistent URL.
What can we do with these technologies? Let me note just a few examples at various stages of development (check online for URLs):
Representing the Interactome: BioPAX
Organizing digital libraries: SIMILE
Universal information client: Haystack
Transparent distributed workflow: Taverna
Distributed collaboration networks: FOAF
Still to Come
Ultimately what we are headed toward with the Semantic Web for Life Sciences is, in my opinion, a comprehensive global collaboration environment for biology. As we move the “intelligence” for data interpretation into metadata, it becomes increasingly possible to construct integrated knowledge bases with software agents.
But what will emerge will not be, as some think, a kind of super machine to answer all questions in biology. Interpretation of data, as we know, varies — and variation is greater in the most active research areas. If we move interpretation into metadata linked with software, we will bring out all the contradictions, divergent views, and gaps in our collective knowledge. This is a good thing, because we then will have made the most essential raw material for scientific endeavor widely, democratically, and collaboratively available on the Web.
Tim Clark is director of the Center for Interdisciplinary Informatics at the Mass General Institute for Neurodegenerative Disease. He was formerly chair of I3C, which developed the LSID specification.
Tim’s list of URLs relevant to this column:
Representing the Interactome – BioPAX: http://www.biopax.org/
Organizing digital libraries – SIMILE: http://simile.mit.edu/
Universal information client – Haystack: http://haystack.lcs.mit.edu/
Transparent distributed workflow – Taverna: http://www.mygrid.org.uk
Distributed collaboration networks – FOAF: http://www.foaf-project.org/
Semantic Web: http://www.w3.org/2001/sw/
- in Life Sciences (SW-LS)
LSID
- Journal Reference: Brief Bioinform. 2004 Mar;5(1):59-70. PMID 15153306.
- Specification http://www.omg.org/cgi-bin/doc?dtc/04-10-08
- Resolver Code http://www-124.ibm.com/developerworks/oss/lsid/
RDF & RDF-S
- Tutorial http://www.w3.org/2000/10/swap/
OWL & Ontology editing
Jena Semantic Web Toolkit: http://jena.sourceforge.net/index.html
Semantic Grid: http://www.semanticgrid.org/documents/sigmod/ami9.html