To get computational biologists to speak the same language, the Bio-Ontologies Consortium is borrowing abstractions from AI.
by Dennis Waters
The world of software standards is a fountainhead of cynical humor, with the typical punch line being something like “I’m fully in favor of standards as long as they’re mine.” As standards-setting committees often appear to make decisions based on politics, ego, or money instead of technical elegance or operational practicality, the attitude is justified.
“Competing standards” might be an oxymoron. Nevertheless, it’s the reality in many disciplines. Bioinformatics is no exception.
Somehow a field with only a few thousand serious practitioners worldwide has managed to spawn more than a half-dozen standards efforts, each with its own agenda. These range from the well-organized and well-funded Life Sciences Research Task Force of the Object Management Group to the European Bioinformatics Institute’s BioStandards project to more informal and focused groups like BioPERL, BioJava, BioXML, and BioPython.
But no inventory of life sciences software standards efforts would be complete without the Bio-Ontologies Consortium (BOC), which hopes to trump all of the above. The prime evangelist for the BOC these days is Eric Neumann, 41, vice president for life science informatics at 3rd Millennium, a consultancy in Cambridge, Mass. It might seem an unlikely calling for a fellow whose other passions include playing rockabilly on his bass guitar and paddling a canoe, but Neumann is driven.
Putting Philosophy into Computational Biology
BOC’s history dates to the 1998 Intelligent Systems for Molecular Biology meeting in Montreal, when Neumann, then employed by NetGenics, along with Robin McEntire of SmithKline Beecham, Peter Karp of SRI, and others, began looking at the bioinformatics standards problem from a more abstract level.
Formed initially as a study group, BOC today comprises about 50 people from industry and academia. It is supported in part by AstraZeneca and Glaxo Wellcome and, according to Neumann, is getting positive feedback from the pharmaceutical and academic communities. (A companion effort, co-chaired by Neumann and Vincent Schachter of Hybrigenics in Paris, is the Bio-Pathways Consortium, which aims to clarify the world of pathways.)
While “ontology” is one of those terms from Philosophy 101 that most people would just as soon forget, BOC derives the term from the world of artificial intelligence. According to an August 1999 BOC white paper, “ontologies are specifications of the concepts in a given field and the relationships among those concepts.”
In other words, while ordinary software standards deal with things like procedure calls and data models and field specifications, ontologies attempt to formalize something more abstract: the fundamental concepts of a field. In principle, a well-constructed ontology should be able to serve as an umbrella for all other software standards and allow practitioners to avoid some major pitfalls.
“Don’t try to map biology to code directly,” says Neumann, a former fly guy with a PhD in neurobiology from Case Western Reserve. “You’ll shoot yourself in the foot assuredly. Find the basic set of data structures you need to work with and the richer information, how these things relate, captured in another metadata structure like an ontology.”
“If you try to capture the structure in a specification like an API and it gets hard-wired into a program, sooner or later you’ve got to break that program and recompile it with new information,” says Neumann, citing nameless former employers who’ve made the mistake. “I’ve been with companies that have done that and they don’t know how to modify even the smallest piece without everything in the development process breaking down.”
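The distinction Neumann draws can be illustrated with a minimal sketch. It is not BOC code: the `Ontology` class, the relation names (`is_a`, `encodes`), and the concept names are all hypothetical, chosen only to show relationships being held as data rather than wired into program structure.

```python
# Hypothetical sketch: concepts and relationships held as data, not as classes.
from collections import defaultdict


class Ontology:
    """A tiny store of (subject, relation, object) statements."""

    def __init__(self):
        # Maps (subject, relation) to the set of related objects.
        self._statements = defaultdict(set)

    def add(self, subject, relation, obj):
        self._statements[(subject, relation)].add(obj)

    def objects(self, subject, relation):
        # Return a copy so callers cannot alter the stored statements.
        return set(self._statements.get((subject, relation), set()))

    def is_a(self, concept, ancestor):
        """Follow 'is_a' links transitively to test whether concept falls under ancestor."""
        seen, stack = set(), [concept]
        while stack:
            current = stack.pop()
            if current == ancestor:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self.objects(current, "is_a"))
        return False


onto = Ontology()
onto.add("protein_coding_gene", "is_a", "gene")
onto.add("gene", "is_a", "sequence_feature")
onto.add("gene", "encodes", "protein")

# New knowledge arrives later as one more statement; no class is edited or recompiled.
onto.add("rna_gene", "is_a", "gene")
```

In this sketch, extending the vocabulary means adding a statement, whereas a hard-wired class hierarchy would need a code change for every new concept.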
Tailoring a Tongue
Every good ontology needs a language and in its early days BOC focused on sifting through existing languages to find the one best-suited to the needs of bioinformatics. Not surprisingly, none fit the bill exactly, so the group decided to create its own. The result was XOL, the “XML-Based Ontology Exchange Language.”
Karp and SRI colleagues Vinay Chaudhri and Jerome Thomere created XOL in 1999. It was inspired by the venerable Ontolingua language, but with XML-based syntax in place of good old LISP. It also owes something to the Ontology Markup Language (OML), differing chiefly in its use of Open Knowledge Base Connectivity (OKBC-Lite) for its semantics instead of OML’s conceptual graphs. (Details can be found at smi-web.stanford.edu/projects/bio-ontology/.)
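To give a feel for what exchanging an ontology as XML might involve, here is a rough sketch using Python’s standard library. The markup is illustrative only: the element and attribute names are assumptions made for this example and should not be read as the actual XOL definition, and `load_classes` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Illustrative markup only; element names are assumptions, not the XOL DTD.
DOCUMENT = """
<ontology name="toy-genetics">
  <class name="sequence_feature"/>
  <class name="gene">
    <subclass-of>sequence_feature</subclass-of>
    <slot name="encodes">protein</slot>
  </class>
  <class name="protein"/>
</ontology>
"""


def load_classes(xml_text):
    """Return {class name: {"parents": [...], "slots": {...}}} from the toy markup."""
    root = ET.fromstring(xml_text)
    classes = {}
    for node in root.findall("class"):
        classes[node.get("name")] = {
            # Parent concepts listed under this class.
            "parents": [p.text for p in node.findall("subclass-of")],
            # Named relationships from this class to other concepts.
            "slots": {s.get("name"): s.text for s in node.findall("slot")},
        }
    return classes


classes = load_classes(DOCUMENT)
```

Because the concepts travel as a document rather than as compiled interfaces, a receiving program can read definitions it has never seen before without being rebuilt.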
By remaining at an abstract level and not hard-coding biological detail, BOC’s efforts in life sciences may in time supersede even those of well-established groups like the OMG, Neumann maintains. That’s because the Common Object Request Broker Architecture (CORBA) middleware promoted by OMG is not the ideal infrastructure for ontology exchange in biology, according to Neumann and others.
CORBA’s object orientation depends on object bindings, which must be specified beforehand, so its weaknesses become apparent when new and ill-defined information is at stake. From a scientist’s view this is precisely the most interesting information. “If you want to ask what’s a gene and what’s not, I can almost say with certainty the CORBA IDL spec will never capture it,” Neumann judges.
It’s perhaps no accident that while CORBA began gaining popularity four years ago, very few life sciences databases have been implemented with it. Karp says he can think of only two, both at the European Bioinformatics Institute: one for radiation hybrids, another for expression clusters. He also notes that because CORBA has its genesis in local area networks, its performance over the Internet may not be so good.
Code That’s Fresh as Milk
While Neumann acknowledges that his employer clearly stands to gain from his extracurricular activities (“3rd Millennium benefits from having standards because we can integrate them for our customers”), his commitment to the consortium is motivated by a personal interest in collaborative, “big-picture-type,” open-source projects. He names Tim Berners-Lee, inventor of the World Wide Web, as a role model.
What’s going to keep the ontology open? A tenet of the BOC is that applications should be licensable, so that they can be closed at the content level while remaining open at the infrastructure level. The idea is modeled after Berners-Lee’s World Wide Web Consortium, which dictates that, for example, a Web page’s HTML source code is open for all to see, but the source code of a Java applet embedded in the page is not. Within life sciences research Neumann cites BOSC (the Bioinformatics Open Source Conference) as a good example of open-source collaboration.
It was ultimately this view that, late last year, caused Neumann to leave NetGenics after a two-year stint and move into consulting. He says his departure resulted from a “political war” over the company’s strategic direction, which is to focus on proprietary software development.
Says Neumann: “There’s no advantage to holding source code because now within six to nine months your code is not useful any more. Software doesn’t work like wine. It works like milk. It doesn’t accrue value over time. In this field you need milk.”
-With reporting by Jennifer Friedlin and Potter Wickware