Not only has the Gene Ontology established itself as the de facto standard for describing gene products in databases, it’s spawned some of the funkiest acronyms in bioinformatics: GOBO, EGO, AmiGO, GOFish, GOAT, GO2PDB, and a number of other projects have GO to thank for their etymological, as well as ontological, roots. Now, in collaboration with the GO Consortium, a team of researchers from the University of Manchester has added a new project to the list: GONG (Gene Ontology Next Generation), which aims to extend the functionality of the current GO vocabulary using the DAML+OIL (DARPA Agent Markup Language + Ontology Inference Layer) language for representing ontologies.
Members of Robert Stevens’ Manchester team have been working on methods to translate GO terms smoothly into DAML+OIL, and detailed their work at the Pacific Symposium on Biocomputing in early January (the paper is available via the PSB 2003 online proceedings: www.smi.stanford.edu/projects/helix/psb03/wroe.doc). Lead author Chris Wroe recently spoke to BioInform to put some plain English behind all the alphabet soup.
What was the motivation for GONG? Is something missing from the Gene Ontology?
Well, I wouldn’t say anything is really missing [from GO]. Because it is so popular, it’s starting to be used by a lot of people, and it’s getting larger and larger. And what you find is that a lot of the terms within the Gene Ontology are actually becoming phrases, and some of those are quite simple, so you could have something like heparin metabolism. The problem is that it’s essentially a combination of things — it’s a process acting on, say, a chemical, and as soon as a new set of chemicals comes along you have to enumerate all of the different processes that can act on them, and you can get sort of a combinatorial explosion of the number of terms that you need to describe something. So what DAML+OIL allows you to do is what some people have called ‘conceptual Legos’ — you can break all the terms apart into smaller bits and then put them together into phrases so that you’d have a smaller term for the kind of chemical and for the kind of process. And then either the GO editorial team can put these phrases together as they need them, or you can wait until you need to use a phrase in annotation and build terms on demand.
It’s trying to address that combinatorial explosion in a way that also allows computers to interpret the phrases. So we have software called description logic software that will interpret these phrases, and it will do two things for you: It will make sure that all the phrases you put together are consistent with each other, so you haven’t brought in any contradictions or any sort of logical nonsense; and it will also allow you to organize the phrases in hierarchies. So, rather than the GO annotators having to meticulously place all of these long conjugated phrases in lots of different places in the hierarchy, computer software can assist in that.
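To make the "conceptual Legos" idea concrete, here is a minimal Python sketch, not the GONG software and not DAML+OIL itself. It assumes each term is simply a process acting on a chemical, and that small is-a hierarchies exist for both parts. The hierarchies, the `Term` class, and the `infer_parents` helper are all illustrative assumptions. The sketch places each composed term under its most specific parents automatically, which is a toy version of what the description logic software does for far richer definitions.

```python
from dataclasses import dataclass

# Toy is-a hierarchies (child -> parent). Illustrative only, not taken from GO.
CHEMICAL_PARENT = {
    "heparin": "glycosaminoglycan",
    "glycosaminoglycan": "carbohydrate",
    "carbohydrate": "chemical",
}
PROCESS_PARENT = {
    "biosynthesis": "metabolism",
    "catabolism": "metabolism",
    "metabolism": "process",
}


def is_a(child, ancestor, parents):
    """True if child equals ancestor or reaches it by following parent links."""
    while child is not None:
        if child == ancestor:
            return True
        child = parents.get(child)
    return False


@dataclass(frozen=True)
class Term:
    """A composed term: a process acting on a chemical (hypothetical model)."""
    process: str
    acts_on: str

    def name(self):
        return f"{self.acts_on} {self.process}"


def subsumes(general, specific):
    """general subsumes specific if both of its parts are equal or more general."""
    return (is_a(specific.process, general.process, PROCESS_PARENT)
            and is_a(specific.acts_on, general.acts_on, CHEMICAL_PARENT))


def infer_parents(term, terms):
    """Return the most specific terms in `terms` that subsume `term`."""
    ancestors = [t for t in terms if t != term and subsumes(t, term)]
    # Drop any ancestor that itself subsumes another, more specific ancestor.
    return [a for a in ancestors
            if not any(b != a and subsumes(a, b) for b in ancestors)]


TERMS = [
    Term("metabolism", "carbohydrate"),
    Term("metabolism", "glycosaminoglycan"),
    Term("metabolism", "heparin"),
    Term("biosynthesis", "heparin"),
]

for term in TERMS:
    parents = [p.name() for p in infer_parents(term, TERMS)]
    print(term.name(), "->", parents)
```

The point of the design is that nobody writes the hierarchy of phrases by hand: only the two small part hierarchies are maintained, and the placement of every combined term follows from them.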
A few years back there was some debate over whether GO was technically an ‘ontology’ or not, but proponents argued that it worked just fine as it was. This project seems to be a peacemaker between those two camps.
That’s the plan. Manchester’s got expertise both in the computer science and in the bioinformatics side, so we’re sort of trying to bring the two sides together. One of the reasons it’s been successful is we’ve got a close relationship with the GO Consortium, so it’s really been in cooperation with [GO founder and coordinator] Michael Ashburner.
How did you determine which were the most important phrases to be broken down into ‘Lego blocks’ first?
It started off as a feasibility study, so what we’ve done is take some of the simpler phrases and just concentrate on those. One of them is metabolism, and that fits quite neatly. What we’ve found is we can automate the process of migrating it to DAML+OIL because a lot of the phrases are very stereotyped — in the form of some sort of chemical and then the kind of metabolism that’s occurring. You can actually automate, at least to produce a candidate description in DAML+OIL, which can then be validated by someone.
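The automated step described here can be pictured as simple pattern matching on term names. The Python sketch below is an assumption-laden illustration, not the Manchester team's actual method: the list of process words, the `candidate_definition` function, and the dictionary it returns are all hypothetical. It splits a stereotyped name of the form "chemical process" into its parts and marks the result as a candidate awaiting human validation.

```python
import re

# Process words that end a stereotyped term name (assumed list, for illustration).
PROCESS_WORDS = ("metabolism", "biosynthesis", "catabolism")

# Matches names of the form "<chemical> <process>", e.g. "heparin metabolism".
PATTERN = re.compile(
    r"^(?P<chemical>.+) (?P<process>" + "|".join(PROCESS_WORDS) + r")$"
)


def candidate_definition(term_name):
    """Return a candidate decomposition of a term name, or None if it does not fit."""
    match = PATTERN.match(term_name.strip().lower())
    if match is None:
        return None  # Not a stereotyped phrase; leave it for manual treatment.
    return {
        "process": match["process"],
        "acts_on": match["chemical"],
        "status": "needs review",  # A curator still has to validate the candidate.
    }


for name in ["heparin metabolism", "amino acid biosynthesis", "cell adhesion"]:
    print(name, "->", candidate_definition(name))
```

Names that do not fit the pattern return `None` rather than a guess, which keeps the automation limited to the stereotyped phrases the interview describes.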
It’s not possible to do it as one big bang so that one day it’s in the original GO format and the next day we’ll deliver it in DAML+OIL. The only way it’s going to work is if we can evolve it, and take small sections at a time, to produce more formal definitions for these areas, but still do it within the existing Gene Ontology framework.
Have you come across any challenges in working with the Gene Ontology’s existing format and hierarchy?
Not so far. It’s been fairly consistent in its view of biology, so we’ve not had any major problems. There are certain naming issues, because it’s been designed specifically for biologists. Not particularly in my work, but linking the Gene Ontology into other terminologies, like UMLS [Unified Medical Language System]. They’re currently trying to integrate the Gene Ontology into UMLS, and because the Gene Ontology takes a particular view of biology — some of the enzyme names occur under molecular function — if you have some sort of hydrolase, when somebody not used to the Gene Ontology looks at that, they assume it’s referring to the actual enzyme, where in fact it’s referring to the catalytic capability of the enzyme, the functional capability, and that causes some problems when you try to relate it to other terminologies because the same term is used in two different contexts.
How many people are working on GONG?
It’s a mixture. The project’s focus has been on developing the software to undertake the process. It’s funded by the DARPA [Defense Advanced Research Projects Agency] DAML program, so it’s actually funded outside bioinformatics. The Gene Ontology was just supposed to be an example of how you could do this. But actually, because we’ve had such a close relationship with the Gene Ontology Consortium, it’s turned out that as well as doing that, we’re actively concentrating on Gene Ontology content. So in Manchester we sort of have a technical focus. I’ve worked on some of the automated techniques for at least producing candidate descriptions, and we have one other technical programmer working in Manchester, and it’s overseen by Robert Stevens. And we interact with the GO editorial team. Jane Lomax [at the European Bioinformatics Institute] actually responds to any changes that we suggest based on the process that we’ve been through. We get reports based on the software that we use to examine the definitions we create, and they suggest additional hierarchical links. Mike Ashburner is also involved in it.
You mentioned the description logic software. What other software are you using?
The description logic is sort of generic software that’s used to examine conceptual Legos in DAML+OIL. It’s been developed independently of the Gene Ontology work, jointly by European and US teams, and it is used in areas such as the semantic web. But that’s very generic technology, so what we’re trying to do is translate some of those tools into a specific Gene Ontology area. The GO Consortium has already got a lot of ontology development tools, and we’re trying to take generic DAML+OIL tools and the specific Gene Ontology tools and meet somewhere in the middle.
How far along are you in that process?
It’s early stages at the moment, so we’ve got prototypes, and we’re just really starting the major development work. We can do things like show these more formal definitions of Gene Ontology terms in Gene Ontology browsers, such as AmiGO, but there does need to be a lot of management software because you’re increasing the amount of content in the Gene Ontology, and you need a lot of software to manage that increase in content and keep track of it changing over time.
What does this mean for other projects that link to GO or use GO terms?
The plan would be not to have to enforce any changes in the way people have to access the Gene Ontology. It would just be adding extra capability to what’s there already. GO has been very successful in delivering an ontology that can be used by the community, and I wouldn’t want to add any complexity to that, so the whole idea of the software development effort is really to shield end users from having to interact with the complexity of DAML+OIL. The idea is that eventually it will allow you to do more with the Gene Ontology than you can now.
Could you give an example of something that’s not possible to do with GO now, but might be if GONG goes as planned?
I’ll give you a trivial example. Because you’ve created a more explicit definition of what the terms are made up of, if you’re interested in all the gene products that are involved with some chemical in some way, at the moment the chemical information isn’t explicitly represented in the Gene Ontology, it’s just embedded within the term names. But now you could ask for all the gene products that involve a carbohydrate chemical, and because you have an explicit representation of how each Gene Ontology term relates to the chemical, and an associated chemical ontology, you can start to ask those more specific questions.
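The kind of query described in this answer can be sketched in a few lines of Python. Everything in the sketch is made up for illustration: the gene product names (`geneA`, `geneB`, `geneC`), the annotations, the small chemical hierarchy, and the `gene_products_involving` function are assumptions, not GO or GONG data. It only shows why an explicit link from each term to a chemical, plus a chemical ontology, makes the question answerable.

```python
# Toy chemical ontology (child -> parent). Illustrative only.
CHEMICAL_PARENT = {
    "heparin": "glycosaminoglycan",
    "glycosaminoglycan": "carbohydrate",
    "glucose": "carbohydrate",
    "glycine": "amino acid",
}

# Explicit link from each term to the chemical it acts on (hypothetical data).
TERM_ACTS_ON = {
    "heparin metabolism": "heparin",
    "glucose catabolism": "glucose",
    "glycine biosynthesis": "glycine",
}

# Gene product annotations (made-up gene product names).
ANNOTATIONS = {
    "geneA": ["heparin metabolism"],
    "geneB": ["glycine biosynthesis"],
    "geneC": ["glucose catabolism", "glycine biosynthesis"],
}


def chemical_is_a(chemical, chemical_class):
    """True if chemical equals chemical_class or descends from it."""
    while chemical is not None:
        if chemical == chemical_class:
            return True
        chemical = CHEMICAL_PARENT.get(chemical)
    return False


def gene_products_involving(chemical_class):
    """Gene products annotated with any term acting on a member of chemical_class."""
    return sorted(
        gene_product
        for gene_product, terms in ANNOTATIONS.items()
        if any(chemical_is_a(TERM_ACTS_ON[t], chemical_class) for t in terms)
    )


print(gene_products_involving("carbohydrate"))  # ['geneA', 'geneC']
```

Without the `TERM_ACTS_ON` link, the only way to answer the same question would be to search term names as text, which is the situation the answer describes for the current Gene Ontology.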
What are the next goals for the GONG project?
We’ve proved the concept and I think we’ve convinced the Gene Ontology Consortium that it’s a useful technology. The next stage is really two-fold. It’s starting on the software development effort, and it’s also trying to transfer some of the expertise that we have in Manchester to do DAML+OIL ontologies to the Gene Ontology Consortium. Because although we are bioinformaticians, the best people to create these definitions are the Gene Ontology Consortium themselves, rather than us. So it’s only going to be successful if we can transfer the expertise to the Gene Ontology.