Head of Stanford Medical Informatics
Stanford University School of Medicine
In late September, the National Institutes of Health announced that it had awarded a total of $56 million to fund three new National Centers for Biomedical Computing. One of these will be the National Center for Biomedical Ontology, which will be headed by Mark Musen, a professor of medicine at Stanford University whose research group created the ontology-development software Protégé.
The center, to be called cBIO, was awarded $18.8 million, and will include the informatics group at the Berkeley Drosophila Genome Project, which supports the Gene Ontology Consortium and the Open Biomedical Ontologies library, as well as the Mayo Clinic, the University of Victoria in British Columbia, and the State University of New York-Buffalo.
Under the NCBC program, external groups are also encouraged to collaborate with the center through the NIH R01 funding mechanism. Further information about the center is available at http://bioontology.org/.
Ontologies have generated quite a bit of debate in the bioinformatics community over the last few years, and Musen himself said in a statement that he was "surprised" at the award because the center's goals are "probably a bit more cutting-edge than is typical." The NCBC grant appears to be a sign, however, that biomedical ontologies are moving from the fringes of the bioinformatics community into the mainstream.
BioInform spoke to Musen by phone after the awards were announced to get a better idea about his vision for the center and the role of ontologies in the biomedical informatics community.
Can you outline the broad goals of this center?
We're hoping that we can really transform the way ontologies are developed and the way they're used in biomedicine. When you look at the landscape, you see the creation of biomedical ontologies as almost a cottage industry, where you have lots of different groups (some in government, some professional, some in academia) all working in isolation, generating ontologies that may or may not fit together, may or may not adhere to best practices, may or may not use standard representation systems.
A scientist who wants to know what's out there is often quite confused. You can't just easily use Google to find out what ontologies exist, and more important, once you find out what's out there, you have no way of knowing if it's any good.
So one of the things that we see as our primary objective is to create an online portal that will allow biomedical scientists to access ontologies, to have them indexed so that one can easily retrieve them based on particular needs, to align them to get a sense of how they're related to one another, and to provide metadata for ontologies that would allow people to annotate them with information on what parts of them are useful, what parts of them make distinctions about what may be problematic, and basically to begin to offer for ontologies what we've had for a long time [for] knowledge that's disseminated in print media, and that's peer review.
A major focus of the center is to use this online portal to help scientists who have large datasets annotate those datasets with ontologies, to facilitate query and retrieval and comparison and all the things that people want to be able to do with standardized ways of annotating data.
The other issue that we want to address is the fact that ontologies change: as we understand the world better, we want to be able to make different distinctions, or use different sets of terms, and we want to be able to maintain some relationship between data that were created or collected in the past and the ontologies that will be evolving currently, and to try to maintain consistency, which is a major problem in science right now.
So I may have a gene sequence and not know what it is, but two years later the protein's been identified, and I'd like to be able to go back and have my sequence re-annotated. And frankly, I'd like to be able to do that without too much thought.
Is this something that would be better off as a centralized activity, where the data providers like NCBI would play a major role in standardizing and supporting these ontologies, or is this something that can remain a cottage industry, as you described it?
I think ultimately you want to provide some place where people can go for one-stop shopping, and that's why NCBI is so tremendously successful. But NCBI is limiting itself right now to data, and we think ontologies have matured to the point where we want to be able to provide analogous sets of services.
The Gene Ontology Consortium is going to be involved in this center, and they host the OBO [Open Biomedical Ontologies] repository already. Will that be expanded into the portal that you're talking about?
One of the problems with the OBO site on SourceForge [http://obo.sourceforge.net/] is that nobody has the funds to curate it. So there are a lot of really useful ontologies in that collection, but basically people write ontologies and sort of throw them over the wall, and they reside in OBO without any kind of ability to ensure that the ontologies are still useful, that they're not out of date, that they're appropriate.
Initially, we will just be pointing to the OBO site on SourceForge from our portal, but eventually we will be providing not only access to those ontologies, but the metadata annotations that will allow users to have some sense of what these ontologies are and get some insight into whether they might be useful for whatever task they want to automate.
And we see ontologies as being important not just for annotation, although that's obviously one of the major ways in which they're used in biomedicine. We want to see them used for natural language processing, for building decision support systems, for data integration, and although our center is not funded specifically for some of those kinds of tasks, we're hoping that through the mechanism of these collaborating R01s that NIH is setting up, we will be involved with projects with lots of people from outside the center who will be taking on these kinds of activities.
Are you working with the MGED consortium? You probably saw the commentary in Nature Biotech in September [Soldatova LN, King RD. 'Are the current ontologies in biology good ontologies?' Nat Biotechnol. 2005 Sep;23(9):1095-8] questioning the quality of the MGED ontology. Is MGED already engaged with the center?
Well, that's easy, because Cathy Ball, who is president of MGED, works in the next building [as director of the Stanford Microarray Database]. As an aside, our center submitted a letter to Nature Biotech in response to that article. We agree in principle that there are substantial problems in biomedical ontologies, and it's easy to pick on particular ontologies to find problems, but I think a lot of criticism that they made specifically of MGED was not appropriate.
But our goal is to begin to work closely with developers of important biomedical ontologies. One of the main activities of the center involves a dissemination activity. NIH has funded us to try to make the technology known and useful to a wide range of people, so primarily, as directed by Barry Smith of SUNY Buffalo, we will be holding workshops and seminars with the goal of being able to introduce people in biomedicine to the technologies that the center is creating, to the notion of using standard ontology representation languages, and to best practices for creating these kinds of models.
What is the role for the Protégé technology in all this? Will that need to be modified to fit more closely with the needs of biomedicine?
We don't think Protégé needs to be modified for biomedicine right now. A large number of Protégé users are people in biomedicine, just because the program was developed by people who work in biomedical informatics. But Protégé will play an important role in the center because right now it is such a widely used ontology management system, and our ontology library will primarily be using Protégé as a means for editing and storage. When it comes to ontology annotation, we'll probably still be using OBO-Edit, which the GO Consortium has promoted and which a lot of people feel more comfortable with.
One of the things we'd like to do, although we're not funded for it under the center grant, is to develop a new ontology-management system that would be more web-based and more compatible with the technology that we'll be creating under the center. Although right now we don't have funding that's going to allow us to do that specifically, it's an obvious thing to want to work on in the future.
What do you see as the role of emerging semantic web technologies? Is that something that would mesh well with the goals of the biomedical ontology center?
There's no question. The semantic web community recognizes that at the heart of what they need to do is development of the kinds of ontologies that will structure the kinds of web resources that we'll see in the next generation of the web. In fact, even web services are now becoming increasingly ontology based. And what is very exciting to a lot of people, including me, is the idea of the Internet becoming the framework for one enormous knowledge representation system of a scale that really is hard to fathom.
So the use of ontologies, the use of ontology services, is something that is, I think, going to be very exciting in the next 20 to 50 years, and we already have good ties with the working group within W3C that's working on the semantic web for healthcare and life sciences.
As you look forward, and you have all these opportunities in front of you, what do you see as the biggest challenges in meeting the goals of the center?
I think the biggest challenge is still that of formalizing knowledge. Ontologies have been touted as sort of the panacea for putting knowledge in electronic form, but it still requires enormous amounts of work to be able to identify what the right things to say about the world are, and how you ascribe precise semantics to those terms. The kinds of things that you can say about the world are infinite, and knowing how to structure ontologies so that they're maximally useful is currently still largely an art form.
I think the biggest challenge, particularly as you want to build ontologies that will scale to the large requirements of major problems in the life sciences, is going to be developing methods and supporting technologies for working with ontologies of large scale where we have a good grip on the semantics of each one of the individual terms.
Do you see particular challenges arising from biological information itself in terms of the vast amount of things that are still unknown, or the various nomenclatures and terminologies for different disciplines, or divergent frames of reference and descriptions for the same biological objects, or is that a universal challenge for ontologies in general?
I think that's a universal challenge of ontologies in general: so much of human thought is based on major assumptions about background knowledge that people don't articulate, and in life sciences that plays out because even the most fundamental concept is something where we actually have to make policies about what we really mean by something.
Just think about what it means to say, 'What is a gene?' What portion of the DNA constitutes the gene? Does it include the promoter, does it include the intervening sequences? There's not a clear definition of where in the DNA material the gene begins and where it ends, and that's something that's so fundamental to what we do that it raises questions of how we draw boundaries in everything that we want to put into an ontology.
That should keep you occupied for a while.
Is there anything else that you think is worth noting about this NIH funding and what you're able to do with it now?
One thing to mention is really the surprise that I and the other members of this consortium had over the enthusiasm within NIH for this kind of a center. When we submitted our proposal, we weren't sure whether ontologies would be viewed as mainstream, and the priority with which people would recognize that ontologies are important in biomedicine. And it's just been really gratifying to us to see that the National Heart, Lung, and Blood Institute has a request for information regarding ontologies related to its particular domain, and it really looks like there are going to be major benefits to each of the NIH institutes and basically all of us in biomedicine as we can not only build more ontologies, but actually then have them structured in some sort of an online mechanism where we can actually find the ones that we really want.