You're investigating a protein and want to know what other proteins occur in the same pathway — or maybe what other experiments have been done with that pathway or protein, and what's been published. Being the Sherlock Holmes of biology that you are, you might also like to inquire about what alternative forms are known, or what other flavors of that protein exist across a particular population. "No problem," you might say, "I can answer all of those questions without batting an eye." Sure you can — if you like pain.
Many scientists would like to be further down the information chain, spending more time on discovery and less time trawling through myriad pockets of biological data, in a plethora of formats and standards, that currently reside on the Web. But what is the alternative? "I would like computers to go out and aggregate data for me … and then bring it back to me in a form that's usable where I can contribute to the creative process farther down the chain," says Damian Gessler, program leader at the National Center for Genome Resources. "Right now, scientists, grad students, and postdocs spend an inordinate amount of time just trying to find the data, and once you find some in one website, it's very difficult to then take that and map that to associated data in other websites. It's a very laborious process, and computers can do a better job."
Mark Musen, a professor of medicine at the Stanford Center for Biomedical Informatics Research, believes that biologists will soon begin to see some of this bioinformatics burden lifted off their shoulders. "The past 10 years have seen this incredible excitement about the possibility of putting information that can be understood by computers online and the ability to translate life sciences data in a way where computers can reason and communicate about data and knowledge," Musen says. "I think we're at the point where we could have enough critical mass within a relatively short time that the basic bench biologists will begin to see the effects of [the semantic Web], but we're in the stage now where we have a lot of data online and knowledge online."
The semantic Web is a vision of the Internet in which interoperability and integration issues are a thing of the past: all data on the Web, whether concepts, documents, images, or something else, is tagged within a framework that allows software agents to determine whether that information is relevant to whatever query you happen to be running.
The idea of creating a Web universe of information that is all tagged in the same way is, at least on paper, the ultimate solution to the challenges of data integration and interoperability for life sciences. Dialogue between bioinformaticists and semantic Web developers has been steadily increasing for a number of years now as widespread data integration problems have clearly begun to impede the progress of research. "Semantic Web gives improvement on aspects of interoperability that are very useful in many situations," says Tim Clark, director of informatics at the Massachusetts General Hospital's Institute for Neurodegenerative Disease. "Most of them can be very helpful in biology because primarily it's an inductive, experimentally-driven science, and there's a lot of classes of phenomena to observe and classes of things and names."
A brief history
The semantic Web might not seem like such a far-flung utopian vision preached by starry-eyed zealots were it not for the way the World Wide Web grew up. Because the majority of data on the Web relies on context for its meaning, it turns out that the best processors of the disparate resources that populate it are, well, people. Say you want to search for the term gene, for example: to a human it's obvious that a DNA sequence that encodes a protein is different from Gene Simmons, but that's not clear to a computer without the proper tags in place.
In the late 1990s, Sir Tim Berners-Lee, the man credited with essentially inventing the Web, began to entertain a vision for his creation where, instead of a tapestry of documents with data presented in a way only humans could effectively evaluate, the Web would itself become one massive database where computers could be commanded to search out and manipulate data on their own. The semantic Web, also known as "Web 3.0," is so named because the basic concept requires that everything on it be given meaning in a way that computers can understand.
Brace yourself for some alphabet soup. By 2004, Berners-Lee and his World Wide Web Consortium (W3C) had presented their recommendation to the world for the standards of the semantic Web. The essential building blocks they arrived at include the Web Ontology Language (OWL) and the Resource Description Framework (RDF), the latter a language for representing information about data or resources in a way that computer applications can process. RDF is built on the idea that every last tidbit of information on the Web should be identified with a Uniform Resource Identifier (URI), minted by consortia or even individuals. OWL, which is actually a family of languages built on top of RDF, gives users a way of defining terminology and the relationships between terms in a manner that applications can understand and evaluate.
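To make that concrete: every RDF statement is a triple of subject, predicate, and object, with URIs naming the things being described. The sketch below uses Python's open source rdflib library, and the example.org namespace and gene URIs are invented purely for illustration; real data would use identifiers minted by the relevant consortium.

```python
# A minimal sketch of RDF's triple model, using the Python rdflib library.
# The http://example.org/ namespace and the gene URIs are invented here.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/bio/")

g = Graph()
# Every RDF statement is a (subject, predicate, object) triple.
g.add((EX.BRCA1, RDF.type, EX.Gene))
g.add((EX.BRCA1, RDFS.label, Literal("BRCA1")))
g.add((EX.BRCA1, EX.involvedIn, EX.DNARepair))

# Turtle is a human-readable serialization of the same triples.
print(g.serialize(format="turtle"))
```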
In part thanks to things like RDF and OWL, a common misconception is that the semantic Web requires a top-down uniformity. "The point of semantic Web for me is that you can have a schema that is abstracted from your database — it doesn't have anything to do with your database technology — and the schema can be limited in scope but also integrate and re-use other schemas," says Clark. "You can agree on a common limited schema and it doesn't have to conquer the world."
Miles to go
A complete re-imagining of the way data is tagged and dispersed on the Web carries with it, understandably, some major challenges. One intrinsic problem is that biology itself is a moving target; our understanding changes on an almost daily basis. In the same sense that reality is a shared consensus, so too are the ways in which biologists classify their observations. The very essence of discovery is that observations, concepts, and phenomena, and the vocabulary coined to describe them, often turn out to mean something completely different from what we thought.
Jim Ostell, chief of the information engineering branch at the National Center for Biotechnology Information, believes that in order for a machine to automatically and autonomously determine what information to parse out, the data first has to be reliable at a level that doesn't currently exist. "There's a difference between browsing the Web as a human being where, if you have a node that says 'here's some related information about this concept,' if that related information was a Web page you could read, you're pretty versatile at making sense out of it," Ostell says. "But if you're talking about traversing it computationally, then it's much more challenging to make sure everything means the same thing and that the object that you're getting to on the next path has the same persistence, quality, and structure that you're expecting to operate on."
Ostell says that even the NCBI — a data repository significant enough that one might assume it to be in a position to enforce standards and tame the data beast — still has considerable issues when it comes to reliability and clarity of the data coming through its submission pipeline. He and his colleagues are often required to go back to submitters to figure out what their data really means.
It is this question of persistence, quality, and structure in the data that poses perhaps the biggest challenge for making the semantic Web a reality. "If I was trying to write some kind of 'knowbot' that was going to go out and explore the semantic Web and count up how many people thought something in the human genome was a pseudogene," Ostell says, "I would have to be very careful that we were using the same words and thought that they meant the same thing."
"There's been a lot of work in biology on establishing agreement on things like the names of genes, biological processes, cell types. … There's a lot of work in that area, a lot of investment of people's time and the social process of arriving at these agreements, and that all can be leveraged with this technology," Clark says.
Considerable effort has been put into ironing out these ontology issues with collaborative efforts like the Gene Ontology project and the OBO Foundry. One of the most prominent efforts to implement these standards and get the semantic Web off the ground for the life sciences community is the Shared Names Initiative. This group is spearheaded by leading proponents of semantic Web technology and is currently aimed at providing unique identifiers for publicly available biomedical information records, such as those contained in ENZYME or Pfam. The goal is to devise technical and organizational guidelines that will lead to the adoption of these resource identifiers by bioinformatics initiatives using RDF, which would eventually enable users to grab data from all sorts of sources and manipulate it however they wish.
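The payoff of shared identifiers is easy to demonstrate. In the toy sketch below, two independently published RDF graphs, standing in for records like those in ENZYME and Pfam, merge cleanly because they happen to use the same URI for the same protein; all of the names here are invented.

```python
# Sketch: two independently published RDF graphs merge cleanly when they
# use the same URI for the same record. All URIs here are invented.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/protein/")

enzyme_data = Graph()
enzyme_data.add((EX.P12345, EX.ecNumber, Literal("1.1.1.1")))

pfam_data = Graph()
pfam_data.add((EX.P12345, EX.domain, Literal("ADH_N")))

# Graph union: because both sources used the same name for the protein,
# its EC number and domain annotation now sit in one connected graph.
merged = enzyme_data + pfam_data
for subject, predicate, obj in merged:
    print(subject, predicate, obj)
```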
However, Ostell warns that the process of trying to effectively deal with this biological Tower of Babel may adversely affect the subtle process of discovery. "Trying to standardize English vocabularies describing biological processes … is going to be a long, slow process, because that reflects a consensus of understanding, and in science understanding is an evolving thing and consensus is a chimera," he says.
It's not just the cost and effort of implementation that are stumbling blocks. Some folks just think it's all still too hard. Where is the semantic Web your parents can use? "First of all, the word 'semantic Web' is meaningless to a lot of people. It's a very small group of people that actually understand what the words mean, what the concepts are, what this crazy RDF stuff is. It actually takes a while to understand, and to deploy it is a whole other thing," says Brian Gilman, founder and CEO of Panther Informatics. "You're not going to see the semantic Web until somebody pares down this technology and makes it easier to understand."
Several years ago, Gilman helped launch the W3C's Health Care and Life Sciences Interest Group, a taskforce effort aimed at promoting the use of semantic Web for biology, health care, and translational medicine.
Proof is in the pudding
The semantic Web community recently got a little injection of self-worth when the greatest of all Web equalizers, Google, announced in late March that its search engine was now equipped with some semantic capabilities. Instead of relying solely on keyword associations, Google is now capable of analyzing a search query and providing results based on concepts and associations connected to that query. This semantic improvement to arguably the most used search engine on the planet allows for more natural, language-based searching and, more generally, makes a strong argument for what the Web will look like in the 21st century.
But where are the tools? One group of dedicated software developers at MIT has made serious inroads in developing open source software tools for the semantic Web, many of which can be used as add-ons for Firefox. David Karger, a researcher in the Computer Science and Artificial Intelligence Laboratory at MIT, heads up the Semantic Interoperability of Metadata and Information in Unlike Environments (SIMILE) project, a team of software designers dedicated to developing semantic Web tools. "One of the nice things about the semantic Web is, it's designed to encode information from any domain," says Karger. "It's designed to allow people in the domain to decide what the important information is in their domain and how it should be structured." The group has to date developed more than 20 different software tools designed to take advantage of semantic Web and RDF — the beauty of which is that users do not need to define everything in a database structure because it's already been defined by the RDF language. This allows applications to slice and dice data files from disparate sources and mash them together inside of Web 2.0-like display technologies.
One of Karger's creations, called Exhibit, has been used to create a semantic Web display of kinases and their relationships to certain diseases. Exhibit can take virtually any spreadsheet and turn it into an interactive Web-based map or table with just a small amount of HTML. "Exhibit allows you to combine those things that are talking about the same thing, say a gene or a patient in a study, and if you have additional information coming from another source, it connects it to the same hub," says Eric Neumann, director of the Clinical Semantics Group and co-chair of W3C Healthcare and Life Sciences Interest Group. "For all intents and purposes we can define semantic Web as hubs and spokes — the hubs are the things we want to talk about, and the spokes are all the interesting properties we have, like the gene length. Tools such as Exhibit are able to stitch together these disparate things into a common graph, then the user experience takes off where you don't have to worry about all the semantic connectedness, all you have to do is specify what things you want to look at."
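Under the hood, Exhibit consumes a simple JSON feed of "items" plus a little HTML. As a rough sketch, assuming Exhibit's items-based JSON layout (worth checking against its documentation) and hypothetical file and column names, a spreadsheet export can be converted into such a feed with a few lines of Python:

```python
# Sketch: convert a spreadsheet export into the JSON "items" feed that an
# Exhibit-style viewer consumes. File and column names are hypothetical.
import csv
import json

items = []
with open("kinases.csv", newline="") as f:
    for row in csv.DictReader(f):
        items.append({
            "label": row["kinase"],     # the name the viewer displays
            "type": "Kinase",
            "disease": row["disease"],  # extra properties become facets
        })

with open("kinases.json", "w") as f:
    json.dump({"items": items}, f, indent=2)
```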
Neumann has also put together several demos using semantic Web technology along with publicly available data that has been converted into RDF to show just how powerful this technology can be. One demo provides an interactive Web browser that allows users to query for all genes known to be expressed in the Allen Brain Atlas project, along with basic gene properties such as name, location, and GO function, and to get the result back as one big bundle of RDF. Then, as the parts of the brain atlas that interest the user are found, the browser can bring up the information for the related brain slides and, for every gene that's been tested on them, all the different expression levels, and provide a link directly to the atlas. "So what I'm getting back in that query are the genes, actual locations, and staining per gene, from the Allen Brain Atlas, plus all the complex information which is several dozen columns of information both around the gene and the images. … If I were to put all of that into a spreadsheet, there's not much I can do with [it]," says Neumann. "But our viewer actually takes the result of that query, all that hard work, and rearranges things on a Web page in a visually accessible way with the links going out from them to the Allen Brain Atlas, and the links constructed from them back to Entrez Gene at NCBI, all bundled locally."
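Queries like this are typically written in SPARQL, the W3C's query language for RDF. The sketch below shows the general shape using rdflib; the local file name and the ex: vocabulary are invented stand-ins for the real Allen Brain Atlas and GO namespaces.

```python
# Sketch of the kind of SPARQL query behind such a demo. The local file
# and the ex: vocabulary are invented stand-ins for real namespaces.
from rdflib import Graph

g = Graph()
g.parse("brain_atlas_demo.ttl", format="turtle")  # hypothetical RDF bundle

query = """
PREFIX ex: <http://example.org/bio/>
SELECT ?gene ?region ?function
WHERE {
    ?gene ex:expressedIn ?region ;
          ex:goFunction  ?function .
}
"""

for row in g.query(query):
    print(row.gene, row.region, row.function)
```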
A group of pharmacogenomics researchers recently demonstrated how semantic technology could improve PharmGKB, a major knowledge repository for relationships between drugs, diseases, and genes. "What the semantic Web brings to us is a way to integrate all of our data, and be precise with respect to what we're actually talking about — the real relationships that exist between a protein, a gene, a disease and so on," says Michel Dumontier, an assistant professor at Carleton University in Canada. What Dumontier found was that the genetic variants were not being described in the site's database. For example, if you have a particular variant in a population, what exactly happened? Did the patient die? Such queries could not be answered easily. "What we did was transform their data into a basic ontology and populated it, so that we had a knowledge base which contained all the basic terminology you would expect, like genes and drugs, outcomes, responses, pharmacokinetics, so we could ask about the relationship between these things," Dumontier says. By using OWL, the researchers were able to extend the database and add specific relationships between a variant and drug response.
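In outline, that kind of extension looks like the sketch below: declare classes for the basic terminology, then a property linking a variant to a drug response. The terms are illustrative inventions, not PharmGKB's actual identifiers (though CYP2C9 variants really do affect warfarin response).

```python
# Sketch of extending a knowledge base with OWL: declare classes for the
# basic terminology, then a property linking variants to drug responses.
# All URIs are illustrative, not PharmGKB's actual identifiers.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

PGX = Namespace("http://example.org/pgx/")

g = Graph()
for cls in (PGX.Gene, PGX.Variant, PGX.Drug, PGX.DrugResponse):
    g.add((cls, RDF.type, OWL.Class))

g.add((PGX.influencesResponse, RDF.type, OWL.ObjectProperty))
g.add((PGX.influencesResponse, RDFS.domain, PGX.Variant))
g.add((PGX.influencesResponse, RDFS.range, PGX.DrugResponse))

# Now a concrete assertion can answer "what did this variant do?"
g.add((PGX.CYP2C9_star3, RDF.type, PGX.Variant))
g.add((PGX.reducedWarfarinClearance, RDF.type, PGX.DrugResponse))
g.add((PGX.CYP2C9_star3, PGX.influencesResponse, PGX.reducedWarfarinClearance))
```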
Another successful proof-of-principle is the Neurocommons project, a knowledge management platform that aims to provide a way to integrate a number of biological data sources that otherwise wouldn't play well together. The group used the task of prospecting for Alzheimer's drug targets to demonstrate semantic Web's effectiveness and, in doing so, discovered some useful things. "By expressing data in…OWL [and] RDF, we've found that it's possible to link the knowledge together in a way that allows for data integration in a more seamless fashion than if you have many different relational databases," says M. Scott Marshall, a researcher in the Integrative Bioinformatics Unit at the University of Amsterdam. "One of the keys to that is that the data schema — the metadata, the description of your data — is in the same language as the data itself when you have it in RDF, and that's very useful because it allows you to do things like machine reasoning across it."
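Marshall's point about schema and data sharing one language is what makes machine reasoning straightforward. In the sketch below, which assumes the third-party owlrl package and invented class names, a single RDFS schema statement lets a reasoner materialize a fact that was never asserted directly:

```python
# Sketch of machine reasoning over RDF: because the schema lives in the
# same graph as the data, a reasoner can materialize implied facts.
# Uses the third-party owlrl package; the class names are invented.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS
import owlrl

EX = Namespace("http://example.org/bio/")

g = Graph()
g.add((EX.Kinase, RDFS.subClassOf, EX.Enzyme))  # schema statement
g.add((EX.GSK3B, RDF.type, EX.Kinase))          # data statement

# Apply RDFS inference; the closure adds (GSK3B, type, Enzyme) for us.
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

print((EX.GSK3B, RDF.type, EX.Enzyme) in g)  # True after reasoning
```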
Not the point
Many researchers in pharmaceutical companies are also looking toward semantic Web technologies and the potential of software agent reasoning, even though their "web" doesn't reach beyond the confines of company walls. "Data integration for integration's sake is not the point, and it would be pretty disappointing if that's all that semantic technologies achieved in pharmaceutical research," says Greg Tucker-Kellogg, chief technology officer and senior director of systems biology at the Lilly Singapore Centre for Drug Discovery. "There need to be some applications of reasoning, even simple intelligent agents should be able to leverage semantic integration to relieve some of the tedious hunting and gathering that consumes so much time in the early phases of knowledge work."
Tucker-Kellogg also sees semantic Web technology aiding pharmaceutical research with data integration and accessibility, a persistent challenge given the volume of data from modern technology platforms, the diversity of data in integrative studies, and the importance of incorporating legacy, published, or public domain data in many areas.
The Semantic Web Applications in Neuromedicine (SWAN) project is an initiative aimed at producing a knowledge base for better drug development in neuromedicine. SWAN analyzes statements that scientists make in publications or on the Web and allows users to ask questions about the relationships of these statements to ones made by other scientists and check for consistency. "SWAN gives you a concept map of the space that is … multidimensional because it recognizes that at the leading edge of research there are many disagreements about how to interpret experiments, or if even certain experiments were properly performed," says Clark, also a principal investigator on the project. "It enables you to connect the thinking across the space and see where the gaps and contradictions in theory are in order to get a bird's eye view of the whole space, so it's great for students, it's great for people in pharmas who are moving from one area to another and [have] to get a really rapid picture of the space, and even very experienced researchers, by enabling them to get an idea of the reasoning in the space."
Ultimately, what the semantic Web community hopes to have are applications that will make the complexity of the technology as invisible as possible. "The goal is not for biologists to become experts in RDF and do knowledge representation, but to have tools that will, behind the scenes, transfer information into formats where they will be made available online and used by both collaborators and intelligent agents, like computer programs," says Musen. "I don't think we could expect biologists to do their own programming in these kinds of [semantic Web] languages. ... We'll reach a stage where it will be systems with which bench biologists will interact and take their data and descriptions of those data and publish them in ways which others can take advantage."
Even proponents of semantic Web are aware that it will not solve all the integration problems of the world and that it won't replace a lot of existing formats and standards. "Semantic web shouldn't be a religion, but people always turn technologies into religions. That's how you get the funny notion that every new wave of technology is going to cure the problems of the world, which are the problems that were introduced by the last great wave of technology that was going to cure all the problems of the world," Clark says. "Every wave comes, and people get disillusioned with it, but like the beach, every wave … cleans off a little bit of sand each time."
Acronym Central
It can be tough to sort out the semantic Web with all the acronyms out there. GT breaks down what you need to know.
W3C: World Wide Web Consortium
An international consortium of institutions and individuals focused on creating standards for the Web. It was founded by Tim Berners-Lee in 1994 and aims to promote standards that enable Web interoperability, as well as semantic Web technology.
RDF: Resource Description Framework
A language for representing information about resources on the Web, particularly metadata such as titles or authors, in a format that can be processed by software applications. RDF allows information or data to be exchanged between semantic Web-enabled applications without a loss of meaning.
URI: Uniform Resource Identifier
A string of characters used for identifying a resource on the Web.
OWL: Web Ontology Language
A family of languages used to represent the meaning of terms in vocabularies and the relationships between those terms. OWL comes in three increasingly expressive sublanguages, OWL Lite, OWL DL, and OWL Full, and builds on top of RDF; together, the two are considered the foundation of the semantic Web.