Consider this scenario: As Jane and David walk over to a lab computer one morning, she says, "I uploaded the description of our sequencing project of chromosome 7 last night — including the FISH data. Let's see what my agent found for us." She opens her Web browser and starts reading the information. "Interesting," she says. "It looks like the Cancer Genome Anatomy Project just uploaded some new information about the ABC transporter family, and there are two new papers that have published results for its role in multi-drug resistance — which it tells me is abbreviated MDR. It also overlaid all the haplotype blocks from the HapMap project and provided a link to it on the Web."
David points to the screen and scratches his head. "This is curious. The agent suggested some people to contact that have clinical data associated with ABCB1 along with their papers and phone numbers. How'd it get that?" Jane replies, "Must have learned something new last night. Let's download the papers and see what they have to say. Then we can make the phone calls."
The events described in this scenario are virtually impossible with today's technology. Humans are very good at pattern matching, classification, and semantic matching of free text; computers are not. In fact, computers are quite poor at deriving meaning from Web pages and other types of free text. However, a technology called the Semantic Web could help make this scenario a reality in the not-so-distant future.
Today, it seems that we can use the Internet to find just about anything. In the daily business of science, we can search for a paper describing a specific gene or protein like ABC. We usually bring up a search engine interface like Google or Yahoo!, type in a few keywords, and receive a list of Web pages matching our criteria. Search engines do a surprisingly good job of finding Web resources that match the query string entered.
However, not all Web searching is seamless — especially when the topic you are exploring is not in your specialty or is poorly indexed. As scientific complexity and scope grow, the need for cross-disciplinary teams from different specialties and subareas has become increasingly important to the discovery effort.
As we all know, every discipline uses its own jargon. Researchers in disparate fields often use the same word to describe different entities, or completely different sets of words and acronyms to describe the same topic. In our opening example, a search for "ABC" and "oncogene" yields a hit for Aneurysmal Bone Cyst rather than the gene symbol ABC. Another example confounds the issue further: other efflux pumps of the mammalian cell membrane in the ABC superfamily include multidrug resistance-associated proteins (MRP) and breast cancer resistance proteins (BCRP; mitoxantrone resistance proteins, MXR). Other than the fact that these resistance proteins belong to the ABC superfamily, they are quite different with respect to gene locus, amino acid sequence, structure, and substrate. It is not hard to imagine people new to the field getting lost in the acronyms or following links that, while informative, do not answer their original question.
The Web is even more limited when one wants to integrate, save, or summarize a collection of information. Bookmarks help, but do not allow information found on one website to be integrated with information from another. Clearly, current technology is insufficient to satisfy the scientific enterprise where these activities take place on a daily — and sometimes hourly — basis.
That's why scientists and engineers at the World Wide Web Consortium, or W3C, are trying to tackle this problem head-on. They are part of a group called the Semantic Web for Life Sciences, and they're using new technologies that may bring the scenario described at the beginning of this article to fruition. (Full disclosure: I am a founder of the W3C Semantic Web for Life Sciences working group.)
The Semantic Web is described as an extension of the current Web in which information produced for human consumption is also given well-defined, machine-readable meaning. Documents can then be linked together so that a computer can work out how the terms found in one Web page relate to those in another. This seemingly Herculean feat is accomplished using technologies called OWL, the Web Ontology Language, and RDF, short for Resource Description Framework. These two technologies allow a person (or more likely a computer) to express an entire vocabulary of terms in machine-readable format. The National Cancer Institute, for instance, has expressed its entire MetaThesaurus in the OWL language. This format allows a computer to find a precise definition of ABC with regard to disease, organism, superfamily, locus, and cellular function, and to relate this information to other terms used on the Web.
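To make the idea of machine-readable meaning a little more concrete, here is a minimal sketch in plain Python. It stores a few term definitions as subject-predicate-object triples, which is the data model RDF is built on, and uses them to tell apart two meanings of the label "ABC". This is not real RDF or OWL syntax, and every identifier and predicate name in it (ex:ABC_superfamily, ex:AneurysmalBoneCyst, "label", "type", "expansion") is a hypothetical stand-in chosen for illustration.

```python
# Toy vocabulary stored as (subject, predicate, object) triples.
# All identifiers and predicate names are hypothetical, for illustration only;
# a real vocabulary would be published in RDF/OWL with proper URIs.
VOCABULARY = {
    ("ex:ABC_superfamily", "label", "ABC"),
    ("ex:ABC_superfamily", "type", "protein superfamily"),
    ("ex:ABC_superfamily", "expansion", "ATP-binding cassette"),
    ("ex:AneurysmalBoneCyst", "label", "ABC"),
    ("ex:AneurysmalBoneCyst", "type", "disease"),
    ("ex:AneurysmalBoneCyst", "expansion", "aneurysmal bone cyst"),
}


def values_of(subject, predicate):
    """Return every object recorded for the given subject and predicate."""
    return {o for s, p, o in VOCABULARY if s == subject and p == predicate}


def resolve(label, wanted_type):
    """Return the identifiers that carry this label and have the wanted type."""
    candidates = {s for s, p, o in VOCABULARY if p == "label" and o == label}
    return sorted(s for s in candidates if wanted_type in values_of(s, "type"))


print(resolve("ABC", "protein superfamily"))
print(resolve("ABC", "disease"))
```

A keyword search sees only the three letters "ABC"; a program that can consult definitions like these can ask which "ABC" is meant before it integrates anything.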
These new documents will allow the Web to become a very large, distributed knowledge base. You can think of it as a large Wikipedia that computers consult to look up the meanings of terms and make sure the information they are integrating is contextually appropriate. For example, in our agent-based scenario above, Jane and David's research proposal for chromosome 7 may have included a machine-readable description of the term "chromosome." This definition might include the coordinates they were going to sequence: 7q21. Another researcher would then be free to link to this definition and augment the information provided with new data, such as the fact that this is the locus for ABCB1. A computer program that understood these definitions could then parse the information provided, search PubMed for ABCB1, and attach a paper on multi-drug resistance. Researchers who come to the site would then be immediately informed that this protein family has something to do with multi-drug resistance, even though no individual piece of information stated that fact.
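The chain of reasoning in that example can be sketched in a few lines of plain Python. The three small sets of triples below stand in for statements published independently by Jane and David, by another researcher, and by a literature index. All identifiers and predicate names are hypothetical, and the literature is a hard-coded set rather than a real PubMed search.

```python
# Three independently published sets of (subject, predicate, object) triples.
# Every identifier and predicate name here is hypothetical.
PROPOSAL = {("ex:JaneDavidProject", "sequencesRegion", "7q21")}
ANNOTATION = {("ABCB1", "hasLocus", "7q21")}
LITERATURE = {  # stand-in for a literature search; no real PubMed call is made
    ("ex:paper1", "mentionsGene", "ABCB1"),
    ("ex:paper1", "hasTopic", "multi-drug resistance"),
}


def subjects(graph, predicate, obj):
    """Return every subject linked to obj by predicate."""
    return {s for s, p, o in graph if p == predicate and o == obj}


def objects(graph, subject, predicate):
    """Return every object linked from subject by predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def related_topics(project):
    """Follow region -> gene -> paper -> topic across the merged graph."""
    graph = PROPOSAL | ANNOTATION | LITERATURE
    topics = set()
    for region in objects(graph, project, "sequencesRegion"):
        for gene in subjects(graph, "hasLocus", region):
            for paper in subjects(graph, "mentionsGene", gene):
                topics |= objects(graph, paper, "hasTopic")
    return sorted(topics)


print(related_topics("ex:JaneDavidProject"))  # ['multi-drug resistance']
```

None of the three sources mentions both the project and multi-drug resistance; the connection appears only once the statements are merged and the shared terms (7q21, ABCB1) are followed from one to the next.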
So far, it seems that bioinformaticists and bench scientists are either unaware of the W3C effort or have not seen the benefits of the technology. I hope that this brief introduction shows the utility of the Semantic Web and will give you the impetus to perform your own research into the technology to see where it fits in with your research efforts.
Brian Gilman is president and founder of Panther Informatics, a computational biology consulting service company. Previously, he was group leader in the medical and population genetics department at what is now the Broad Institute. http://www.w3.org/2001/sw/hcls/