Will bioinformatics be a primary driver for the next-generation Internet? According to some semantic web developers, it’s very likely.
The semantic web — an effort to add machine-readable semantics to the existing web architecture to create an environment where software agents carry out tasks for users — is “a technology looking for a problem,” said Dennis Quan, a software engineer in IBM’s Internet technology division. Conversely, he said, bioinformatics “is a problem looking for a technology.” With a growing number of bioinformatics researchers turning to semantic web technologies to automate the painstaking data integration process, the semantic web community may have found its long-awaited killer app.
Launched by the World Wide Web Consortium (W3C) in the mid-1990s, the semantic web project has produced a handful of technologies, but none of them have found widespread adoption. “The reception to this idea in the world at large right now is not that great,” Quan admitted. The slow pickup is due to several factors, including guilt-by-association with failed first-generation artificial intelligence projects and a lack of user-friendly tools. “There’s still a large part of the [semantic web] community that’s using this as a theoretical playground,” Quan said. However, he and others in the field view the project as “a way to enable a large class of applications for people who actually need this stuff today — like biologists.” A core group of semantic web developers now views bioinformatics as a key driver for the nascent technology.
Eric Miller, semantic web activity lead for the W3C, told BioInform that he’s “excited about bioinformatics” because it serves as a “wet lab” for the emerging semantic web infrastructure. “I believe that the technologies that we are working on here can be demonstrably beneficial for the bioinformatics community,” he said.
A Step Beyond XML
Some in the bioinformatics community have already begun to take advantage of semantic web technologies. In particular, the W3C’s RDF (Resource Description Framework) is finding its way into in-house data integration projects at a number of life science organizations.
Beyond Genomics, for example, is using RDF to integrate data and enable annotation sharing and hypothesis publishing between scientists. RDF-based tools that the company has developed so far “are literally helping us drive deals with our pharmaceutical collaborators,” said Eric Neumann, BG’s vice president of bioinformatics.
RDF is structured very much like XML, but “picks up where XML leaves off,” according to Neumann. Just as XML improved upon HTML by describing data as well as displaying it, RDF improves upon XML by defining the semantic relationships between objects. A DTD (document type definition) or schema provides the syntax, or grammar, for an XML file, “but the thing most engineers don’t realize is that grammar does not equal semantics,” Neumann said. If one XML schema requires “gene” to be followed by a gene product, for example, it’s difficult to integrate it with another schema that doesn’t follow the same syntax without modifying the schema or writing additional code.
“What we’re starting to hear from our really big power users is that they’re starting to hit the wall with XML,” said Loralyn Mears, segment manager for life sciences market development at Sun Microsystems. “There are so many data types, and so much of each data type, that the schemas are just no longer manageable.” RDF, she said, offers a higher-level framework, “so instead of having to change the schema each time [you add new information], you just have to change the attributes of the data type. This is obviously much more manageable for a community that doesn’t know what the next data type will be.”
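The point Mears makes can be illustrated with a minimal Python sketch of RDF's underlying data model, in which every fact is a subject-predicate-object statement. This is not code from Sun or Beyond Genomics; the `ex:` identifiers and the `describe` helper are assumptions invented for illustration. The sketch shows that a previously unknown kind of data is recorded by adding statements, not by revising a schema.

```python
# Each fact is a (subject, predicate, object) triple, the basic RDF data model.
# All "ex:" identifiers are hypothetical and exist only for this illustration.
triples = {
    ("ex:geneA", "ex:type", "ex:Gene"),
    ("ex:geneA", "ex:encodes", "ex:proteinA"),
}


def describe(subject, store):
    """Return a dict mapping each predicate to a sorted list of its objects."""
    out = {}
    for s, p, o in store:
        if s == subject:
            out.setdefault(p, []).append(o)
    return {p: sorted(objs) for p, objs in out.items()}


# A new kind of data arrives (tissue expression). Nothing about the existing
# statements has to change; the new facts are simply added to the store.
triples.add(("ex:geneA", "ex:expressedIn", "ex:liver"))
triples.add(("ex:liver", "ex:type", "ex:Tissue"))

summary = describe("ex:geneA", triples)
print(summary)
```

In an XML setting, the equivalent change would typically mean editing the DTD or schema so that the new element is allowed wherever it appears.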
The relationships that RDF describes are provided by a controlled vocabulary or ontology. Several life science ontologies, including GO, are currently being recast in RDF. One of the advantages of RDF, according to Quan, “is that it allows different ontologies to co-exist…describing multiple connections in different domains between different objects.”
In practice, “the ontology is automatically part of the engine for data integration,” Neumann said. “I can say this piece of data from GenBank and this piece from EMBL in RDF are equivalent — they may not be identical data structures, but they’re talking about the same gene.”
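A rough sketch of the kind of equivalence Neumann describes might look like the following Python. The record identifiers, the `ex:sameGeneAs` predicate, and both helper functions are hypothetical; real systems would take the equivalence relation from an ontology and use an RDF toolkit, not plain tuples.

```python
# Hypothetical predicate asserting that two records describe the same gene.
SAME_GENE = "ex:sameGeneAs"

# Two records with different structures, plus one statement linking them.
# The identifiers are invented and do not refer to real database entries.
triples = [
    ("genbank:REC1", "ex:sequenceLength", "1200"),
    ("embl:REC9", "ex:organism", "Homo sapiens"),
    ("genbank:REC1", SAME_GENE, "embl:REC9"),
]


def equivalents(node, store):
    """Collect every node linked to `node` by SAME_GENE, in either direction."""
    seen = {node}
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for s, p, o in store:
            if p != SAME_GENE:
                continue
            if s == current:
                neighbor = o
            elif o == current:
                neighbor = s
            else:
                continue
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return seen


def merged_view(node, store):
    """Gather (predicate, object) pairs from all records equivalent to `node`."""
    group = equivalents(node, store)
    return {(p, o) for s, p, o in store if s in group and p != SAME_GENE}


view = merged_view("embl:REC9", triples)
print(sorted(view))
```

Querying either record returns the facts from both, even though neither source had to adopt the other's data structure.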
Life science firms like Beyond Genomics aren’t the only ones putting RDF to work. Sun, for example, has identified RDF as a key technology in its knowledge management strategy for the life science market, Mears said. The company has launched an in-house project called swoRDFish to manage its corporate-wide digital information. The system includes a controlled vocabulary, organizational classifications, business rules, and a core set of metadata tags. Mears said that the project has been extremely successful within Sun and that the company is considering a commercial version of the system. Sun is in discussions with some of its life science partners to use a modified version of the swoRDFish model in conjunction with an RDF-based search engine the company developed, Mears said.
At Sun’s next Life Sciences Advisory Council meeting, scheduled for Nov. 9-10, “we’re going to be asking the community, ‘How important is RDF, could swoRDFish be useful to you, and how would you use it?’” Mears said. Sun is also in “advanced discussions” with some pharmaceutical companies for pilot projects that would implement RDF and the search engine as part of their knowledge management systems.
MIT has also turned to RDF for its Haystack project, a desktop client that unifies data and applications into a single browser. Quan, who works on the Haystack development team in addition to his work at IBM, said the six-year-old project has been “rewritten from scratch using RDF” over the last two years. “An ongoing experiment has been trying to adapt a system like Haystack to become some sort of biological workbench, so that biologists can not only browse the existing space, but start structuring and creating information in the system itself,” Quan said.
The UK’s MyGrid project is also using RDF as part of its plan to merge the semantic web and grid computing to support bioinformatics research [BioInform 10-07-02].
Additionally, the I3C is eyeing RDF as its standard metadata specification. Quan, who worked on IBM’s LSID implementation for the I3C, said his team is using RDF for some prototype tools based on LSID. Just as the LSID specification is standardizing naming conventions for biological information, Quan said RDF can be used to standardize the connections and relationships between those pieces of information. “Now that we can name all this stuff in a consistent fashion, the semantic web will be able to describe the connections between those things in a consistent fashion,” he said.
Quan said he has also written scripts to convert some of the NCBI’s XML formats into RDF, and has started to browse them using prototype clients. “As the LSID standard becomes more widely adopted, we’re going to see a lot more demand for clients that can use this stuff,” he said.
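Quan's scripts are not described in detail, so the following Python is only a sketch of what an XML-to-RDF conversion could look like in principle. The XML record is invented and is not an actual NCBI format, `example.org` is a placeholder authority, and the `ex:` predicates are made up; the only convention borrowed from the source material is the LSID naming pattern of `urn:lsid:` followed by authority, namespace, and object identifier.

```python
import xml.etree.ElementTree as ET

# An invented record for illustration; this is NOT a real NCBI XML format.
XML_RECORD = """
<gene id="1234">
  <symbol>ABC1</symbol>
  <organism>Homo sapiens</organism>
</gene>
"""


def xml_to_triples(xml_text, authority="example.org"):
    """Turn one flat XML record into (subject, predicate, object) triples.

    The subject is an LSID-style name built from the root element; each
    child element becomes one statement about that subject.
    """
    root = ET.fromstring(xml_text.strip())
    subject = f"urn:lsid:{authority}:{root.tag}:{root.get('id')}"
    return [
        (subject, f"ex:{child.tag}", (child.text or "").strip())
        for child in root
    ]


gene_triples = xml_to_triples(XML_RECORD)
for triple in gene_triples:
    print(triple)
```

Once records are expressed this way, statements from different sources that use the same LSID-style name refer to the same object and can be combined without reconciling their original XML layouts.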
Not a Silver Bullet
Early adopters are finding that RDF can solve a number of bioinformatics integration problems that XML can’t, “but it doesn’t fix all the problems, and it won’t push XML out,” Neumann noted. In cases where data must be handled in a very structured manner, such as when it comes directly off a lab instrument, XML DTDs and schemas work fine, he said, “but the minute you try and put in information about your experimental design, how you analyze the data, how you interpret it…forget the XML — go to RDF.”
Quan stressed that Haystack and the I3C prototypes are still research projects. “I don’t think RDF is something you’re going to see on every biologist’s desktop tomorrow or next month,” he said. “I think this is a year or two down the line.”
Right now, Miller said, the W3C is just wrapping up the RDF standard and gearing it up for deployment. “It’s really taking these technologies out for a test drive, stress-testing them, learning from them, and in turn shaping future work in the standards area…We learn as we go, but it’s groups like bioinformatics that make that happen, frankly.” Miller added that feedback from the bioinformatics community would be crucial to the standard’s development. While pleased with the number of in-house projects he’s seeing in the life sciences sector, he called for “more open collaboration” to help advance the technology.
Noting that RDF “hasn’t been proven” yet, Mears said that’s exactly why Sun is pushing for pilot projects among its life science partners and customers. “There’s no better acid test than the life science industry,” she said.
Semantic Web and RDF Resources
- W3C’s RDF web page: http://www.w3.org/RDF/
- W3C’s Semantic Web page: http://www.w3.org/2001/sw/
- Beyond Genomics’ RDF Pathway demo: http://beyondgenomics.com/frodo/rdf-biopath.htm
- IBM’s LSID page with information on RDF: http://oss.software.ibm.com/developerworks/opensource/lsid/?Open&ca=daw-ws-dr
- MyGrid Home Page: http://mygrid.man.ac.uk/
- MIT’s Haystack: http://haystack.lcs.mit.edu/index.html