The World Wide Web Consortium is still in the early stages of developing the next generation of the Internet — the semantic web — but some tools coming out of the effort have already been adopted by the bioinformatics community to help manage complex data from multiple sources.
Recognizing this emerging user base, the W3C hosted a workshop in late October in Cambridge, Mass., to encourage further adoption and learn a bit about the life science community’s unique needs.
“The goal was really to reach out and get as many representatives from the life sciences community that were doing semantic web or semantic web-related work in as many ways as possible, and to try and bring all that together and connect the dots,” said John Wilbanks, a W3C fellow who helped organize the workshop.
“One of the things that we really didn’t know when we started putting this workshop together was just how big the momentum of the movement was,” he said. Indeed, with more than 100 people in attendance, organizers said they had to turn people away.
As several attendees noted, the turnout was higher than at any meeting held so far by any of the life science standards bodies, such as the Interoperable Informatics Infrastructure Consortium or the Object Management Group’s Life Science Domain Task Force.
And, according to W3C officials, semantic web technology will address many of the data-integration challenges that these groups have tried to grapple with, only to fail.
Eric Miller, semantic web activity leader for W3C, described the semantic web as “data integration at web scale,” via a framework that enables the creation, storage, and retrieval of machine-processable data. Unlike the Internet today, in which queries stop at the file level, “we want to be able to link data in the same way we link documents on the web now,” he said.
Daniel Weitzner, technology and society domain leader at the W3C, said that among a number of things that he and his colleagues learned at the workshop, one of the most important was that “the quality of data integration tools — both inside the enterprise and across the different parts of the life science research community — really are terrible. And they’re terrible in a way that is imposing fundamental impediments to the progress of research in life sciences.”
W3C’s mission for the workshop, he said, was “both to understand what those impediments are about, but also to try to understand what W3C might be able to contribute to helping alleviate some of those barriers.”
The exercise is ultimately expected to benefit the broader semantic web development effort underway at the W3C, Weitzner said. The life science community “is really on the verge of becoming one of the major constituents of semantic web technology in general,” he said, “so I would certainly expect requirements in this community to have an impact on where the specs go overall.”
Weitzner said that W3C will release a report on the workshop some time in the next few weeks that will “document the issues raised” as a basis for further discussion. He said that W3C has already identified three main areas where it can help: ontologies and core vocabularies, life science identifier mechanisms, and support for early adopters of semantic web technologies.
In the case of ontologies, he said, W3C wouldn’t recreate domain-specific resources such as the Gene Ontology, NCI’s Enterprise Vocabulary Services, or other life science-controlled vocabularies. “What we’re looking to do is to identify a couple of vocabularies that could help stitch those activities, stitch those research results together,” he said.
In the area of life science identifiers, Weitzner said that he and his colleagues are taking a close look at the LSID proposal that was developed by the I3C and is now working its way through the OMG’s standardization process. “There were lots of questions raised about the LSID,” he said. “Some people wanted just to get on with it and use it, and other people [were] not quite sure what they would use it for, or how implementation would really work on the global scale.”
Weitzner added that the subject of identifiers “is something that the web community has a lot of experience in and a lot of interest in the architectural choices made there.”
Finally, he said, “We saw interest in a way to get together … implementers to help them to develop best practices about how life science applications could be developed.”
Users are Interested, but Data is Scarce
Members of the press were not permitted at the workshop, but attendees who spoke with BioInform were enthusiastic about what they saw and heard.
“I think the semantic web actually is a good solution for the data-integration problem,” said Ted Slater, an associate research fellow at Pfizer. “But I think it’s better than that.” Ultimately, he said, the technology should be able to allow scientists to “use machines to make the proper inferences for us in order to interpret data” and generate hypotheses.
But despite Slater’s optimism, semantic web technology is still too unproven for companies like Pfizer to embrace at the corporate level.
“There’s a serious danger of people looking at this and saying, ‘No way — it’s so far off, and you can’t do anything practical.’ But you can,” he said.
Slater cited recent developments like the release of Uniprot in RDF format [BioInform 07-26-04] as important examples of progress in the field, but he said that there will have to be more such instances before adoption picks up within pharma.
“There’s not a lot out there right now” in terms of semantic web-ready data resources, he said.
This may continue to be a bottleneck. Eric Jain, the Swiss Institute of Bioinformatics developer who generated the RDF version of Uniprot, said that it would be a “tremendous boost for the semantic web” if the major data providers would begin distributing their data in RDF format, “but I don’t see this happening until there is a strong demand from users — which I expect to arise as soon as more powerful and easy-to-use tools become available.”
Jain said that generating RDF files is “not any more complicated than generating XML files with a custom schema — but you do need to be familiar with a few additional W3C specifications.”
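To give a sense of the effort Jain describes, here is a minimal sketch in Python that writes flat records as N-Triples, a line-oriented RDF serialization. The URIs, field names, and the record itself are hypothetical placeholders chosen for illustration. They are not Uniprot identifiers, and this is not the method Jain used.

```python
# Minimal sketch: serialize simple records as N-Triples, one RDF statement per line.
# BASE, VOCAB, and the sample record are hypothetical placeholders.

BASE = "http://example.org/protein/"
VOCAB = "http://example.org/vocab#"


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes, and line breaks for an N-Triples string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_ntriples(record_id: str, fields: dict) -> list:
    """Turn one flat record into a list of N-Triples lines."""
    subject = f"<{BASE}{record_id}>"
    lines = []
    for name, value in fields.items():
        predicate = f"<{VOCAB}{name}>"
        lines.append(f'{subject} {predicate} "{escape_literal(str(value))}" .')
    return lines


example = record_to_ntriples(
    "P00001", {"name": "Example protein", "organism": "Homo sapiens"}
)
for line in example:
    print(line)
```

Each output line is a complete subject, predicate, object statement, which is roughly the extra structure a data provider has to commit to beyond a custom XML schema.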
Melissa Cline, a bioinformatics scientist at Affymetrix who has released some of her research data in RDF, said there are some “scalability issues” with current semantic web tools. “RDF isn’t a tool for everything; to make best use of it currently, you need to keep your data down to modest bite sizes,” she said. “When I distributed my research data, I partitioned it down into a number of files. In retrospect, I should’ve made that number even larger, as I have had reports of people running out of memory while working on this data.”
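The partitioning Cline describes can be sketched as simple fixed-size chunking. The article does not say how she actually split her data, so the following is only an illustration. It assumes that each statement is a self-contained line (as in N-Triples), and the function names and the file-name prefix are hypothetical.

```python
from itertools import islice


def partition(statements, max_per_file):
    """Yield successive lists holding at most max_per_file statements each."""
    if max_per_file < 1:
        raise ValueError("max_per_file must be at least 1")
    iterator = iter(statements)
    while True:
        chunk = list(islice(iterator, max_per_file))
        if not chunk:
            return
        yield chunk


def write_partitions(statements, max_per_file, prefix="dataset_part"):
    """Write each chunk to its own file and return the file names (hypothetical naming)."""
    names = []
    for index, chunk in enumerate(partition(statements, max_per_file)):
        name = f"{prefix}_{index:04d}.nt"
        with open(name, "w", encoding="utf-8") as handle:
            handle.write("\n".join(chunk) + "\n")
        names.append(name)
    return names
```

A smaller `max_per_file` produces more files, each of which needs less memory to load, which is the trade-off Cline says she would push further in retrospect.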
Nevertheless, she said, RDF offers a key advantage for complex biological information. The dataset that Affy released “involves the intersection of two complex phenomena, [so] we needed to have an ontology available to describe the data. RDF allows an ontology to be defined and distributed in the same file as the data, in a computable form. … At the end of the day, an ontology that can be computed on is just more useful,” she said.
But users shouldn’t get their hopes up that the major database providers will be generating RDF versions of their resources any time soon. Jim Ostell, chief of the information engineering branch at NCBI, told BioInform that “NCBI has no plans internally to implement a semantic web technology like RDF with OWL. In addition, we have had no requests from the user community for this technology either.”
Wilbanks said that the W3C is addressing this issue by “working on some approaches that don’t require the conversion of databases into RDF in order to use those databases in the semantic web.” One such technology, a query language under development called Algae, allows users to write their queries in RDF, but have the queries execute inside an SQL database.
“This, in a sense, lowers some of the adoption requirements,” Wilbanks said, adding that while there are some “core databases” that one might want in RDF, “there is an enormous benefit to keeping some types of information in relational databases, where there has been 20, 30 years of research and optimization. The key is actually to be able to go from RDF to SQL and back.”
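The general idea of answering an RDF-style query from a relational database, without converting the data, can be sketched as follows. This is not Algae's syntax or implementation. It is a simplified illustration in Python with SQLite, and the predicate URIs, the table layout, and the function names are all assumptions made for the example.

```python
import sqlite3

# Hypothetical mapping from predicate URIs to (table, key column, value column).
PREDICATE_MAP = {
    "http://example.org/vocab#name": ("protein", "id", "name"),
    "http://example.org/vocab#organism": ("protein", "id", "organism"),
}


def pattern_to_sql(predicate, obj=None):
    """Translate the pattern (?s, predicate, obj or ?o) into SQL text and parameters."""
    # Table and column names come only from the trusted mapping above;
    # the object value is passed as a bound parameter.
    table, key, column = PREDICATE_MAP[predicate]
    sql = f"SELECT {key}, {column} FROM {table} WHERE {column} IS NOT NULL"
    params = []
    if obj is not None:
        sql += f" AND {column} = ?"
        params.append(obj)
    return sql, params


def run_pattern(connection, predicate, obj=None):
    """Execute the pattern inside the database and return (s, p, o) triples."""
    sql, params = pattern_to_sql(predicate, obj)
    return [(row[0], predicate, row[1]) for row in connection.execute(sql, params)]


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE protein (id TEXT PRIMARY KEY, name TEXT, organism TEXT)"
)
connection.executemany(
    "INSERT INTO protein VALUES (?, ?, ?)",
    [
        ("P00001", "Example protein A", "Homo sapiens"),
        ("P00002", "Example protein B", "Mus musculus"),
    ],
)
matches = run_pattern(connection, "http://example.org/vocab#organism", "Homo sapiens")
print(matches)
```

The rows stay in the relational store and are exposed as triples only at query time, which is one way to read Wilbanks's point about going from RDF to SQL and back.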
Wilbanks added that he and his colleagues at the W3C expect “that as the NCI continues to use semantic web technologies, as you start to see people at the NIH and people at the National Library of Medicine see the benefit of semantic web approaches … that we’ll see a combination [of formats], of making RDF yet another option for downloading.”
Some Ready-to-Use Semantic Web Technologies:
- Protégé: Semantic web ontology editor: http://protege.stanford.edu/.
- Haystack: Semantic web browser: http://haystack.lcs.mit.edu/.
- Urchin: RSS aggregator based on semantic web standards: http://urchin.sourceforge.net/index.html.
- SPARQL: Query language for RDF: http://www.w3.org/TR/2004/WD-rdf-sparql-query-20041012/.
- Algae: RDF query language whose queries can execute inside an SQL database: http://www.w3.org/1999/02/26-modules/User/Algae-HOWTO.html.
- See BioInform 07-26-04 for additional semantic web tools and resources.