A new project spearheaded by Teranode and Science Commons promises to be the first large-scale publicly available bioinformatics resource built upon semantic web technology. If successful, the effort could drive adoption of semantic web tools in the life science informatics community, where these methods have sparked a good deal of interest, but little buy-in to date.
Matt Shanahan, chief marketing officer of Teranode, described the project as "a proving ground" for "the value that semantic web technologies can bring, the speed at which those things can be developed, and how this can actually tie multiple different laboratories together."
Teranode and Science Commons said last week that they will build a resource called NeuroCommons.org, which will link publicly available data, tools, and open access scientific literature related to neurology research. Teranode's XDA (experiment design automation) software will be used as the backbone for the project, while Science Commons, an offshoot of the non-profit Creative Commons initiative that focuses on copyright issues related to scientific data, will curate the content for NeuroCommons.org. All data for the project will be available in RDF (resource description framework) format, the semantic web's equivalent of HTML.
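For readers unfamiliar with the format, RDF models information as subject-predicate-object "triples" rather than as documents. The sketch below is purely illustrative: the gene, tissue, and vocabulary URIs are invented for the example and are not actual NeuroCommons.org identifiers.

```python
# Illustrative sketch: RDF expresses data as subject-predicate-object
# triples. All URIs here are made up for the example.

def triple(subject, predicate, obj):
    """Serialize one statement in N-Triples, the simplest RDF syntax."""
    return f"<{subject}> <{predicate}> <{obj}> ."

statements = [
    triple("http://example.org/gene/BDNF",
           "http://example.org/vocab/expressedIn",
           "http://example.org/tissue/hippocampus"),
    triple("http://example.org/gene/BDNF",
           "http://example.org/vocab/discussedIn",
           "http://example.org/article/12345"),
]
print("\n".join(statements))
```

Because every statement stands alone, triples from different laboratories can be pooled into one graph without agreeing on a shared document schema first, which is the property the NeuroCommons.org plan depends on.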
While a press release announcing the initiative described NeuroCommons.org as a "repository," John Wilbanks, executive director for Science Commons, said, "I would not call it a repository, I would call it a web, in the same way that the web isn't a database. It may have databases that are attached to it, but it's not itself a database; it's a web, it's a network, and that's what this is going to be."
The primary goal of the project, according to Wilbanks, is to address a problem that has plagued life science informatics for years: "Really what this is about is making it more efficient to reuse the knowledge that other people have created and put into the research commons," he said.
Currently, he added, there are both "legal and technical barriers" to accomplishing that goal, but Science Commons intends to address the former while Teranode addresses the latter. "Science Commons and open access are about the legal barriers, and semantic web is about the technical barriers. So this is a chance to bring those two solutions together and really see what happens," he said.
Teranode's Shanahan said that the partners expect to finish a prototype for the system in the first quarter of 2006, with a public beta released in the third quarter and full public availability in the fourth quarter.
Wilbanks said that the core content for NeuroCommons.org will likely include gene expression information, pathway data, and open access journal articles, but noted that the project has not yet identified specific resources that will be part of the network. "The hope is that people will start to attach data sets" as the project progresses, he said.
Selling the Semantic Web
For Teranode, the project represents an opportunity to showcase its use of semantic web technologies for integrating life science data, a capability that is oft-touted by semantic web proponents but has yet to be proven in the eyes of some observers [BioInform 11-08-04].
Teranode is one of the first firms in the life science informatics sector to embrace semantic web tools as part of its commercial platform, and Shanahan told BioInform that the NeuroCommons.org project could help validate the technology among the broader life science research community.
"As the value and the understanding of the semantic web rises, we would certainly benefit from that, and that's our goal, primarily," Shanahan said.
While Teranode is currently working with four undisclosed customers to deploy semantic web methods behind their company firewalls, Shanahan said that many customers are still wary of the approach.
"Most companies don't come in and ask us for semantic web technologies. Nobody's saying, 'Hey, I want to buy semantic web,'" he said. "In those situations, what we're selling is a novel new application that they haven't been able to buy before. ... Now we have this R&D dashboard, and what happens is we turn around and say, 'By the way, the way we made this work was semantic web.' And so it's really sort of after the fact. That's not their buying criteria."
Some in the industry remain skeptical that the semantic web will deliver on its promise.
Rainer Fuchs, vice president for research informatics and operations at Biogen Idec, wrote in an e-mail to BioInform that "there's certainly quite a bit of 'buzz' around this topic these days, but I don't think it has moved into the mainstream yet, with applications of tangible benefits."
Fuchs added that "20 years of work in heterogeneous data integration in the life sciences have yielded precious few universally accepted standards. The reason for that was certainly not a gap in technologies, but fundamental (and often well justifiable) differences in opinion of the best way to define, describe, and represent concepts in a given area. It beats me why a new set of technologies should suddenly change the human part of this equation."
Fuchs did note, however, that even if semantic web methodologies fall short of their promise for large-scale applications, smaller-scale integration efforts within a single organization could benefit from the technology. "Where overwhelming scientific or commercial interests provide the required impetus for agreement on domain/application-specific ontologies, I can see semantic web technologies become an interesting and important addition to a developer's toolbox," he said.
Robin McEntire, director of knowledge-based systems at GlaxoSmithKline, said that "the semantic web is an area we're very interested in at GSK," particularly for applications in data integration and "lower-level reasoning and inference."
McEntire noted that "simple" RDF utilities, like RSS feeds, have already taken hold within pharma, while other semantic web tools are "a little further away, and we're a little more guarded about committing to those tools." McEntire said that GSK would be interested in pilot studies with research groups or small firms willing to "show me the semantics and show me where it can help my business."
He added that although he is "still skeptical" about certain aspects of semantic web technology, "I think in 2006 we will get more serious about pulling some things in here ... and leveraging some of those tools."
Pierre Lindenbaum, a bioinformaticist at Integragen, said that his company is in an "exploratory phase" regarding the use of semantic web tools. Integragen currently stores microarray data and information from family studies in a relational database, and is thinking about using RDF and the OWL ontology language to exchange data with its customers, he said. However, he noted, generating RDF files "is not a trivial task for a non-technician." In addition, he said, "third-party RDF is difficult to analyze."
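Lindenbaum's point can be illustrated with a sketch: even N-Triples, the simplest RDF serialization, requires custom parsing before any analysis can begin. The toy reader below makes strong assumptions (statements of three bare URIs only, no literals, blank nodes, or escapes) that real third-party RDF, especially RDF/XML, would immediately violate; the data and URIs are hypothetical.

```python
# Minimal, assumption-heavy N-Triples reader: handles only statements
# made of three bare URIs, with no literals, blank nodes, or escapes.
# Real third-party RDF requires far more machinery than this.

def parse_ntriple(line):
    """Split one URI-only N-Triples statement into (subject, predicate, object)."""
    parts = line.strip().rstrip(" .").split()
    return tuple(p.strip("<>") for p in parts)

stmt = ("<http://example.org/gene/BDNF> "
        "<http://example.org/vocab/expressedIn> "
        "<http://example.org/tissue/cortex> .")
s, p, o = parse_ntriple(stmt)
print(s, p, o)
```

The gap between this toy case and arbitrary RDF produced by someone else's pipeline is exactly the analysis burden Lindenbaum describes.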
Kei-Hoi Cheung of Yale University, who developed the YeastHub database (http://yeasthub.gersteinlab.org) using RDF and other semantic web tools, said that the "extra layer of overhead" from the RDF format leads to a "performance bottleneck" in the system. He added that mapping methods for converting relational data to RDF are still too complex for most biologists to use.
The NeuroCommons.org effort comes at a time when the World Wide Web Consortium is reaching out to the life science community as an early adopter for semantic web methods. Last month, the W3C launched the Semantic Web for Health Care and Life Sciences Interest Group (HCLSIG), which will focus on applying semantic web technology to data-integration challenges in the health care and life science industries [BioInform 11-28-05].
The HCLSIG is the W3C's first such domain-centric initiative, and will hold its first face-to-face meeting Jan. 25-26 in Boston.
Shanahan said that a "public reference" like NeuroCommons.org should address many questions that still linger in the informatics community about the promise of the semantic web. He acknowledged that Teranode itself didn't have a "vision and a master plan" to adopt semantic web tools, but only realized the benefits of the approach after developers at the firm "bumped into it."
Teranode was previously storing its data in XML, but "the complexity of life science data makes your XML proprietary," he said. "Search engines don't know how to index it, and things like XSLT don't know how to transform it."
In addition, he said, "XML was designed for message data, so it's hierarchically ordered and linearized. Life science data is graph oriented, so you've got lots of annotations, biochemical pathways, and all this stuff doesn't fit neatly into a message."
RDF, however, "is explicitly designed to go after graph data," and builds on the XML standard, Shanahan said, which enabled the company to change its strategy. "We'll integrate to anything that's RDF-based, so anything that outputs semantic web data. And now, the customers, if they've got a relational database, it's so easy: it's like outputting XML from a relational database, but you just output RDF. So now it suddenly gave us leverage, because not only could we make our data more open, but we gained access and could integrate data from many more sources."
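The relational-to-RDF export Shanahan describes can be sketched in a few lines. This is a hypothetical illustration, not Teranode's implementation: the table schema, column names, and URIs are invented, and a production export would use a proper RDF library and a shared vocabulary rather than hand-built strings.

```python
import sqlite3

# Invented schema standing in for "a relational database" of lab data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (gene TEXT, tissue TEXT, level REAL)")
conn.execute("INSERT INTO expression VALUES ('BDNF', 'cortex', 2.4)")

BASE = "http://example.org/"  # illustrative namespace, not a real vocabulary

def row_to_triples(gene, tissue, level):
    """Emit one relational row as N-Triples statements."""
    subject = f"{BASE}gene/{gene}"
    return [
        f"<{subject}> <{BASE}vocab/expressedIn> <{BASE}tissue/{tissue}> .",
        f'<{subject}> <{BASE}vocab/expressionLevel> "{level}" .',
    ]

triples = []
for row in conn.execute("SELECT gene, tissue, level FROM expression"):
    triples.extend(row_to_triples(*row))
print("\n".join(triples))
```

The mapping is mechanical, one row yielding one triple per column of interest, which is why Shanahan likens it to outputting XML from a relational database.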
Shanahan acknowledged that semantic web technology alone is not yet a selling point, but said that once customers get their hands on the technology and "see the novel applications that can be done, then they are really changing their mind about what it is they can do."
-- Bernadette Toner ([email protected])