Text mining is not a new concept, but a collaborative thesaurus project designed to bring together as many as 20,000 bioinformaticists and medical researchers by compiling information from myriad abstracts and scientific data is, at the very least, ambitious.
Aaron Cohen and William Hersh of the Oregon Health and Science University’s Department of Medical Informatics and Clinical Epidemiology describe the wiki-style approach to thesauri management and collaboration in their “Proposal for Creation of a Shared Biomedical Thesaurus Management Resource.”
The thesaurus will be based on a relational model designed to connect not only disease names but also their types and variations, and will be used by some 10,000 to 20,000 scientists and researchers around the world, according to Phoebe Roberts, Biogen’s associate director of managing analytics, and William Hayes, head of the library and literature informatics department, who are currently shopping around the proposal.
The goal of the project is to discover the relationships, elucidate the gene regulatory networks, and establish the genotype/phenotype associations and potential therapeutic mechanisms for drugs. These relationships might include diseases associated with neurology and immunology, human proteins and mammalian proteins in general or in specific, among other criteria.
Hayes said that if the proposal comes to fruition, software and other computer needs will be overseen by Cohen’s group.
“We plan to use the thesauri generated from this collaborative project to extend thesauri that we license, while building completely new thesauri as needed for use in Linguamatics I2E, Inforsense KDE and other [as yet] unnamed analytics and internal text analytics projects and internal text search engines,” Hayes wrote in an e-mail to BioInform.
He added that Biogen sought to become involved because “there is too much work to be done by any one company. There [is] a wealth of resources available to us, but which require a great deal of work to access and manage.
“The ontology projects are providing great semantic structures for many terminologies we are interested in, but the synonyms needed to use them in text mining are spread amongst several different resources (UMLS from the NIH NLM, gene/protein synonyms in the sequence databases such as Entrez, Swissprot, internally developed synonyms, etcetera),” Hayes said.
He said there are “several” commercial tools to manage thesauri, but “none that are available for a wiki-style approach to thesauri management and collaboration.” He said that Biogen can purchase “some very nicely developed” ontologies incorporating synonyms from vendors like Biowisdom and L&C Computing, “but it's not easy to extend them, and in general, they are tightly restricted.”
Cohen, who said it took around six months to come up with the thesaurus, was motivated to propose the project because, he said, he didn’t see an easy way for biologists to access the information they needed.
For example, he said, MeSH has “lots of terms for different diseases, but other things are very problem specific and haven’t been formalized into any type of thesaurus or terminology. You run into biologists needing things you don’t have; it’s hard to identify information if you don’t even know the kinds of things you need information about,” he said.
His group has also been talking with Biogen’s Roberts and Hayes about “enhancing the collaboration between academia and industry, trying to find out where the common ground is, to find things that are interesting and fruitful for us to work on from an academic point of view and encourage participation from industry,” Cohen added.
Focusing on this idea of a collaborative thesaurus, Hayes and Roberts have targeted at least 15 members of a prospective consortium, a figure that Cohen says is somewhat flexible. So far, three are on board, he said, and they hope to get it up and running in two months, though three months is a “reasonable figure,” they told him.
“Currently, one of the limiting factors in applying NER [named-entity recognition] for the purposes of text mining is the lack of availability of thesauri in the domains of interest,” Hayes said. “The creation of topic-specific thesauri is a labor-intensive task that depends upon the participation of human domain experts.”
Roberts tied it back to specific examples, such as “the ability to get a landscape when there is just too much literature to manage. … You are looking at the whole forest and not just each individual tree, [such as] what kinds of cells is my drug target expressing?” she said. “You [also] have to understand where to look for potential side effects.”
Roberts explained that there are “so many ways of saying things, so to actually wrap it all up and provide a comprehensive picture [is not necessarily] that easy to do.”
“When you run a search, if you use one particular term and the engine doesn’t include all the synonyms, you will only get a portion of the results,” Hayes added.
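The effect Hayes describes can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `THESAURUS` entries, the sample `DOCS`, and the helper names `expand_query` and `search` are hypothetical and do not come from the proposal or from any vendor's software. Matching is plain substring matching, which a real search engine would replace with proper tokenization.

```python
# Hypothetical thesaurus: each canonical concept maps to a set of synonyms.
# The entries are illustrative only and not drawn from any real resource.
THESAURUS = {
    "tumor necrosis factor": {"tnf", "tnf-alpha", "cachectin"},
}

# Hypothetical mini-corpus standing in for a collection of abstracts.
DOCS = [
    "TNF-alpha levels rose after treatment.",
    "Cachectin was measured in serum samples.",
    "Tumor necrosis factor inhibitors were compared.",
]


def expand_query(term, thesaurus):
    """Return the term plus every synonym that shares its concept."""
    needle = term.lower()
    expanded = {needle}
    for canonical, synonyms in thesaurus.items():
        group = {canonical} | synonyms
        if needle in group:
            expanded |= group
    return expanded


def search(documents, term, thesaurus=None):
    """Return indices of documents mentioning the term.

    When a thesaurus is supplied, a document also matches if it
    mentions any synonym of the term (naive substring matching).
    """
    terms = expand_query(term, thesaurus) if thesaurus else {term.lower()}
    return [
        i for i, text in enumerate(documents)
        if any(t in text.lower() for t in terms)
    ]


without_synonyms = search(DOCS, "cachectin")
with_synonyms = search(DOCS, "cachectin", THESAURUS)
print(without_synonyms, with_synonyms)
```

In this toy corpus, the search without the thesaurus finds only the one document that uses the exact term, while the expanded search finds all three, which is the "portion of the results" problem in miniature.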
When asked how successful he thinks the project will be, Phil Hastings, director of business development with Linguamatics, said “it’s difficult for us to answer in that level of detail, but I would say a project of this nature is certainly viable.”
Hastings added that with their software, customers are “able to plug in domain knowledge to add to the NLP and complement the text mining capabilities in the form of thesauri and vocabularies.”
Such software would undoubtedly be tapped in a successful rollout of the thesaurus, but what about other technology?
When asked, Cohen said that load capacity will grow as demand increases. Initially, the project is slated to start with one server or a small collection of servers, “running essentially a website that would allow you to edit these thesauri – like a Wiki or MySpace page. It would be much more focused on the needs of the community … and allow the entering of and review of information by others.”
He added that the consortium “needs to decide whether people who are members will have access or not. … I think the goal is to make this more widely available. Certainly, there will be benefits to being a member.”