Finding a needle in a haystack, or a crucial piece of information in the rapidly growing biomedical literature, can be a challenge. But life-science researchers in the UK will soon get some help: Over the next two years, the newly founded National Center for Text Mining, or NaCTeM, plans to bring online a variety of text-mining tools and training services.
The Manchester-based center, which officially launched in June, is funded with £1 million ($1.87 million) over three years from three government funding agencies. It claims to be the first publicly funded text-mining center in the world, although similar efforts are currently underway in Germany and Japan, according to center officials.
While the center will eventually branch out into other fields that can benefit from text-mining technology, such as the humanities, the social sciences, and engineering, the target user base for the first three years of the effort will be the life sciences. “The biomed domain is technically very dense and difficult to analyze. If you can interpret and analyze biomedical texts, you can perfectly analyze any [type of text],” said Ben Stapley, a bioinformaticist at the University of Manchester and a member of the center’s steering committee.
The text-mining center, set up by a consortium of the University of Manchester, the University of Liverpool, and the University of Salford, is still in set-up mode. It is searching for a director, who will also serve as chair of text mining at Manchester, and expects to move sometime next summer into the Manchester Interdisciplinary Biocenter, a building currently under construction. The center currently supports seven full-time employees.
NaCTeM’s first and foremost task will be to help academic researchers do text mining. The center plans both to catalog tools available elsewhere and to distribute tools built in-house. In addition, it will offer training in text mining and provide access to help desks and documentation.
“There are public domain tools out there, and we will be offering a means of rapidly finding these and giving links to people to download them,” said Jock McNaught, an associate director of the center. “But we will also be developing our own tools, and making them highly efficient and scalable, so we can handle large amounts of text,” he added.
Over the next two years, the center plans first to release elementary tools, “with lots of warnings about their capabilities,” McNaught said, followed gradually by more advanced ones. Most likely, the tools will be accessed in a distributed environment.
On the research and tool-development side, NaCTeM has several projects in mind, based on what biologists have said they would find most useful.
One area is using text mining to interpret the results of high-throughput post-genomic experiments, such as mass spectrometry-based proteomics and microarray gene expression studies. “The mass of data means that making a semantic interpretation of the phenomena that you are observing is quite difficult without referring to the biomedical literature,” said Stapley. For example, text mining could help researchers detect novel relationships between genes or proteins that appear linked in an experiment by weeding out those pairs whose relationships are already known from the literature.
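The weeding-out step can be pictured with a minimal Python sketch. It is not NaCTeM's tool: the function name, the placeholder gene names, and the assumption that literature-derived relationships are already available as a simple list of name pairs are all hypothetical, and a real system would first have to extract those pairs from text and reconcile variant names.

```python
def novel_pairs(experimental_pairs, literature_pairs):
    """Return experimental pairs that have no known literature relationship.

    Pairs are treated as unordered and names are compared case-insensitively
    (an illustrative simplification, not a real name-normalization scheme).
    """
    def key(pair):
        # frozenset makes ("a", "b") and ("b", "a") compare equal
        return frozenset(name.upper() for name in pair)

    known = {key(pair) for pair in literature_pairs}
    return [pair for pair in experimental_pairs if key(pair) not in known]


# Placeholder names only; these are not real findings.
experiment = [("geneA", "geneB"), ("geneC", "geneA")]
literature = [("GENEB", "genea")]
candidates = novel_pairs(experiment, literature)
```

Here only the pair that the literature list does not already cover is kept as a candidate for a novel relationship.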
Another new tool on the agenda would be used to extract information about possible toxic properties of drug compounds from the literature. “This is not as straightforward as it sounds because there [are a lot of] variances of compound names,” Stapley said. He has been using machine-learning approaches to predict toxic properties of compounds, based on a set of compounds with known toxicities and a body of text associated with them. “The initial results have been quite encouraging,” he said.
Handling massive amounts of text is one prerequisite any new tool has to fulfill, Stapley said, especially as more and more full-text journal articles are coming online. “At present, there is really no unified way in which to access these texts to datamine. We would be very interested in establishing relations with online publishers because these are really an untapped source,” he said, adding that NaCTeM is currently in talks with BioMed Central.
NaCTeM’s special expertise is in scientific and technical terminology, which gives it a potential edge. For example, the center has statistical and linguistic techniques for identifying the parts of compound terms and determining whether those parts are themselves term candidates, according to McNaught. In addition, the center has techniques for handling reduced forms, acronyms, abbreviations, and other variants. “Because of our expertise in the terminology area we expect that the tools that we produce ourselves will be able to perform much better than the public state of the art,” McNaught said.
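To give a flavor of what handling acronyms involves, here is a deliberately simplistic Python sketch. It is a stand-in heuristic, not the center's technique: the function name and the short stopword list are assumptions, and it only checks whether a short form matches the initial letters of a long form.

```python
import re

# Small, illustrative list of words that usually contribute no letter
SKIPPED_WORDS = {"of", "for", "and", "the", "at", "in"}


def is_initialism(short_form, long_form):
    """True if short_form equals the initials of long_form's content words."""
    words = re.findall(r"[A-Za-z]+", long_form)
    initials = "".join(w[0] for w in words if w.lower() not in SKIPPED_WORDS)
    return bool(initials) and initials.upper() == short_form.upper()


matches = is_initialism("SDSC", "San Diego Supercomputer Center")
```

A check like this links "SDSC" to "San Diego Supercomputer Center" but fails on a name such as NaCTeM, which takes more than one letter from some words. Cases like that are one reason variant handling in real terminology is harder than it looks.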
He added that the center will adopt biomedical ontologies that are currently available, such as the Gene Ontology, but will also hone its own strengths in term extraction and identification of variants of terms, “which is of particular importance in biology, where there is a great deal of volatility in terminology and often many forms referring to the same concept.”
The center also has a number of research partners, including the University of California at Berkeley, the University of Geneva, the University of Tokyo, and the San Diego Supercomputer Center. For example, NaCTeM plans to integrate a data-mining toolkit developed at SDSC (SKIDL, or SDSC Knowledge and Information Discovery Lab) with the Cheshire information retrieval system developed at the University of Liverpool and Berkeley.
After the three years’ funding is up, NaCTeM is expected to become self-sustaining “to some degree,” McNaught said. “Because we are not allowed to charge academics, that can only mean that we then seek commercial outlets for our tools and services,” offering them to pharmaceutical and biotechnology companies at commercial rates.
But there is still some work to do, and biologists should not expect miracles from text-mining technology. “I think one has to be somewhat realistic about the limitations of text mining. Quite often, the biologist comes through to us with a very complex, difficult problem, and they are surprised that we can’t solve it easily,” said Stapley.