Interest in text mining tools for biology has surged within the academic and business communities over the past few years, and the field has now lured its first commercial bioinformatics player. Over the past year and a half, Lion Bioscience has conducted a research project to develop a technology to mine highly specific information from biological literature and documentation. One application already realized is the extraction of protein-protein interaction data on nuclear receptors from Medline abstracts.
While the technology is still in the final stages of development, Dietrich Schuhmann, director of healthcare IT at Lion, told BioInform that the company is evaluating the market for the technology, which he said could set a new standard for information retrieval, extraction, and analysis within the life sciences industry.
“For the moment, we’re interested in discussing our technology with potential users within the pharmaceutical and related industries,” said Schuhmann. “After having successfully introduced our approach at ISMB in Copenhagen we are now at the stage where we need to learn about the specific needs of the user so that we can adapt the technology accordingly.”
While a number of commercial entities, including IBM, Xerox, and SAS, offer software tools for general information extraction, bioinformatics vendors have been slow to develop text-mining software of their own. However, the torrent of data that prompted biology’s swift transformation into an information science has also increased awareness of the vast amounts of functional and structural data locked within the biomedical literature. On a website he hosts devoted to the subject (www.bionlp.org), Northeastern University’s Bob Futrelle estimates that “the volume of biology literature each year, measured in bytes, is about fifty times the size of the entire human genome, junk and all.”
But biological language is often more complex than that of other fields, which has stymied a number of academic research efforts in the area. Term ambiguity is a key problem: “Gene,” for example, can mean either “a DNA fragment transcribed and translated into a protein” or “a DNA region of biological interest with a name that carries a genetic phenotype.” Synonyms, acronyms, abbreviations, and a host of additional linguistic complexities make biomedical text mining particularly difficult.
Recognizing these challenges, Lion assembled a team of biologists, statisticians, chemists, computational linguists, physicists, and software engineers who are solely devoted to the text mining project. In addition, Lion is working with various academic partners, including Salford University in the UK, the Universidad Autonoma Cantoblanco in Spain, and the Ludwig-Maximilians University in Germany, in an effort to stay on top of text mining developments.
“In order to be able to extract the sort of information that is precious to the user we had to really understand the nature of the information we are dealing with and also all of the ways in which the user may want to work with the information,” said Schuhmann.
Now, with a core technology of statistical and natural language processing approaches largely in place, along with data and relational viewers and analysis tools, the Lion researchers are focused on gaining user feedback from early adopters of the technology, such as LBRI, Lion’s collaboration with Bayer in Boston, Mass. “Although results so far have been positive, there are always improvements that can make the technology even more powerful,” said Schuhmann.
Schuhmann said the project team would continue to explore post-processing technology for the extracted data, as well as data representation techniques, ontologies, and standardization issues for biological objects. While Schuhmann is pleased with the results thus far, he also has his sights firmly set on future stages of development.
“The next challenge will be to extract more diverse information than what we’re looking at the moment, such as biology-related patent information,” he said. “Still, given what we have accomplished so far, I can’t wait to see where we are in another year and a half.”