Efficient text mining technology has long been on the wish list of informatics directors seeking an automated way to glean knowledge about biological systems from journal articles and patent records. With this kind of demand, supply usually isn’t very far behind, and this is surely the case in bioinformatics-related text mining.
The last two weeks provided ample evidence of this trend: Software vendors looking to sell into the life sciences research market are making text-mining capabilities a priority, and this strategy is paying off in the form of sales. Reel Two, a two-year-old data- and text-mining software firm based in San Francisco, signed AstraZeneca as its first “marquee” customer for its Reel Two Classification System, while statistics veteran SPSS announced a deal with the University of Pennsylvania’s Abramson Family Cancer Research Institute. In addition, OmniViz, a Maynard, Mass.-based provider of visualization and data-mining software, released a new version of its software, OmniViz 3.5, with an added emphasis on text-mining capabilities.
Gavin Fischer, an application scientist at OmniViz, said he’s seen increased demand for the software’s text-mining capabilities from potential customers over the last year or so. “Interest in the text-mining side is huge,” he said. “Everywhere we go that is huge.”
A number of other software vendors, including Definiens, Virtual Genetics, PubGene, ClearForest, and even IBM, are already peddling various forms of text-mining technology in the bioinformatics sector [BioInform 09-23-02], but what sets the new batch apart from these other firms is a focus on selling combined data-mining/text-mining solutions. All three firms stress the fact that they are data-mining companies — text is essentially another form of data that can be plugged into the data-mining interface and manipulated graphically after it is harvested from the literature.
In the case of SPSS, the company’s LexiQuest Mine text mining tool combines linguistics and statistics to identify key concepts, and the relationships between them, in biological texts. This information can then be used in the company’s Clementine data-mining software to derive even further relationships, said SPSS senior analyst Cathy DeSesa.
Michael Liebman, director of computational biology and biomedical informatics at the Abramson Institute, said his group plans to apply the software not to Medline abstracts, as is the case with most text-mining projects, but to full journal articles. “The abstract is already a filtered view of the data,” said Liebman. “We’re looking for full syntactic analysis, not just a word list. This allows us to map concepts, map syntax, and build ontologies.” The biggest barrier to this approach is not the technology, said Liebman, but ready access to the full text of journal articles. His group has created a prototype of the system using breast cancer data in order to convince journal publishers of the scientific value of providing full-text access.
Liebman said the SPSS technology would also be used as part of a Pennsylvania Cancer Alliance project to build a conceptual data warehouse linking clinical and genomic information across cancer centers in the state. Adding literature-derived information into the mix will help create “a normalized data structure and a normalized metadata layer” for the whole process, he said.
The text-mining software has already helped Liebman’s group “find things we didn’t necessarily associate with each other before.”
DeSesa said that LexiQuest Mine has a number of other customers in bioinformatics research, including Duke University, AstraZeneca, and “four other pharmaceutical companies.”
AstraZeneca is also a Reel Two customer, along with “a handful” of other pharmas, biotechs, and academic groups, said Nicko Goncharoff, the company’s senior vice president. Goncharoff said Reel Two’s patent-pending technology is rooted in artificial intelligence and machine learning expertise carried over from its founders’ previous experience at AI firm Webmind. Reel Two’s software is easy to use and fast, said Goncharoff. One spin-off product created with the Classification System, the Gene Ontology Knowledge Discovery System (GO KDS), classifies 12 million Medline entries using GO terms, a calculation that took only 45 minutes, Goncharoff said. In addition to GO KDS, Reel Two is “working with a couple of biotech and pharmaceutical firms on projects that will grow into new solutions,” he added.
Reel Two expects to release an upgraded version of the Classification System in the second quarter, and Goncharoff said a few more sales announcements may follow in the near term as well.
OmniViz, meanwhile, is also finding success in the marketplace and claims Johnson & Johnson among the “who’s who of pharmaceutical firms” using its text-mining technology. The company is not targeting smaller firms at all, Fischer said, because it offers “broad-based solutions” based on its integrated data- and text-mining technology that are “better suited for larger firms.”
The focus of the new release, OmniViz 3.5, was improved speed, Fischer said. The software is now able to visualize a thousand Medline documents in under a minute on a desktop computer, and can handle up to a million references a day.
The company, a Battelle spin-off, claims that its patented visualization capabilities set its text-mining tools apart from the pack. “We’ve spent over 20 years in data visualization … We did not start life to cluster Medline, we started life to serve the intelligence community,” said Fischer. However, as long as there’s a demand for the technology, OmniViz is happy to meet its customers’ needs, and the capability has come in handy as a sales tactic: “You can get in the door with that buzzword of text mining,” Fischer said. Once customers see that it works, the rest of the data-mining and visualization package is an easy sell, he added.