As the field of biomedical text mining matures, researchers are finding a number of new application areas for the technology. Reel Two, a text-mining software firm that focuses on the life-sciences market, is hoping to exploit this trend by launching several new targeted applications based on its flagship Classification System technology.
The first, SureGene, addresses the “gene disambiguation” problem in the biomedical literature, according to Nicko Goncharoff, senior vice president at Reel Two. Gene names are a particular challenge for text-mining systems because there are many synonyms for the same gene, as well as many gene names that are also common English words.
In collaboration with AstraZeneca, Reel Two ran all of MEDLINE through its Classification System to create a pre-filtered database of abstracts in which gene names are mapped to their Entrez/LocusLink IDs. Users enter a canonical gene name or any of its synonyms, and the system returns a ranked list of abstracts about that gene, regardless of how the article refers to it. The database is updated daily to classify genes in new articles, Goncharoff said.
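The lookup described above can be illustrated with a minimal Python sketch. It is not Reel Two's implementation: the synonym table, abstract identifiers, scores, and function names are all hypothetical, and the sketch assumes the hard part, deciding which abstracts are really about a gene, has already been done by an upstream classifier.

```python
from collections import defaultdict

# Hypothetical synonym table: every known surface form of a gene name
# maps to a single gene identifier (Entrez-style IDs used for illustration).
SYNONYMS = {
    "tp53": "7157",
    "p53": "7157",
    "brca1": "672",
    "breast cancer 1": "672",
}

# Hypothetical classifier output: (abstract id, gene id, relevance score).
CLASSIFIED = [
    ("abstract-2", "7157", 0.64),
    ("abstract-1", "7157", 0.91),
    ("abstract-3", "672", 0.88),
]


def build_index(rows):
    """Group abstracts by gene id, highest score first."""
    index = defaultdict(list)
    for abstract_id, gene_id, score in rows:
        index[gene_id].append((abstract_id, score))
    for hits in index.values():
        hits.sort(key=lambda hit: hit[1], reverse=True)
    return dict(index)


def search(query, index, synonyms=SYNONYMS):
    """Resolve a name or synonym to a gene id and return ranked abstract ids."""
    gene_id = synonyms.get(query.strip().lower())
    if gene_id is None:
        return []  # unknown name: nothing to return
    return [abstract_id for abstract_id, _ in index.get(gene_id, [])]


INDEX = build_index(CLASSIFIED)
print(search("p53", INDEX))
print(search("TP53", INDEX))
```

Because every synonym resolves to the same identifier before the index is consulted, both queries return the same ranked list.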
A beta version of SureGene 1.0 that covers 8,300 human genes is currently available (http://suregene.reeltwo.com). Goncharoff said that an upgrade, version 1.1, which will cover 38,000 genes, is expected in the first quarter of 2005. Version 2.0, which will have additional features, is slated for a spring release.
Reel Two is also partnering with cheminformatics firm OpenEye Software to develop a text-mining application called SureChem that will allow users to search the literature for chemicals using either structures or chemical names. Goncharoff said that most search engines currently scan the literature using either structure or keyword, “but there were no tools to bridge the gap.” SureChem combines Reel Two’s Entity Extraction technology with OpenEye’s OGHAM text-to-structure conversion package to enable searches by structure or chemical name. An online demo of SureChem 0.1 is available now (http://surechem.reeltwo.com), and the package will be available for installation in mid-December. Version 1.1, which will include additional features such as phrase extraction, will be ready in the first quarter of 2005.
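A rough Python sketch shows how name and structure queries can share one index. It does not use SureChem or any OpenEye software: a small hypothetical lookup table stands in for text-to-structure conversion, structures are written as SMILES strings and compared by exact match, and the documents and function names are invented for illustration.

```python
# Hypothetical name-to-structure table standing in for a real
# text-to-structure converter. Structures are SMILES strings.
NAME_TO_SMILES = {
    "ethanol": "CCO",
    "ethyl alcohol": "CCO",
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "acetylsalicylic acid": "CC(=O)Oc1ccccc1C(=O)O",
}

# Hypothetical document collection.
DOCUMENTS = {
    "doc-1": "Ethyl alcohol was used as the solvent.",
    "doc-2": "Patients received aspirin daily.",
    "doc-3": "Acetylsalicylic acid was dissolved in ethanol.",
}


def extract_structures(text):
    """Return the structures of all known chemical names found in the text.

    Naive substring matching; a real entity extractor would handle word
    boundaries (this version would also find "ethanol" inside "methanol").
    """
    lowered = text.lower()
    return {smiles for name, smiles in NAME_TO_SMILES.items() if name in lowered}


def build_structure_index(documents):
    """Map each structure to the set of documents that mention it by any name."""
    index = {}
    for doc_id, text in documents.items():
        for smiles in extract_structures(text):
            index.setdefault(smiles, set()).add(doc_id)
    return index


def search(query, index):
    """Accept a chemical name or a SMILES string and return matching documents."""
    cleaned = query.strip()
    # Known names are converted to a structure; anything else is treated
    # as a structure string already.
    key = NAME_TO_SMILES.get(cleaned.lower(), cleaned)
    return sorted(index.get(key, set()))


INDEX = build_structure_index(DOCUMENTS)
print(search("ethanol", INDEX))
print(search("CCO", INDEX))
```

Indexing by structure rather than by name is what lets a query for one synonym, or for the structure itself, retrieve documents that use a different name. A production system would also canonicalize structures so that equivalent SMILES strings match.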
Goncharoff said that the more customer feedback Reel Two gets, the more ideas it comes up with for tweaking its text-mining software for specific application areas. He said the company will likely merge the functionality of SureGene and SureChem so that users can extract gene and protein names associated with a chemical search. In addition, he said, the company plans to offer a version of SureGene that enables researchers to retrieve literature related to genes identified in quantitative trait loci analysis. Reel Two has also applied SureChem to extract compounds from patent databases.