PubGene, based in Norway, is counting on the first-mover advantage theory to hold true in the biological text-mining market. Since company founders Eivind Hovig and Tor-Kristian Jenssen first created an automated MedLine search method to build a database of gene interactions in 2001 [BioInform 05-14-01], a number of players have jumped in to meet the growing demand for better methods of extracting information from the biomedical literature. Niche players like Definiens and Ariadne are focusing on text-mining technology, and larger firms like Lion Bioscience, SPSS, and IBM are adding biomedical text-mining capabilities to their products.
But Hovig, who still maintains his research position at the Department of Tumor Biology at the Norwegian Radium Hospital, is unruffled in the face of such competition. “I don’t really see it as a threat because it still to a large degree depends on enabling the biologist, and not all the big guys want to enable the user based on the user’s interest,” he said.
So far, it seems there is little cause for concern. Pfizer recently signed on for a global license to the PubGene database, adding to a list of site and global licenses that includes Millennium Pharmaceuticals, DeCode Genetics, Janssen Pharmaceutica/J&J Belgium R&D, the Norwegian Radium Hospital, and the Norwegian Microarray Consortium. PubGene also has a number of customers for single-seat or multiple-seat licenses, according to Dmitrii Rodionov, head of marketing and sales. DeCode and Singapore-based bioinformatics firm HeliXense have also purchased licenses to distribute the PubGene database and tools.
Demand for the technology is skyrocketing, according to Hovig. The company maintains a free version of its database at www.pubgene.org, and Hovig said that hits to that site have grown ten-fold, to 100,000 per month, over the last two years. Free access to the academic site — a limited version of its commercial offering — is an important part of PubGene’s strategy, Hovig said, because the company views it as “a major contributor in making people aware that PubGene exists … If people can see that it’s actually working and you do get some information, that’s worthwhile.”
The company is currently preparing for the next release of the PubGene database and toolset, version 2.1, which is due out at the end of May. PubGene generally releases two upgrades per year, Rodionov said, as well as monthly upgrades for its customers. The company also began collaborating with Peoples Genetics in January to create a “melting map” of the human genome, which provides information on the DNA reagents required to search for disease-causing mutations in all human genes. PubGene expects the project to result in a new public service as well as two new commercial products. While it is still too early to disclose further details on those products, “they will be distinct from the PubGene database and analysis tools package, and will most likely compete at a different market segment,” said Rodionov.
Future developments in PubGene will likely include improved linkages between sequence homology data and sequence information in the literature. The goal of this feature, Hovig said, “is that when two sequences are homologous, but only one has been actively published on, the other one is linked to that one when you do the literature search.” The company is also looking into adding patent searching capabilities, and has begun to expand beyond its current set of human, mouse, and rat data into yeast.
PubGene relies on statistical correlations of biological terms within MedLine abstracts to create its database, an approach that proponents of other text-mining techniques, such as natural language processing (NLP), often criticize for being too simplistic. However, Hovig noted, NLP doesn’t scale very well. “NLP strategies are nice for small [data] sets, but not as nice for huge sets,” he said.
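For readers unfamiliar with the co-occurrence idea, the following is a minimal sketch, not PubGene's actual method: it simply counts how many abstracts mention each pair of terms. The function name `cooccurrence_counts`, the sample abstracts, and the gene symbols are all illustrative assumptions, and a production system would add synonym handling and a statistical significance score on top of raw counts.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts, terms):
    """Count how many abstracts mention each unordered pair of terms.

    Hypothetical helper for illustration only; matching is a plain
    case-insensitive token lookup.
    """
    vocab = {t.lower() for t in terms}
    counts = Counter()
    for text in abstracts:
        # Split the abstract into lowercase alphanumeric tokens.
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        # Keep only the terms of interest that appear in this abstract.
        present = sorted(vocab & tokens)
        # Each pair seen in the same abstract counts once per abstract.
        for a, b in combinations(present, 2):
            counts[(a, b)] += 1
    return counts


# Made-up abstracts, used only to exercise the function.
abstracts = [
    "TP53 and MDM2 interact in tumor cells.",
    "MDM2 regulates TP53 stability.",
    "BRCA1 is linked to TP53 signalling.",
]
pair_counts = cooccurrence_counts(abstracts, ["TP53", "MDM2", "BRCA1"])
```

Because each abstract is handled in a single pass with set operations, the cost grows roughly linearly with the number of abstracts, which is one reason a counting approach of this kind can be applied to very large collections.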
In the meantime, he said, he’s working to improve existing statistical approaches to measuring the quality of literature-defined interactions, as well as to permit users to define their searches based on different knowledge domains. “For instance, you might want to have interactions where you have the perspective of functional domains,” he said. Hovig is also looking into methods of “more refined information finding based on indirect hits” (in which two different items have a third item in common), a feature that would be useful for disease mapping, he said.
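The “indirect hit” idea can be illustrated with a small sketch. This is an assumption about how such a search might work, not a description of PubGene's software: given a list of directly linked pairs, it reports pairs that are not linked to each other but share at least one common neighbour. The function name `indirect_hits` and the item names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def indirect_hits(direct_pairs):
    """Map each indirectly linked pair to the items the pair has in common.

    Hypothetical helper for illustration only.
    """
    # Build an undirected adjacency map from the direct links.
    neighbours = defaultdict(set)
    for a, b in direct_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    neighbours = dict(neighbours)

    hits = defaultdict(set)
    for shared, linked in neighbours.items():
        # Any two items linked to the same third item are candidates.
        for a, b in combinations(sorted(linked), 2):
            # Skip pairs that already have a direct link.
            if b not in neighbours[a]:
                hits[(a, b)].add(shared)
    return dict(hits)


# Made-up links, used only to exercise the function.
direct = [
    ("gene_a", "disease_x"),
    ("gene_b", "disease_x"),
    ("gene_a", "gene_c"),
]
found = indirect_hits(direct)
```

In this toy input, `gene_a` and `gene_b` never appear together but both connect to `disease_x`, which is the kind of relationship that could suggest candidates for disease mapping.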
Hovig’s technological wish list also contains a “scientific literature markup language” to map interactions or biological elements within biomedical texts. With electronic publishing and open access journals like BioMed Central, Hovig said, “a lot more information could be harvested for a lot more people if full-text journals would tag basic [biological] elements.”