A new web-based system from Spain’s Centro Nacional de Biotecnología promises to improve PubMed searching, but not through the application of the latest and greatest text-mining algorithm. Rather, the system, called iHOP (Information Hyperlinked over Proteins), accounts for the inadequacies of current information retrieval technology by putting the researcher — not the computer — at the heart of the search process.
Text retrieval methods are of great interest for bioinformatics researchers looking for a better way to slog through the rapidly growing pool of PubMed abstracts, but “there is no perfect text-mining tool,” said Robert Hoffman, a scientist in the center’s protein design group. “A [computational] system cannot replace a human expert, and this will probably be the case for at least the next five to 10 years.” The best methods available today, Hoffman said, offer around 90 percent accuracy in terms of retrieval and precision, but even a 1 percent error rate can have serious consequences for a biologist who requires precise knowledge about a gene or set of genes of interest.
Hoffman and his colleagues are quite familiar with both the promise and the limitations of text retrieval methods for the scientific literature. Led by biomedical text-mining pioneer Alfonso Valencia, CNB’s protein design group co-hosted the BioCreative (Critical Assessment of Information Extraction systems in Biology) competition in March, where a number of groups evaluated their text-mining systems on a series of predetermined biological research tasks. “Although the groups are doing fine and there is improvement, they are far from perfect,” Hoffman said.
To account for this embryonic stage of technology development while still improving navigation of the biomedical literature, Hoffman and his colleagues hit upon a compromise. The result, iHOP, uses text-mining technology to guide users through the PubMed maze, but doesn’t do all of the work. Researchers begin with a gene or protein of interest, and then leapfrog through the literature via hyperlinked gene names located within sentences taken from source abstracts. This offers researchers a shortcut in sorting through 14 million PubMed abstracts, while allowing them the final say in determining the relevance of the information, Hoffman said. This approach differs from the “black box” style of PubGene and other methods, which automatically map interacting genes and proteins onto graphs based on their co-occurrence within the text, but don’t provide the original source material that supports these interactions, leaving users unable to gauge their accuracy. Such representations “could give a misleading sense of confidence to the users and cloud the relevance of individual associations,” wrote Hoffman and Valencia in a brief article describing the system in the July 1 issue of Nature Genetics.
The basis for the system is a relational database that stores text sources related to around 30,000 genes for humans, mice, D. melanogaster, C. elegans, zebrafish, A. thaliana, yeast, and E. coli, along with all the synonyms, spelling variations, and capitalization idiosyncrasies associated with those gene names. Hoffman said that creating and maintaining the synonym list — currently in the range of about 4 million names — is one of the trickier aspects of the project, but one of the keys to making it work. This database is updated monthly, Hoffman said.
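To give a sense of why the synonym list matters, the following is a minimal sketch of name normalization, not iHOP's actual implementation. The function names, the in-memory dictionary standing in for the relational database, and the example gene entries are all assumptions made for illustration. The idea is that spelling and capitalization variants collapse to one key that points to a single canonical gene symbol.

```python
import re

# Hypothetical in-memory stand-in for the synonym table:
# normalized surface form -> canonical gene symbol.
SYNONYMS = {}


def normalize(name):
    """Lowercase the name and drop spaces, hyphens, and other punctuation."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def add_gene(canonical, variants):
    """Register a canonical symbol together with its known variants."""
    for variant in [canonical, *variants]:
        SYNONYMS[normalize(variant)] = canonical


def lookup(name):
    """Return the canonical symbol for a name as written, or None if unknown."""
    return SYNONYMS.get(normalize(name))


# Illustrative entries only; a real list would hold millions of names.
add_gene("TP53", ["p53", "tumor protein p53", "TP-53"])
add_gene("SNF1", ["Snf1p", "snf-1"])
```

With these entries, `lookup("P-53")` and `lookup("Tp 53")` would both resolve to `TP53`. A real synonym list would also need to handle names shared by several genes, which this sketch ignores.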
Each gene has its own XML-based information page that includes every sentence in which it is named along with a potential interaction partner. Sentences that include proteins with experimentally verified interactions are ranked at the top of the list. In addition, when a user arrives at the page of one gene from another gene, those sentences that associate the two genes have a higher ranking. Associating verbs — “suppresses,” “regulates,” “activates,” and the like — are also used to determine ranking. Users can maintain a “gene model” during each navigation session to build a network of interacting genes based on their query results.
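The ranking criteria described above could be sketched roughly as follows. This is an illustration under stated assumptions rather than iHOP's scoring method: the numeric weights, the `Sentence` class, the placeholder gene names, and the short verb list are all hypothetical, chosen only so that verified interactions outrank the navigation context, which in turn outranks the presence of an associating verb.

```python
from dataclasses import dataclass

# Verbs named in the article; a real system would use a longer list.
ASSOCIATING_VERBS = {"suppresses", "regulates", "activates"}


@dataclass
class Sentence:
    text: str
    genes: frozenset          # gene symbols mentioned in the sentence
    verified: bool = False    # True if the interaction is experimentally verified


def score(sentence, previous_gene=None):
    """Score a sentence; the weights 4, 2, and 1 are assumed, not from iHOP."""
    total = 0
    if sentence.verified:
        total += 4
    # Boost sentences that mention the gene the user navigated from.
    if previous_gene is not None and previous_gene in sentence.genes:
        total += 2
    words = {w.strip(".,;()").lower() for w in sentence.text.split()}
    if words & ASSOCIATING_VERBS:
        total += 1
    return total


def rank(sentences, previous_gene=None):
    """Return sentences ordered from highest to lowest score."""
    return sorted(sentences, key=lambda s: score(s, previous_gene), reverse=True)


# Placeholder sentences for a hypothetical GENE_A page.
a = Sentence("GENE_A binds GENE_B.", frozenset({"GENE_A", "GENE_B"}))
b = Sentence("GENE_A activates GENE_C.", frozenset({"GENE_A", "GENE_C"}))
c = Sentence("GENE_A interacts with GENE_D.", frozenset({"GENE_A", "GENE_D"}), verified=True)
```

Calling `rank([a, b, c])` places the verified sentence first, while `rank([a, b, c], previous_gene="GENE_B")` moves the GENE_B sentence ahead of the one that contains only an associating verb.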
Hoffman said that the goal of the CNB team was to develop a system that could “mimic” the way Google enables iterative exploration across a network of related information. “Outside of science, we’re spoiled because we know that with only a few clicks, we’ll end up [getting] at the information we’re seeking,” Hoffman said. “But in biomedicine, there is nothing like Google.” While acknowledging that there is still a way to go before iHOP becomes the Google of the PubMed set, Hoffman said his team “expect[s] that there will soon be automated systems that take up this idea and automatically crawl through this network.” While this is not a new concept for the Web, it is for the biomedical literature, he said. With iHOP as a starting point, “You could say to an automatic system, I have this gene and this gene: Find me the fastest way through the network that connects these two genes, or what are the sentences that connect these two genes? So this is probably the next step.”
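The "fastest way through the network" query that Hoffman describes maps naturally onto a shortest-path search. The sketch below uses a plain breadth-first search over an adjacency dictionary; the data structure, the function name, and the placeholder gene labels are assumptions for illustration, and nothing here reflects how a future iHOP crawler would actually be built.

```python
from collections import deque


def shortest_path(network, start, goal):
    """Breadth-first search for the fewest hops between two genes.

    `network` maps each gene to the genes it co-occurs with in at least
    one sentence. Returns the path as a list, or None if no path exists.
    """
    if start == goal:
        return [start]
    previous = {start: None}      # gene -> gene it was reached from
    queue = deque([start])
    while queue:
        gene = queue.popleft()
        for neighbor in network.get(gene, ()):
            if neighbor in previous:
                continue          # already visited
            previous[neighbor] = gene
            if neighbor == goal:
                # Walk back from the goal to the start, then reverse.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Placeholder network; letters stand in for gene symbols.
network = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
```

Here `shortest_path(network, "A", "E")` returns a three-hop route through the network. Attaching the supporting sentences to each edge would answer the second half of Hoffman's question, namely which sentences connect the two genes.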
The iHOP server is publicly available at http://www.pdg.cnb.uam.es/UniPub/iHOP/.
— BT