A team of researchers at the University of Wisconsin-Milwaukee College of Health Sciences is developing a computational platform designed to retrieve biomedical images from journal articles in response to text-based queries.
The project seeks to add a new dimension to the biomedical text-mining field by merging text with image data, which would capture valuable experimental information as part of the literature search process.
The goal of the project is to use natural language processing tools to identify textual statements that correspond to images in journal articles, and then to link these images to appropriate sentences in the abstract.
Text-mining as a discipline is still relatively unproven, however, and throwing image analysis into the mix adds an extra layer of complexity to an already difficult task.
The National Center for Research Resources thinks it’s worth studying the problem and recently awarded the team an R21 “exploratory” grant to collect evidence to apply for a longer-term R01 grant. R21 awards are generally for two years and have a funding cap of $275,000.
“It’s a very challenging task,” assistant professor Hong Yu told BioInform. “People might think it’s very easy because typically … when people write articles, they write the text based on the order of the images, but we found that’s not the case.”
Yu said that her group has so far analyzed more than 100 articles and found that only 30 percent followed that sequential order, while the remainder exhibited very little organization at all. Therefore, the researchers decided to develop an automated system to semantically organize the images and map them to the abstract.
In 2006, Yu and colleagues published a paper in Bioinformatics describing a prototype user interface called BioEx for mapping image captions to sentences in article abstracts.
The researchers hope to move beyond this work in their current project, which aims to develop methods that automatically classify images into particular experimental categories, as well as tools that automatically assign Gene Ontology categories to experiments.
“Traditionally, if people wanted to map to the Gene Ontology, they would use just text descriptions,” Yu said. For example, if the text includes the term “protein phosphorylation,” then most current systems would assign that article to that GO term.
But this approach is limited, she said. “Language is quite complex. If you mention protein phosphorylation, frequently you don’t know what protein phosphorylated or whether it was describing other people’s work or describing this work.”
Furthermore, text is fraught with ambiguities, “so you don’t know where a statement comes from and what evidence supports that statement, and really what that statement is about. So it’s our opinion that using an image-centric approach will give more solid proof and will actually have the experimental evidence.”
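The limitation Yu describes can be illustrated with a minimal sketch of keyword-based assignment. This is not the UWM system or any existing annotation tool; the phrase list, the function name `assign_terms`, and both example sentences are assumptions made up for illustration.

```python
# Hypothetical phrase list; a real system would draw its terms from the Gene Ontology.
GO_PHRASES = ["protein phosphorylation", "dna repair"]


def assign_terms(text):
    """Return every listed phrase that appears verbatim in the text."""
    lowered = text.lower()
    return [phrase for phrase in GO_PHRASES if phrase in lowered]


# Two invented sentences: one reports the authors' own result,
# the other only cites earlier work by someone else.
own_result = "We show that protein phosphorylation of the receptor increases after treatment."
prior_work = "Earlier studies reported protein phosphorylation in unrelated cell lines."

print(assign_terms(own_result))
print(assign_terms(prior_work))
```

Both sentences receive the same term, because plain string matching cannot tell which protein is involved or whose experiment is being described.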
Yu said that her team is tackling this problem by first identifying the “key sentence” that describes a given image, and then looking for similarities between that sentence and the sentence in the abstract.
“We’re also developing techniques to automatically classify a sentence that appears in the full-text article into different sections, such as whether the sentence provides background information, whether the sentence provides methods, or whether the sentence provides results, or whether it’s a conclusion,” she said. “Using that, then we will be able to map between the sentence in the abstract and the image in the full-text article.”
For example, she noted, “it’s very highly unlikely that if a sentence describes the background knowledge that it would correspond to a particular image, because most of the images are results or conclusions.”
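As a rough sketch of how those two steps could fit together, the example below scores a figure's key sentence against abstract sentences and skips any labeled as background. The article does not say which similarity measure or classifier Yu's team uses, so word-count cosine similarity and hand-assigned section labels stand in here as assumptions; the names `IMAGE_LABELS`, `cosine` and `link_image`, and all sentences, are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: only abstract sentences with these labels are allowed to match an image.
IMAGE_LABELS = {"results", "conclusion"}


def tokenize(sentence):
    """Lowercase the sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between the word-count vectors of two sentences."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def link_image(key_sentence, abstract):
    """Return the best-matching abstract sentence that is not filtered out by its label."""
    candidates = [s for s, label in abstract if label in IMAGE_LABELS]
    if not candidates:
        return None
    return max(candidates, key=lambda s: cosine(key_sentence, s))


# Invented abstract with hand-assigned labels; a real system would predict them.
abstract = [
    ("Kinase signaling regulates cell growth.", "background"),
    ("Western blots showed increased phosphorylation of the kinase after treatment.", "results"),
    ("These findings suggest a new therapeutic target.", "conclusion"),
]
key_sentence = "Figure 2 shows a western blot of kinase phosphorylation after treatment."

print(link_image(key_sentence, abstract))
```

Filtering by section before scoring means a background sentence that happens to share vocabulary with the figure text is never considered as a match.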
Nevertheless, Yu acknowledged that the problem is a challenging one. “For a useful system, we need to have over 70 percent accuracy, but currently we have only about 40 percent accuracy.” She said that one goal of the NCRR-funded project is to double that performance “so that biologists can really benefit from it.”
‘Slowly but Surely’
Phil Bourne, of the Department of Pharmacology at the University of California, San Diego, and editor in chief of PLoS Computational Biology, has been tracking the biomedical text-mining field closely and told BioInform that the goals of Yu’s project sound promising.
“The idea of going to an abstract and immediately being able to see the figures from the paper, which are often the essence of the experiments, could potentially be pretty good,” he said.
He cautioned, however, that the UWM project and most other biomedical text-mining initiatives are just “tantalizing beginnings” in a field that has been slow to advance.
“In fairness, it’s a very hard problem,” Bourne said. “Getting a computer to understand the semantics associated with language and what’s really being said in a scientific article is by no means trivial.”
So far, he said, most people in the field have focused on “low-hanging fruit” like recognizing gene identifiers or other terms, but even these methods have demonstrated accuracy in the range of 85 percent or lower. “Just recognizing terms that are fairly unique and valuable is still not a solved problem, let alone trying to unravel the complexities of language to get new information.”
Nevertheless, he thinks “it will happen slowly but surely.”
Yu said she hopes to have a system in place that researchers can use within the next three to five years. In the meantime, she said, she is looking into methods to help biologists annotate journal articles before they submit them in order to make text-mining easier.
Bourne said that his group at UCSD is engaged in a similar effort. “Rather than try to post-process a paper to actually add some kind of semantic value, why not actually have the author add that semantic value as they write the paper?” he said.
He said that the UCSD team is currently working with Microsoft to develop a plug-in for Word that would help biologists add semantic meaning to their texts. “As the author types, if they type a common gene name, for example, it goes off and looks in a pseudonym table of all the references between that common name for the gene and a more systematic name that’s in the Gene Ontology. And it basically offers the author the opportunity to make that substitution,” he said.
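The lookup Bourne describes could work roughly as in the sketch below. It is not the UCSD plug-in or any Microsoft Word API; the table `SYNONYMS`, its two entries, and the function `suggest_substitutions` are illustrative assumptions, and a real table would be far larger.

```python
import re

# Illustrative synonym table mapping lowercase common names to systematic gene symbols.
SYNONYMS = {"p53": "TP53", "her2": "ERBB2"}


def suggest_substitutions(text):
    """Return (typed word, suggested systematic name) pairs for known common names."""
    suggestions = []
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        word = match.group()
        systematic = SYNONYMS.get(word.lower())
        if systematic and word != systematic:
            suggestions.append((word, systematic))
    return suggestions


print(suggest_substitutions("Loss of p53 and HER2 amplification"))
```

The function only returns suggestions and never edits the text, which mirrors the idea of offering the author the substitution and leaving the decision to them.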
Bourne said that text-mining development could also benefit from the National Institutes of Health’s new public access policy, which requires investigators using NIH funding to submit an electronic version of their final, peer-reviewed manuscripts to PubMed Central within 12 months after official publication.
“That’s going to increase the richness of full text online, which will really foster more development in this whole text-mining area,” Bourne said.