Researchers at Yale University have added a new tool to the biological literature-mining arsenal that allows users to retrieve images from published papers by searching the text within the figures — an approach that differs from other image-retrieval methods that rely on captions or other descriptions of the image.
“There is a lot of information in the image itself that you can only access using a system like [the one] we have developed, where you can peek inside the image,” said Michael Krauthammer, an assistant professor at Yale University School of Medicine’s department of pathology, who is the principal investigator of the project that led to Yale Image Finder, or YIF.
YIF finds images by searching the text within the images of a given journal article. Those images can include charts, diagrams, photos, and micrographs, Krauthammer said. Currently, YIF allows users to search the content of over 34,000 open access articles from PubMed Central. The articles, which are stored locally at Yale and updated every week, currently contain around 140,000 images.
Krauthammer, along with Songhua Xu of Yale’s department of computer science and James McCusker of Yale’s department of pathology, described the software in a paper published in the online version of Bioinformatics last month.
According to the paper, YIF returns results in the form of thumbnail figures that users can click to retrieve the high-resolution image. YIF also provides two types of related images: those that appear in the same paper and those from other papers with similar image content. Retrieved images link back to their source papers, the authors write.
The project is supported by a grant from the National Library of Medicine’s Text Mining as a Translational Tool in Biomedicine program. In the abstract for the grant, Krauthammer and colleagues described the tool as a “text mining based translational informatics tool” intended to help researchers better analyze data from whole genome linkage scans and other high throughput studies.
“It’s still unfolding,” Krauthammer told BioInform in an interview, adding that he is still collecting feedback from users. He said he would like YIF to access all of the literature, not just articles from open access journals, but “there are quite a lot of copyright issues surrounding this.”
The search engine finds images in papers by mining the text within the images. This is a markedly different approach from other projects that search through captions to find images, he said.
For example, BioText, developed by computer scientist Marti Hearst at the University of California, Berkeley, is a search engine that finds images through their captions.
Computer scientist Hong Yu of the University of Wisconsin-Milwaukee’s College of Engineering and Applied Science is also exploring image retrieval in a project that studies text associated with images in a given journal article, such as captions and sentences in the abstract or full text [BioInform 02-29-08].
In addition, the Subcellular Location Image Finder, SLIF, is a Carnegie Mellon University project led by computational biologist Robert Murphy that classifies and segments fluorescence microscopy images.
SLIF uses a mix of different methods, including traditional morphological image processing and image segmentation as well as text recognition to obtain the labels of the images. “But it is not about the text inside the images, it’s really, as I understand it, the findings in the fluorescent image,” Krauthammer said, adding that it’s “quite different” from YIF.
Yu’s team at UW is designing an interface to visualize images and sentences and to retrieve images from journal articles with a text-based query based on natural-language processing, but the system is not yet implemented, she told BioInform this week.
Yu said that YIF appears to be similar to Hearst's system as well as Google’s image search. “[Krauthammer’s] group basically indexes image captions and text features from the image using a simple query-term based approach,” she told BioInform in an e-mail.
Krauthammer noted that YIF differs from the UW project because “they identify text in the body or abstract of the article that most probably describes the content of the image. They are not looking at the text within the image.”
Krauthammer said he does not know of other tools like Yale Image Finder, which has a “more global” target than other retrieval systems: the textual elements in the image itself, which are “important in all images.” Heat maps or diagrams, for example, are not retrievable any other way, he said.
Don’t Mention It
One aspect of image searching that captured Krauthammer’s interest is that textual information about images is often absent in a given journal article. “Often [images] contain information you just can’t find in the text,” he said. Even if a caption mentions an aspect of an image, captions are usually broader in their scope than the images themselves.
For example, the results and details of a gene expression heat map or biological pathway are not spelled out in a caption, he said.
When you look inside the image, the text is specific to the information in the image, so “the precision of your search is pretty high using this type of approach,” he noted.
The search technology he and his colleagues developed uses histogram-based image processing techniques to identify textual elements in the image, for example words or even single letters. These text regions are then run through an optical character-recognition analysis pipeline.
“What we essentially do is we [take the whole image] and cut out small images, almost with scissors surrounding the text blocks,” Krauthammer said. “After that we throw away everything else, more or less, and feed just those snippets of text to the optical character recognition engine.”
That text extraction is repeated after rotating the image 90 degrees to capture, for example, vertical axis labels in x-y graphs. Text could also run at other angles, but, he said, “right now we said [90 degrees] is a good compromise.”
Next, in order to minimize false-positive query results, the text extracted from the image is run against the full text of the journal article. Image text that is mentioned in the article is retained.
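The validation step described above can be sketched in a few lines of Python. This is a hypothetical illustration, not YIF's actual code: `validate_ocr_tokens` and its inputs are made-up names, and the cropping and OCR stages that would produce `ocr_tokens` are assumed rather than shown.

```python
import re

def validate_ocr_tokens(ocr_tokens, article_text):
    """Keep only OCR-extracted tokens that also occur somewhere in the
    article's full text, discarding likely OCR misreads."""
    # Normalize the article into a set of lowercase word tokens.
    article_words = set(re.findall(r"[a-z0-9\-]+", article_text.lower()))
    return [t for t in ocr_tokens if t.lower() in article_words]

# Hypothetical OCR output from one figure: two real labels and one garble.
tokens = ["BRCA1", "p53", "xq#7z"]
paper = "Expression of BRCA1 and p53 was measured across all samples."
print(validate_ocr_tokens(tokens, paper))  # ['BRCA1', 'p53']
```

Checking extracted text against the full article is what lets the system discard strings the OCR engine "found" that were never actually in the figure.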
Test the Recall
YIF users can decide to obtain results in either “high precision” mode or “high recall” mode.
High-recall mode delivers a larger number of images, while in high-precision mode users can be almost “100 percent sure” that the queried information will be in the image, he said.
According to the Bioinformatics paper, tests with 161 randomly selected images showed that YIF can retrieve around 65 percent of the image text content at 27.9 percent precision in high-recall mode, and 38.5 percent of the image text content at 87.7 percent precision in high-precision mode. “I think the most important [number] is that you can expect that roughly 60 percent of the text image content [is] accessed. That is the number that makes me very optimistic that this is actually working,” he said.
Although the pipeline works, he said, there are still issues, as is always the case with new technology. “Our ability to find material inside the images is pretty complete, and that is high recall mode,” he said. “But it also produces a lot of garbage.” Sometimes the engine finds a string of characters or words in the image that are actually not there.
“In high precision mode, we weed out all these wrongly recognized words inside the image, but then you have a lower recall, meaning you would find fewer images; so it’s a trade-off.”
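The trade-off Krauthammer describes is the standard precision/recall one. A short illustrative computation follows; the counts are invented for the example (chosen so the output roughly matches the figures reported in the paper) and are not YIF's actual evaluation data.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision: fraction of returned items that are correct.
    Recall: fraction of all correct items that were returned."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# High-recall mode: return nearly everything, including OCR garbage.
p, r = precision_recall(true_positives=65, false_positives=168, false_negatives=35)
print(f"high recall:    precision={p:.3f} recall={r:.2f}")  # precision=0.279 recall=0.65

# High-precision mode: keep only tokens validated against the full text.
p, r = precision_recall(true_positives=38, false_positives=5, false_negatives=62)
print(f"high precision: precision={p:.3f} recall={r:.2f}")  # precision=0.884 recall=0.38
```

Weeding out wrongly recognized words removes false positives (raising precision) but also drops some genuine image text (lowering recall), which is exactly the trade-off between the two modes.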
He said that his team is still optimizing YIF’s ability to retrieve related figures as part of an answer to a given query in order to enable “literature navigation via images.”
For now, Krauthammer and his group plan to continue updating YIF with the latest open access content to ensure usability. “Obviously it is not complete, we would like to have all the scientific articles in it, but people have started using it,” he said.
The researchers are also scaling up YIF’s processing power. “We hope to expand it but at this point we want to make it a robust tool that really works to satisfy the research community,” he said.
Krauthammer said that his team plans to release the software under an open source license and is “interested in working with people who want to explore it academically or commercially.”
Image mining is complementary to pure text mining, he said, adding, “I just felt that images haven’t been really tackled.”
After working on text-mining methods for over a decade, he feels image retrieval is a second approach to search, one that plays to human strengths. “As humans we are very good with images and with image processing,” he said. “We have a much better ability to quickly assess the content of an image compared to text.”