If there were any doubt as to why text-mining tool development for life science research should be the technology du jour, William Hayes, director of library and literature informatics at Biogen Idec, thinks that the numbers speak for themselves. "In the last 15 years, we've spent a trillion dollars on biomedical research, and almost all of our knowledge from that is captured in the literature — and we've barely scratched the surface of it," he says. "Our spend for content outmatches our spend on analytical tools by at least five to one."
And Hayes knows that the situation at his company, which "can afford to have a group that provides this capability for the entire company," is vastly superior to academia, which can't afford such luxuries. "And they're the ones that are producing the most advanced technology for this," Hayes says.
But many of the folks working on the academic side of text mining believe its future is only looking brighter, with the caveat that there is still much to be accomplished. "We've gotten to the point where text-mining tools such as ChiliBot, iHOP, or Textpresso are actually starting to become useful to bench scientists," says Larry Hunter, director of the Center for Computational Pharmacology and Computational Bioscience Program at the University of Colorado, Denver. Hunter's own lab has developed a tool called MutationFinder, which is currently being used by the Protein Data Bank to help enhance the quality of its structure database. "Of course, no one can do biomedical research without PubMed, which is a text-mining tool of sorts, [but] we have a long way to go before text-mining tools are as useful as they can be."
To be fair, says Dietrich Rebholz-Schuhmann, leader of a text-mining research group at the European Bioinformatics Institute, bench biologists and biomedical researchers can be a tough crowd when it comes to developing tools that suit their individual needs. "This is a difficult task, since all researchers are experts in their domain and see limitations in the information extraction solutions due to disagreements with the produced results," says Rebholz-Schuhmann. "Nonetheless, there are a number of success stories linked to text-mining solutions."
The two areas in which a significant amount of headway has been made include information retrieval and the annotation of molecular products with literature information. Popular text-mining solutions such as iHOP, EBIMed, GoPubMed, and PubGene provide users with a defined focus and then tease out relevant information from a complete corpus of documents. For example, PubGene is capable of digging through the more than 25 million articles contained in PubMed to identify genes and proteins that it then displays as "literature networks." In these networks, genes and proteins are represented as nodes, and the connecting lines indicate where each gene or protein is co-cited.
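The co-citation idea behind a literature network like PubGene's can be sketched in a few lines. The toy Python version below — a simplification using made-up article IDs and gene mentions, not real PubMed data or PubGene's actual algorithm — counts how often two genes appear in the same article; each count becomes the weight of an edge between two gene nodes.

```python
from itertools import combinations
from collections import Counter

def build_cocitation_network(articles):
    """Build a co-citation network from per-article gene mention sets.

    `articles` maps an article ID to the set of gene symbols it mentions;
    an edge's weight is the number of articles that co-cite both genes.
    """
    edges = Counter()
    for genes in articles.values():
        # Sort so each gene pair always gets the same (a, b) edge key.
        for a, b in combinations(sorted(genes), 2):
            edges[(a, b)] += 1
    return edges

# Toy corpus: hypothetical gene mentions keyed by fake PubMed IDs.
corpus = {
    "pmid1": {"TP53", "MDM2"},
    "pmid2": {"TP53", "MDM2", "BRCA1"},
    "pmid3": {"TP53", "BRCA1"},
}
network = build_cocitation_network(corpus)
print(network[("MDM2", "TP53")])  # TP53 and MDM2 are co-cited in 2 articles
```

A production system would of course add gene-name recognition and synonym resolution before this counting step; here the mention sets are simply given.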
As for annotation success stories, there are tools such as SherLoc, STRING, GoCAT, and GoAnnotator, all of which basically provide the researcher with clues on potential or hypothetical alternatives of the particular elements under scrutiny. GoAnnotator, which aims to solve the non-trivial and costly process of annotating proteins with gene ontology terms, works by linking uncurated annotations to text extracted from the literature. Text selection is based on the similarity between the extracted text and the terms originating from the uncurated annotation.
Open access issues
One of the biggest challenges facing text mining is the issue of free access to full-text articles. While there is certainly valuable information contained in abstracts, the real benefit of text-mining applications can only be truly realized with access to whole articles. "This is a major obstacle because so much valuable content can't be mined and indexed," says Lynette Hirschman, director of biomedical informatics for the Information Technology Center at MITRE, a nonprofit organization focused on systems engineering and information technology. "There is more open access literature all the time — some journals are making older issues available and some publishers are using text mining for better indexing — but it is still incredibly fragmented and hard to access full text."
The full-text access crusade was given a serious boost with the National Institutes of Health's decree in early January requiring that all NIH-funded investigators submit an electronic version of their peer-reviewed manuscripts to PubMed Central within 12 months of publication. "The full article is far more informative than just an abstract, and it is the only place where a scientist can judge whether the conclusions reached are justified," says Hunter at Colorado. "While the challenges in scaling text-mining systems to handle full texts are themselves difficult, the increasing access to full text is going to make text mining all the more effective."
Almost everyone involved in text-mining application research and development agrees that full-text access is key, but scaling text-mining systems to full texts has its own set of challenges simply by virtue of the number of words or phrases an application has to sort through. "One fairly obvious difference is the requirement to resolve co-references — that is, figuring out when the text says 'the protein' which of the many proteins mentioned previously is being referenced [because] abstracts have relatively few co-references, whereas full texts have a lot," says Hunter. "Another is document zoning [or] identifying the results versus the introduction, or associating a figure caption with the right figure."
Another common concern among text-mining developers has to do with the adaptability of these search solutions to workflows other than that for which they were originally developed. For example, there are many entity-tagging technologies out there, but almost all of them are customized and designed to work on a particular corpus with custom coding and custom result output. "The tools are there, the technology is ready, so it's really just figuring out the integration and deployment aspect," says Hayes. "There is no easy way to just grab and drop a particular cancer genotyping extraction or cytogenetic tagging tool into my workflow and go."
Text miners are forced to re-develop an attractive tool so that it fits into the specific workflow technology they happen to be using. Hayes says that this major technical stumbling block is disappointing because there is a lot of powerful academic code out there that he simply does not have the time or resources to tweak for his company's needs. He is not alone. Hirschman agrees that the overhead in tweaking text-mining tools to a new application is prohibitive. Ideally, she would like to see these tools used in the same manner as Blast — in other words, a tool that biologists and bioinformaticists can integrate into their individual processing pipelines.
'A distant vision'
For the time being, the consensus seems to be that text-mining solutions will remain hidden as part of tools used by the individual researcher. "Text mining will help databases do a better job in terms of both completeness and accuracy, and text-based methods will increasingly work their way into bioinformatics systems, such as improving performance in protein function prediction," says Hunter. "[But] the idea of a text-mining system that can keep up with the literature for you, or automatically find everything published that is somehow relevant to your specific research, remains a distant vision."
One thing Hunter and others would like to see is the ability to accurately recognize references to genes and gene products in text, including the ability to accurately map these genes to standard database identifiers. Hunter also participated in last year's BioCreative (Critical Assessment of Information Extraction in Biology) text-mining challenge, which first kicked off in 2004. "Judging by the BioCreative competition last year, we can recognize a mention of a gene or gene product in text with about 90 percent accuracy, but can only map them to database identifiers with about 80 percent accuracy, and that's when we know what organism the gene came from," says Hunter. "The general problem of normalization including recognition of the species is relatively unstudied, [so] trying to assemble these basic pieces into more complicated extractions remains difficult." The best performance on the competition's protein-protein interaction extraction task, submitted by Hunter's lab, was only about 40 percent correct.
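The normalization task Hunter is describing — mapping a recognized gene mention to a database identifier, given the organism — amounts to careful dictionary lookup at its simplest. The sketch below is an illustration, not the BioCreative systems' approach; the lexicon is a three-entry toy (TP53/Trp53 with their HGNC and MGI identifiers), and real lexicons run to hundreds of thousands of ambiguous synonyms.

```python
# Toy synonym lexicon: (mention, species) -> database identifier.
GENE_LEXICON = {
    ("p53", "human"): "HGNC:11998",
    ("TP53", "human"): "HGNC:11998",
    ("Trp53", "mouse"): "MGI:98834",
}

def normalize(mention, species):
    """Map a gene mention plus species to a database identifier,
    trying an exact match first, then a case-insensitive one."""
    key = (mention, species)
    if key in GENE_LEXICON:
        return GENE_LEXICON[key]
    for (name, sp), ident in GENE_LEXICON.items():
        if sp == species and name.lower() == mention.lower():
            return ident
    return None  # unresolvable without more context

print(normalize("p53", "human"))    # exact hit
print(normalize("trp53", "mouse"))  # case-insensitive hit
print(normalize("p53", "mouse"))    # fails: no mouse entry for 'p53'
```

The last call shows why Hunter flags species recognition as the understudied piece: the same surface string resolves differently, or not at all, depending on the organism.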
EBI's Rebholz-Schuhmann says that at the heart of any effective text-mining solution must be the ability to analyze large sets of documents and deliver extracted results in a standardized way. "The names and concepts often represent either molecular objects or complex abstract objects, such as a disease, that cannot be defined in a formal way denoting all details of the scientific evidence that the researcher has observed," he says. "It is important that such textual representations are general enough to be relevant and meaningful for a larger community of researchers, and that they are not too specific to only the work of the researcher." Rebholz-Schuhmann is optimistic that the text-mining community will overcome the hurdle of alternative interpretations of unstructured text through improved standardization of how semantics are represented in documents; much as different classification or clustering techniques yield alternative views of experimental data, different text-mining solutions will continue to deliver alternative interpretations.
Researchers are also looking at ways to utilize image data in the mining process. Genome Technology's sister publication BioInform recently reported on a team at the University of Wisconsin-Milwaukee College of Health Sciences that is developing a platform to scan the images in an article and map them to the abstract. The project plans to build on work published by UW-Milwaukee Assistant Professor Hong Yu, who, in 2006, published a description of a prototype called BioEx, a user interface capable of associating captions from particular images to article abstracts. The difficulties involved in using image data are many. Journal formats are not always machine-readable, so it is not always trivial to obtain the images in the first place, and variations in image quality present problems as well. "This is an active area of research — lots of exciting work on categorization of images, extraction of information from figure captions, association of figures with free text descriptions, use of images as a quick profile of an article," says Hirschman. "Since images are incredibly rich sources of information, I think this will be very productive for biology."
And while algorithms and software solutions may never totally master this most human of inventions, they can do things no curator or speed-reading scientist is capable of. "The lovely thing about language is that it is so rich in meaning and in structure, and you just won't be able to pre-tag the information while you're writing it because your needs downstream are going to be highly contextualized," says Hayes. "But text mining can come in and very quickly extract exactly what people need, and you're never going to be able to pre-generate that, no matter how good your database development is. That's the amazing thing."