Text Mining: Help is On the Way


Poor Dataslave. We’ve all been there, trying to cull a few interesting documents from a large pile of boring ones. The only solution, until recently, was brain grease: read every document — the abstracts, anyway — and make a judgment call.

But that’s starting to change. Software vendors have felt our pain and are busily at work crafting new products that automate literature searching. There are many choices: Some products try to understand the content; others try to learn what you’re looking for by watching you classify a small number of examples; some analyze the entire literature, creating yet another database for you to search; others work on subsets that you select; others build on top of these products and let you combine analyses from multiple tools. There are some open-source packages, as well.

Dataslave needed some help, so I grabbed one product and two open-source tools and went to work. Before long, I had the answer and some confidence that it was right. The software helped and I’m glad I had it, but it wasn’t as easy as I hoped.

Where to Start

The starting point for any literature search is, for most of us, Medline. I generally access Medline through the Entrez query system operated by NCBI. The European Bioinformatics Institute also provides access through SRS, a data query and integration system originally developed at EBI and now commercialized by Lion Bioscience.

The US National Library of Medicine’s Medline is an indexed database of biomedical articles and abstracts. “Indexed” is used here in its traditional literary sense: human experts assign index terms, such as “mice” or “Huntington’s disease,” to each citation. Readers can retrieve articles by index terms, by full-text search, or by other fields. The index terms are drawn from a hierarchical, controlled vocabulary called Medical Subject Headings (MeSH).

NCBI offers PubMed, a superset of Medline that includes some non-biomedical articles from journals that are indexed by Medline, articles from journals that submit full text to PubMed Central, and articles from journals now covered by Medline that were published before coverage began.

In simple cases, the query syntax of Entrez / PubMed looks a lot like a web search engine: just type in some words and it’ll do something sensible. But beware. There’s more going on than meets the eye.

The first step is automatic term recognition: the software looks in a database of common biomedical phrases and translates your query into what it thinks you meant. In my experience, this usually does the right thing. But it doesn’t always work as expected, and you have to be fairly attentive to avoid turning the feature on or off inadvertently in complex queries. Fortunately, PubMed provides a “details” button you can hit to see how the software translated your query.

PubMed also lets you specify query terms by field — to search for articles by a particular author or to look for words that appear in the title or abstract. You can also combine search terms using AND, OR, NOT, and parentheses. There’s a great PubMed tutorial online with many more details.

An important subtlety is that the Medline indexers, while quite good, do mis-index some documents. A related problem is that documents can sit in the database for a while before the indexers get to them. To capture such documents, you have to add search terms that look for words that you expect to find in the documents you want. This is definitely a double-edged sword, as such words will often pick up documents that are only marginally related to your question.

SRS processes queries differently from PubMed and often gives different results for seemingly equivalent queries.

Grabbing the Big Pile

The query I used for this article is: Huntingtons disease AND (mice [Title/Abstract] OR mouse [Title/Abstract] OR murine [Title/Abstract]). The leading phrase “Huntingtons disease” triggers automatic term recognition and is translated into “Huntingtons disease [MeSH Terms] OR Huntingtons disease [Text Word].”

The overall query finds documents that are indexed under the term “Huntington’s disease” or that contain the phrase “Huntington’s disease” in the title, abstract, or other text fields of the entry, and that contain the words “mice,” “mouse,” or “murine” in the title or abstract. This query found 261 documents when I ran it to prepare this article.
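For readers who would rather assemble such a query in a script than type it into the search box, here is a minimal Python sketch. It rebuilds the query string quoted above and wraps it in a URL for ESearch, the search endpoint of NCBI's E-Utilities. The helper names (tagged, any_of, build_query, esearch_url) and the retmax value are illustrative assumptions, not anything used for this article, where the search was run through the PubMed web page.

```python
from urllib.parse import urlencode

# ESearch endpoint of NCBI's E-Utilities.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, field):
    """Attach a PubMed field tag to a term, e.g. 'mice [Title/Abstract]'."""
    return f"{term} [{field}]"


def any_of(terms):
    """Join alternatives with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query(disease, animal_words):
    """Untagged disease phrase AND any of the animal words in title/abstract."""
    animals = any_of(tagged(w, "Title/Abstract") for w in animal_words)
    return f"{disease} AND {animals}"


def esearch_url(query, retmax=500):
    """Build (but do not send) an ESearch request for PubMed.

    retmax=500 is an arbitrary illustrative cap on returned identifiers.
    """
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


query = build_query("Huntingtons disease", ["mice", "mouse", "murine"])
url = esearch_url(query)
print(query)
print(url)
```

Leaving the disease phrase untagged is deliberate: that is what lets automatic term recognition expand it into the MeSH and text-word forms.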

This proved to be a good starting point for Dataslave’s task, far better than the clunker Carbonoid dropped on him. Naturally, queries that differ even slightly may produce radically different answers. I tried lots of formulations before settling on the one above.

For verification purposes, I also worked with a more expansive query that retrieved 445 documents: (Huntingtons disease OR Huntingtons [Title/Abstract]) AND (mice OR mouse [Title/Abstract] OR murine [Title/Abstract]).

I downloaded the search results from PubMed using the “save” button they so nicely provide. This produces a text file containing the basic citation information for each document.

The next step was to get the abstracts. I planned to use BioPerl for this, but the PubMed interface wasn’t quite ready when I needed it. Fortunately, NCBI provides an easy way to download batch datasets from Entrez, called Entrez Utilities (E-Utilities). You just have to cobble together a URL that lists the identifiers of the entries you want, send it to NCBI (via the Perl GET utility or LWP module, for example), and they send back your data in XML, HTML, or text.
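The URL cobbling described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script used for this article: the identifiers are placeholders, the batch size is arbitrary, and the helper names are made up. The endpoint and the db, id, and retmode parameters are those of NCBI's EFetch utility.

```python
from urllib.parse import urlencode

# EFetch endpoint of NCBI's E-Utilities.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(pmids, retmode="xml"):
    """Build an EFetch URL that asks for a batch of PubMed records."""
    params = {
        "db": "pubmed",
        "id": ",".join(str(p) for p in pmids),
        "retmode": retmode,
    }
    # safe="," keeps the comma-separated ID list readable in the URL.
    return EFETCH + "?" + urlencode(params, safe=",")


def batches(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder identifiers, not real search results.
pmids = [11111111, 22222222, 33333333]
urls = [efetch_url(batch) for batch in batches(pmids, 2)]
for u in urls:
    print(u)

# To actually download, something like the following would do:
#   from urllib.request import urlopen
#   xml_text = urlopen(urls[0]).read().decode("utf-8")
```

Splitting the identifier list into batches keeps each URL to a manageable length when the result set runs to hundreds of documents.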

I got the abstracts in XML and processed them with BioPerl’s Biblio module. Biblio can split the XML stream into separate records for each abstract, and provides functions for grabbing fields like authors, title, body of the abstract, etc. I wrote a little script that created a separate text file for each abstract containing the information I cared about.
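The same split-and-extract step can be approximated without BioPerl. The sketch below uses Python's standard xml.etree module on a tiny hand-made sample. The element names follow PubMed's XML layout (PubmedArticle, PMID, ArticleTitle, AbstractText, Author), but the sample content is invented, and the one-file-per-PMID naming scheme is an assumption rather than a description of the original script.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Invented sample shaped like PubMed XML; real records carry many more fields.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>Made-up title one</ArticleTitle>
        <Abstract><AbstractText>Made-up abstract one.</AbstractText></Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><Initials>J</Initials></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article>
        <ArticleTitle>Made-up title two</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_records(xml_text):
    """Split the XML into one dict per article with the fields of interest."""
    root = ET.fromstring(xml_text)
    records = []
    for art in root.iter("PubmedArticle"):
        authors = [
            f"{a.findtext('LastName', '')} {a.findtext('Initials', '')}".strip()
            for a in art.iter("Author")
        ]
        # An abstract may have several AbstractText sections; join them.
        abstract = " ".join(
            "".join(t.itertext()).strip() for t in art.iter("AbstractText")
        )
        records.append({
            "pmid": art.findtext("MedlineCitation/PMID", ""),
            "title": art.findtext(".//ArticleTitle", ""),
            "authors": authors,
            "abstract": abstract,
        })
    return records


def write_records(records, out_dir):
    """Write each record to its own text file named after its PMID."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for rec in records:
        path = out_dir / f"{rec['pmid']}.txt"
        lines = [rec["title"], "; ".join(rec["authors"]), "", rec["abstract"]]
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


records = parse_records(SAMPLE_XML)
paths = write_records(records, tempfile.mkdtemp())
print([p.name for p in paths])
```

Records without an abstract, like the second one in the sample, still get a file, so they can be spotted and handled by hand.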

Finding the Gems

This is the point where I turned to a fancy new knowledge-mining tool, Classification System from Reel Two. My goal was to divide the big pile of documents in two — into a small pile of papers about mouse drug studies in Huntington’s disease, and a larger pile about other things.

I should mention that although this example only involves binary classification — drug studies vs. other — Classification System is more general. It can handle any number of classes, the classes can be arranged hierarchically, and documents can be placed into multiple classes.

Reel Two’s CS uses an approach called machine learning — a rather grand term for a simple idea. You give the program examples of documents that fall into each class, and it builds a mathematical model that can reproduce this categorization. Then you feed in new documents, and the program uses the model to put each document into the best class. The initial examples are called the training set and the new ones are called the test set.

I also tried two open-source packages based on similar methods: Rainbow by Andrew McCallum, now at the University of Massachusetts at Amherst, and AI::Categorize by Ken Williams, a Perl guru and the original Dr. Math of the Math Forum.

The central modeling method in all three programs is naïve Bayesian inference. Here’s the basic idea: To build the model, the program calculates a word profile for each class that tells how often each word appears in the class’s documents. To use the model, the program finds the class whose profile best matches the profile of a new document, using Bayes’s theorem to calculate “best.” It takes a lot of tricks to turn this basic idea into a practical program.
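To make the basic idea concrete, here is a toy naive Bayes text classifier in Python. It is a sketch only and does not reproduce how Classification System, Rainbow, or AI::Categorize are implemented. The add-one smoothing and log-probability scoring are common choices assumed here (two of the simpler tricks), and the four training sentences are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = {}        # class -> Counter of word frequencies
        self.doc_counts = Counter()  # class -> number of training documents
        self.vocab = set()           # every word seen during training

    def train(self, text, label):
        """Add one labeled document to the word profile of its class."""
        words = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_scores(self, text):
        """Return log P(class) + sum of log P(word | class) for each class."""
        words = [w for w in tokenize(text) if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            # Add-one smoothing so unseen words never give zero probability.
            denom = sum(counts.values()) + len(self.vocab)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            scores[label] = score
        return scores

    def classify(self, text):
        """Pick the class whose profile best matches the document."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


nb = NaiveBayes()
# Invented training sentences standing in for abstracts.
nb.train("riluzole treatment improved survival in transgenic mice", "drug")
nb.train("creatine therapy delayed symptoms in mice", "drug")
nb.train("huntingtin aggregates form in neurons of transgenic mice", "other")
nb.train("gene expression changes in mouse striatum", "other")

print(nb.classify("minocycline treatment delayed symptoms"))
print(nb.classify("huntingtin aggregates in striatum"))
```

Words the model has never seen, such as "minocycline" here, are simply skipped, which is one reason a small training set can miss relevant documents.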

CS is a much more polished program than the other two. It’s written in Java and has an easy-to-use, point-and-click user interface. Rainbow, written in C, is the most full-featured of the bunch, providing a wide range of modeling methods (only some of which seem to work), and the most detailed output reports as to what the model is doing. AI::Categorize is a Perl module and is the easiest choice if you’re looking for something to plug into a Perl script.

The Experiment

I trained each program on eight positive papers provided by a colleague and 25 arbitrarily chosen negatives. I used the same negatives for each program and, of course, manually checked to make sure they were true negatives.

I also went through the entire dataset by hand so I would know the correct answer. I found 17 papers that clearly belonged to the positive class, and another nine that were marginal in that they were about non-drug therapeutics — transplantation, gene therapy, and environmental enrichment. I decided to expand my positive definition to include all 26 of these documents.

To verify that my initial PubMed search was reasonable, I ran the more expansive query mentioned above and manually classified the extra 184 documents it found. Only four of these fit the positive definition, and only one was really relevant. (The other three were commentaries and such.)

I then ran the entire dataset through each program to see how many documents it could classify correctly. I quickly learned that none of the programs could do a complete job in one try.

The typical outcome was to predict 10 or 20 new positives, of which maybe half were correct. This isn’t what I had hoped for, but on reflection I realized it wasn’t so bad. Since only 10 percent of the documents in the initial dataset were true positives, having the program find a class that’s 50 percent positive is a considerable step forward.
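The arithmetic behind that judgment is short enough to spell out. The sketch below uses the article's own counts (26 positives among 261 documents) plus one hypothetical run of 20 predictions with half correct; the variable names are illustrative.

```python
def precision(true_pos, predicted_pos):
    """Fraction of predicted positives that are really positive."""
    return true_pos / predicted_pos


total_docs = 261       # documents found by the initial query
true_positives = 26    # positives found by checking everything by hand

# Chance of hitting a positive by picking a document at random.
base_rate = true_positives / total_docs

# Hypothetical run: 20 predicted positives, of which half are correct.
enriched = precision(10, 20)

# How much richer the predicted pile is than the full pile.
enrichment = enriched / base_rate

print(f"base rate {base_rate:.2f}, enriched {enriched:.2f}, "
      f"enrichment {enrichment:.1f}x")
```

In other words, the predicted pile is roughly five times richer in relevant papers than the full set.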

So, I switched to an iterative strategy. I trained each program as above. Then I ran the program to get an enriched class of predicted positives. Then I manually classified the predicted positives as true positives or true negatives, adding each to the training set. Then I retrained the program and did it all again, and again and again until the results no longer changed.
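The loop just described can be sketched as follows. Everything in it is a stand-in: toy_train_and_predict is a crude keyword matcher rather than any of the three programs, the six "documents" are invented, and ask_human looks answers up in a table where a person would read the abstract.

```python
def iterate(docs, seed_labels, train_and_predict, ask_human, max_rounds=10):
    """Repeat train, predict, hand-check until nothing new is predicted.

    docs: dict mapping doc_id to text
    seed_labels: dict mapping doc_id to True/False for the first training set
    train_and_predict: callable(docs, labels) returning predicted-positive ids
    ask_human: callable(doc_id) returning the manual True/False judgment
    """
    labels = dict(seed_labels)
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        predicted = train_and_predict(docs, labels)
        unchecked = predicted - set(labels)
        if not unchecked:
            break  # results no longer change
        for doc_id in unchecked:
            # Confirmed positives and rejected negatives both join the
            # training set for the next round.
            labels[doc_id] = ask_human(doc_id)
    positives = {d for d, is_pos in labels.items() if is_pos}
    return positives, rounds


def toy_train_and_predict(docs, labels):
    """Stand-in classifier: flag documents sharing a positive-only word."""
    pos_words, neg_words = set(), set()
    for doc_id, is_pos in labels.items():
        (pos_words if is_pos else neg_words).update(docs[doc_id].split())
    signal = pos_words - neg_words
    return {d for d, text in docs.items() if signal & set(text.split())}


# Invented mini-corpus and its hand-checked answers.
docs = {
    "a": "riluzole treatment mice",
    "b": "creatine treatment survival",
    "c": "minocycline survival trial",
    "d": "huntingtin aggregates mice",
    "e": "striatum aggregates imaging",
    "f": "imaging trial protocol",
}
truth = {"a": True, "b": True, "c": True, "d": False, "e": False, "f": False}

positives, rounds = iterate(
    docs,
    seed_labels={"a": True, "d": False},
    train_and_predict=toy_train_and_predict,
    ask_human=truth.__getitem__,
)
print(sorted(positives), rounds)
```

The point of the design is that only the predicted positives are ever read by hand, so the manual effort scales with the size of the enriched pile rather than the whole dataset.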

Using this approach, Reel Two’s product got to the correct answer in four iterations. Neither open-source program got all the way there; each stalled at 16 answers.

This new literature-searching software is not a panacea, but it certainly helps. I was hoping for an automatic solution, but instead had to resort to an iterative approach. Reel Two’s Classification System outperformed the two open-source packages, getting the correct answer after a few iterations. It was harder than I wanted, but at least it worked.

The literature is a very messy dataset. I suspect the products will need a few more revs to really get their arms around it, and I hope the vendors have enough patience (and money) to stick with it. In the meantime, Dataslave — and many of us — can look forward to reading lots of boring abstracts.




Product | Company / Developer
Classification System | Reel Two
Rainbow | Andrew McCallum, University of Massachusetts at Amherst
AI::Categorize | Ken Williams, Perl guru and original Dr. Math of the Math Forum


Site | Developer / Institution
Entrez | US National Center for Biotechnology Information
SRS | European Bioinformatics Institute
PubMed tutorial
Entrez Utilities

More Free Help

Though not mentioned in the story, the following sites automate the periodic running of PubMed queries. All are free of charge. PubCrawler and BioMail are open source.


Site | Developer / Institution
PubCrawler | Karsten Hokamp, Ken Wolfe, Trinity College Dublin
BioMail | Dmitry Mozzherin, State University of New York at Stony Brook
