Bill Hersh isn’t much of a bench scientist, nor is he much of a computer scientist. But Hersh, an MD who heads the department of medical informatics at Oregon Health & Science University in Portland, is tackling a problem that many researchers studying gene function can appreciate: How does a researcher accurately scour the literature for information about a new gene of interest?
Hersh has been intrigued by the study of information retrieval ever since he took a postdoc in medical informatics at Harvard in 1987 (to be fair, his postdoc did involve earning a certificate in computer science). In the past few years, he’s become involved in a project called the Text Retrieval Conference (TREC) Genomics initiative, which provides a standard data set for investigators studying how to improve the search engines that comb the scientific literature. Like those behind many text search engines, including popular services such as Google, the algorithms currently used to search scientific databases are inexact and often return a great deal of irrelevant information along with the intended target of the search.
The challenges in dealing with genomic data are frustrating but familiar: There are many names for the same gene, many proteins take the same names as the genes that encode them, and often a gene is known by a common word. “Human languages are hard for a computer to figure out,” Hersh says. “They’re good at picking out one word from gigabytes of data, but it’s difficult for them to extract the meaning.” His ultimate goal, Hersh says, is to build better systems for retrieving meaningful information.
In 2003, the first year of the TREC Genomics project, Hersh assembled a data set that included an entire year’s worth of data from MedLine, a surprisingly small collection at 1.5 gigabytes, he says. Hersh and other academics studying information retrieval used the data set to compare the efficiency of their search algorithms in a more genteel version of a typical vendor “bake-off.” It was more “learning” than “we’re better than you,” he says.
For this year’s version of the data set, Hersh hopes to go full text and to make the queries his search algorithms handle more realistic. To make sure his efforts remain grounded in the actual needs of scientists, he plans to send his students into OHSU biology labs engaged in genomics to collect specific examples of scientists’ most recent literature searches. His students will ask, “What’s the last question you had?” he says.
— John S. MacNeil