At OHSU, Hersh Seeks a Better Way to Search the Literature


Bill Hersh isn’t much of a bench scientist. Nor is he much of a computer scientist. But Hersh, an MD who heads the department of medical informatics at Oregon Health & Science University in Portland, is tackling a problem that many scientists studying gene function can appreciate: How does a researcher accurately scour the literature for information about a new gene of interest?

Hersh has been intrigued by information retrieval ever since he took a postdoc in medical informatics at Harvard in 1987 (to be fair, his postdoc did involve earning a certificate in computer science). In the past few years, he’s become involved in the Text Retrieval Conference Genomics initiative, also called TREC Genomics, which provides a standard data set for investigators studying how to improve the search engines that comb the scientific literature. Like those behind many text search engines, including popular services such as Google, the algorithms currently used to search scientific databases are inexact, and often return a good deal of irrelevant information along with the intended target of the search.

The challenges in dealing with genomic data are frustrating but familiar: There are many names for the same gene, many proteins take the same name as the gene that encodes them, and often a gene is known by a common word. “Human languages are hard for a computer to figure out,” Hersh says. “They’re good at picking out one word from gigabytes of data, but it’s difficult for them to extract the meaning.” His ultimate goal, Hersh says, is to build better systems for retrieving meaningful information.

In 2003, the first year of the TREC Genomics project, Hersh assembled a data set that included an entire year’s worth of data from MEDLINE, a surprisingly small collection at 1.5 gigabytes, he says. Hersh and other academics studying information retrieval used the data set to compare the efficiencies of their search algorithms in a more genteel version of a typical vendor “bake-off.” It was more “learning” than “we’re better than you,” he says.

For this year’s version of the data set, Hersh hopes to go full text, and to make the queries put to the search algorithms more realistic. To make sure his efforts remain grounded in the actual needs of scientists, he plans to send his students into OHSU biology labs engaged in genomics to collect specific examples of scientists’ most recent literature searches. His students will ask, “What’s the last question you had?” he says.

— John S. MacNeil

