Solutions: AstraZeneca Finds a Multi-Step Approach Is Best for Mining Biomedical Text


A number of academic research groups and commercial firms are developing software to extract relevant information from the biomedical literature, but a project at AstraZeneca suggests that none of these efforts is likely to work effectively on its own. When a collection of these tools is assembled correctly, however, the results are quite promising, according to the team conducting this project.

William Hayes and his colleagues in the discovery informatics group at AstraZeneca have just wrapped up the pilot phase for an in-house text-mining system they developed to support the company’s 3,000 researchers worldwide. After evaluating a number of commercial text-mining tools and developing a few of their own, the team found that a combination of approaches was the best solution, Hayes said.

“We have a suite of applications that work together, and they progress from document collection and corpus generation through to knowledge management,” Hayes said. The first step in the system is built on an application called QUOSA (Query, Organize, Share, and Analyze), sold by a company of the same name. The software searches the full text of articles in Medline, not just the abstracts, and is therefore able to retrieve important information that would be overlooked in abstract-based searches. “The key thing about text mining is that you can’t get away from needing the context of an actual document that a concept was extracted from,” Hayes said. “You need to know the experimental, the development stage, the environmental conditions under which [a finding] is actually true, and the only way you can know that is to read the full paper.”

Once QUOSA delivers a list of articles that contain an initial search term or phrase, these papers are run through a statistical co-occurrence analysis tool that the AstraZeneca team developed based on internal thesauri and ontologies specific to the company’s disease research areas. There are terms associated with cancer that are of particular interest to the company’s oncology research group, for example, along with a separate set of terms for central nervous system disorders, inflammation, and so on. This stage of the process analyzes how often these terms co-occur within the set of articles identified by QUOSA. Hayes said that his team is currently evaluating several vendor-supplied tools to replace this home-grown set-up.
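
As a rough illustration of what a co-occurrence step like this involves, the sketch below counts how many documents in a retrieved set mention each pair of terms from a term list. It is not AstraZeneca's tool: the function name, the three-term list, and the toy "articles" are hypothetical, and plain case-insensitive substring matching stands in for the thesaurus and ontology lookups described above.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, terms):
    """Count, for each pair of terms, how many documents mention both."""
    lowered_terms = {t.lower() for t in terms}
    counts = Counter()
    for text in documents:
        text_lower = text.lower()
        # Naive substring match (an assumption); a real system would
        # resolve synonyms through a thesaurus or ontology.
        present = sorted(t for t in lowered_terms if t in text_lower)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Hypothetical term list and toy documents, for illustration only.
terms = ["EGFR", "apoptosis", "tumor"]
docs = [
    "EGFR signaling suppresses apoptosis in tumor cells.",
    "Apoptosis is impaired in this tumor model.",
    "EGFR inhibitors are under evaluation.",
]

counts = cooccurrence_counts(docs, terms)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs that co-occur in many documents rise to the top of the list, which is the kind of signal a researcher could use to decide which articles or concepts to examine first.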

Next, the system uses two natural language processing applications. One is based on Ariadne Genomics’ PathwayAssist program, which pre-generates a database of all the protein interactions in Medline to create an interaction network. Researchers can use this desktop-based visualization tool to study the proteins highlighted within their articles in the context of biological interactions. In addition, the company is using a server-based tool that combines BioWisdom’s ontology with an interactive natural language processing system from Linguamatics. That application, called OBIIE (ontology-based interactive information extraction), “is probably the core of our effort,” Hayes said. “Once you know enough to ask really specific questions, this is what we use to generate significant databases.” As an example, he said, the company used OBIIE to quickly build a database of nuclear receptors and co-factors: “You just enter the ontology node for all the nuclear receptors and their aliases, and a set of co-factors, and then essentially get the interaction network out of the literature from that.”
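
The nuclear receptor example can be pictured with a much simpler stand-in than the tools named above. The sketch below maps aliases to canonical names and records a receptor and co-factor pair whenever both appear in the same sentence. This is sentence-level co-mention, not natural language processing, and it does not reflect how OBIIE, BioWisdom's ontology, or the Linguamatics system work internally. The alias tables, function names, and sample sentences are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical alias tables standing in for ontology nodes.
RECEPTOR_ALIASES = {
    "ESR1": ["ESR1", "estrogen receptor alpha", "ER-alpha"],
    "PPARG": ["PPARG", "PPAR-gamma"],
}
COFACTOR_ALIASES = {
    "NCOA1": ["NCOA1", "SRC-1"],
    "NCOR1": ["NCOR1", "N-CoR"],
}


def find_entities(sentence, alias_table):
    """Return canonical names whose aliases appear as whole tokens."""
    found = set()
    for canonical, aliases in alias_table.items():
        for alias in aliases:
            pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
            if re.search(pattern, sentence, re.IGNORECASE):
                found.add(canonical)
                break
    return found


def extract_network(texts):
    """Count sentences that mention both a receptor and a co-factor."""
    edges = defaultdict(int)
    for text in texts:
        # Crude sentence split on end punctuation (an assumption).
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            receptors = find_entities(sentence, RECEPTOR_ALIASES)
            cofactors = find_entities(sentence, COFACTOR_ALIASES)
            for receptor in receptors:
                for cofactor in cofactors:
                    edges[(receptor, cofactor)] += 1
    return dict(edges)


# Toy sentences, for illustration only.
texts = [
    "SRC-1 binds estrogen receptor alpha in a ligand-dependent manner. "
    "PPAR-gamma was not examined.",
    "N-CoR represses PPAR-gamma target genes.",
    "ER-alpha recruits NCOA1 to the promoter.",
]

network = extract_network(texts)
print(network)
```

Because every alias collapses to one canonical node, mentions of "SRC-1" and "NCOA1" contribute to the same edge. A co-mention count says nothing about the type or direction of an interaction, which is the part a real language-processing system is meant to supply.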

Hayes said that the AstraZeneca team is still analyzing the performance of the system, but initial tests indicate that it is reasonably accurate and, more importantly, fast. In one example, it took three days to analyze 300 abstracts manually, but took the text-mining system only about a half hour. Accuracy can range anywhere from 10 percent to 90 percent, but mostly falls within the range of 30 percent to 40 percent. In terms of the system’s recall, or ability to catch all the relevant articles, “[It’s] going to be lower with text mining [than with manual curation],” Hayes admitted. “However, what people miss is [that] your practical recall — what you would realistically be able to get out of the literature — is much greater. Because if your practical recall is half of what you would get from a manual effort, but your manual effort can only cover about 300 abstracts versus the 7,000 that you actually need to read through, then your practical recall is much greater than your theoretical recall [from the manual effort].” In addition, he said, “You also have to keep in mind that people aren’t 100 percent accurate.”
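
Hayes's argument about practical recall can be made concrete with a little arithmetic. The sketch below uses his figures (300 abstracts covered manually, 7,000 that need reading, and text mining at half the recall of manual curation) and adds two assumptions of its own: that manual curation finds everything in the abstracts it covers, and that relevant findings are spread evenly across the 7,000 abstracts. The function name is hypothetical.

```python
def practical_recall(per_document_recall, documents_covered, documents_needed):
    """Fraction of all relevant findings recovered in practice.

    Assumes relevant findings are spread evenly across the documents,
    so coverage scales recall linearly.
    """
    coverage = min(documents_covered, documents_needed) / documents_needed
    return per_document_recall * coverage


# Manual curation: assumed perfect recall, but only 300 of 7,000 abstracts.
manual = practical_recall(1.0, 300, 7000)

# Text mining: half the recall of manual curation, but all 7,000 abstracts.
mined = practical_recall(0.5, 7000, 7000)

print(f"manual: {manual:.3f}")
print(f"text mining: {mined:.3f}")
print(f"ratio: {mined / manual:.1f}x")
```

Under these assumptions the manual effort recovers a little over 4 percent of the relevant findings while text mining recovers 50 percent, roughly a twelvefold difference. Assuming less-than-perfect manual recall would only widen the gap.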

AstraZeneca estimates that the system could save tens to hundreds of millions of dollars per year in researchers’ time, not to mention the fruits of their text-based labors. One promising application area that AstraZeneca has identified so far is safety assessment, where researchers rely on the literature to help them determine a compound’s toxicity characteristics. “If you kill a project sooner, you can divert a lot of resources to projects that are more likely to be successful projects, and you can kill them sooner because you can actually frontload the tox tests that are more likely to be indicative of success or failure,” Hayes said. “So if you find out in the literature that, for a particular target, there have been indications of kidney issues when you increase the expression of the gene, then you start off with a kidney tox test, and look for issues there.”

Hayes said that the next step in rolling out the project across AstraZeneca involves training the company’s librarians on the system, because the library group will ultimately be responsible for delivering the service to the company’s researchers. In practice, he said, “The text-mining specialists in the library will have to sit down side-by-side with a domain specialist and interactively work through the information that they want to pull out.” Hayes said that AstraZeneca hasn’t set any strict timelines for deploying the system company-wide, but is taking an “organic” approach to rolling it out on a larger scale.

Hayes ultimately hopes to extend the results of the project beyond AstraZeneca. He said he’s in the early stages of helping to organize a user community with other pharmaceutical industry members to encourage the exchange of best practices in the biomedical text mining field. IP issues are not a concern in an effort like this, he noted, “because our competitive advantage at AstraZeneca is how well we do cancer and CNS research, it’s not how well we develop systems for analyzing the literature.”

— BT
