Researchers have performed a proof-of-concept study to demonstrate that text-mining patient records could be an effective way to gather information on safety, efficacy, and potential new drug indications.
The team, led by researchers at Stanford University, presented the work earlier this month at the Bio-Ontologies Special Interest Group Meeting, held prior to the Intelligent Systems for Molecular Biology Conference.
In a presentation, Stanford's Paea LePendu described how he and colleagues mined unstructured text stored in electronic medical records in order to reproduce the findings from a study published in 2005 in The Lancet.
The paper found that the anti-inflammatory drug rofecoxib, marketed by Merck as Vioxx, increased the risk of serious coronary heart disease. Merck withdrew the drug from the market voluntarily in 2004.
LePendu said his team is already applying the method to detect off-label drug use in clinical records. Specifically, they are trying to determine which diseases are likely indications for the cancer drug bevacizumab, marketed by Genentech as Avastin.
The study is an example of how researchers are increasingly using ontologies to address biomedical problems, a growing trend in the community, Nigam Shah, an assistant professor of medicine at Stanford and a coordinator of the Bio-Ontologies SIG, told BioInform.
Shah said that there are currently between 300 and 350 publicly available ontologies in the biomedical space. With all this information at their fingertips, researchers are using these tools to mine text and databases to look for drug, genetic, and disease relationships and associations.
In his presentation, LePendu described how he and colleagues used the National Center for Biomedical Ontology's Annotator web service to find disease and drug annotations in physicians' notes and then mined the resulting annotations to compute the risk of a patient having a myocardial infarction after taking Vioxx for rheumatoid arthritis.
LePendu told BioInform that his team chose to reproduce the Vioxx study as a proof of concept that useful patterns can emerge when researchers use text analysis tools and ontologies to search for places where drugs and diseases are mentioned in large quantities of patient records.
A New Workflow
Other efforts have shown that adverse drug effects can be mined from the medical literature. Most recently, a group from Rand published a paper last month in JAMIA that also recapitulated the findings of the Lancet Vioxx paper. The Stanford project, however, marked the first attempt to do this using only unstructured data in electronic medical records, LePendu said.
His team noted in its SIG article that while "analyzing structured EHRs has proven useful in many different contexts, the true richness and complexity of health records — roughly 80 percent — lies within [unstructured] clinical notes."
The Stanford researchers used an annotation workflow tool based on the NCBO Annotator to mine textual notes for more than 1 million patients stored in the Stanford Translational Research Integrated Database Environment, or STRIDE, an informatics platform developed to support clinical and translational research.
STRIDE's data warehouse, which integrates data from Lucile Packard Children's Hospital and Stanford Hospital and Clinics, contains clinical and demographic data on 1.6 million pediatric and adult patients, 15 million clinical encounters, and 25 million ICD9-coded inpatient and outpatient diagnoses, among other records.
During his presentation, LePendu explained that the analysis workflow relies on Gene Ontology-based functional-enrichment analysis, which aims to understand "which functions are most significantly represented for a set of genes or proteins differentiated via altered expression levels."
Beginning with a set of genes of interest, a user can apply known functional annotations, obtained via GO annotations, and then compare the results to a reference to find which functions are over- or under-represented.
He said the Vioxx study extended these principles into the clinical setting by using the NCBO Annotator to extract disease or drug profiles from patient records.
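The over-representation comparison behind functional enrichment is commonly computed with a hypergeometric test. The Python sketch below illustrates that general idea only; it is not the Stanford team's code, and the function names, the toy gene identifiers, and the choice of a one-sided hypergeometric test are assumptions made for illustration. In a clinical version, patients would presumably take the place of genes, and disease or drug annotations the place of functions.

```python
from math import comb

def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) when N items are drawn without replacement from M items,
    n of which carry the annotation of interest."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total

def enrichment(study_genes, reference_annotations):
    """reference_annotations maps every gene in the reference set to a set of
    function terms. Returns {term: (hits_in_study, hits_in_reference, p_value)}."""
    M = len(reference_annotations)
    study = [g for g in study_genes if g in reference_annotations]
    N = len(study)
    terms = set().union(*reference_annotations.values())
    results = {}
    for term in terms:
        n = sum(1 for anns in reference_annotations.values() if term in anns)
        k = sum(1 for g in study if term in reference_annotations[g])
        results[term] = (k, n, hypergeom_upper_tail(k, M, n, N))
    return results

# Toy reference set (hypothetical gene names and function terms).
reference = {f"g{i}": ({"repair"} if i <= 3 else {"transport"}) for i in range(1, 11)}
for term, (k, n, p) in sorted(enrichment(["g1", "g2", "g3"], reference).items()):
    print(f"{term}: {k} of {n} annotated genes in study set, p = {p:.4f}")
```

A small p-value flags a term as over-represented in the study set; testing for under-representation would use the lower tail instead.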
The first step in the workflow is to select ontologies of interest — in this case medical ontologies such as the Systematized Nomenclature of Medicine - Clinical Terms. Researchers then extract lexicons from them and clean them up using statistical natural language processing methods.
Finally, they use the lexicon to annotate the text in the clinical notes and combine them with other patient data such as demographics and coded primary-discharge diagnoses.
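As a rough illustration of the last two steps, the Python sketch below matches a small lexicon against free-text notes and merges the resulting concepts with coded data. It is a simplified stand-in, not the NCBO Annotator: the lexicon entries, function names, and record layout are hypothetical, and it ignores negation, misspellings, and other issues a production workflow would need to handle.

```python
import re

# Hypothetical lexicon: lowercase surface string -> normalized concept label.
LEXICON = {
    "rheumatoid arthritis": "RA",
    "myocardial infarction": "MI",
    "heart attack": "MI",
    "vioxx": "rofecoxib",
    "rofecoxib": "rofecoxib",
}

def build_matcher(lexicon):
    # Put longer terms first so multi-word terms win over shorter overlaps.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.compile(pattern, re.IGNORECASE)

def annotate(note, lexicon, matcher):
    """Return (character offset, concept) pairs for each lexicon hit in a note."""
    return [(m.start(), lexicon[m.group(0).lower()]) for m in matcher.finditer(note)]

def patient_record(patient_id, notes, demographics, coded_diagnoses, lexicon):
    """Combine concepts found in notes with demographics and coded diagnoses."""
    matcher = build_matcher(lexicon)
    concepts = set()
    for note in notes:
        concepts.update(concept for _, concept in annotate(note, lexicon, matcher))
    return {
        "patient_id": patient_id,
        "demographics": demographics,
        "note_concepts": concepts,
        "icd9": set(coded_diagnoses),
    }

# Invented note and codes, for illustration only.
record = patient_record(
    "p1",
    ["Pt with rheumatoid arthritis on Vioxx; prior heart attack."],
    {"sex": "F"},
    ["714.0"],
    LEXICON,
)
print(sorted(record["note_concepts"]), sorted(record["icd9"]))
```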
For the Vioxx study, LePendu's group configured the workflow to use 16 ontologies that are relevant to the clinical domain.
The team used the workflow to search EHRs for patients with ICD9 codes indicating that they had RA and MI. Then they searched "normalized annotations" of clinician notes for mentions of RA, MI, and Vioxx.
Out of 1,827 patients with RA who had MIs, 339 were identified who took Vioxx prior to their MI. Furthermore, the team found that when they used only ICD9-coded data without the clinical notes, the results were "more ambiguous." Specifically, they were only able to identify 77 patients with RA and MI, with only 16 reported to have taken Vioxx prior to their MI.
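To put the two sets of counts side by side, the short Python sketch below computes the share of RA-and-MI patients with prior Vioxx use under each approach, along with the ratio of cohort sizes. The variable names are illustrative. These shares are simple proportions rather than a risk estimate, which would also require a comparison group that is not reported here.

```python
# Counts as reported above.
notes_cohort, notes_vioxx = 1827, 339   # clinical notes plus coded data
coded_cohort, coded_vioxx = 77, 16      # ICD9-coded data only

def share(part, whole):
    """Fraction of the cohort with prior Vioxx use."""
    return part / whole

notes_share = share(notes_vioxx, notes_cohort)
coded_share = share(coded_vioxx, coded_cohort)
cohort_ratio = notes_cohort / coded_cohort

print(f"Notes-based cohort: {notes_share:.1%} took Vioxx before their MI")
print(f"Coded-only cohort:  {coded_share:.1%} took Vioxx before their MI")
print(f"The notes-based cohort is {cohort_ratio:.1f} times larger")
```

The shares are of similar magnitude, but the notes-based cohort is more than 20 times larger.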
The larger population size of the unstructured notes made it easier to see patterns in the data, LePendu said. By comparison, the JAMIA study used only medical literature so it had a much smaller sample size, and the data required a lot of cleaning up to get good results, he said.
New Indications, New Issues
LePendu's team is now using the approach to find new indications for the cancer drug Avastin and they plan to do the same for other drugs as well.
Although the drug, marketed by Genentech/Roche, is typically used to treat metastatic cancers, LePendu said that, based on annotations that co-occur with Avastin in medical records, the team suspects that it is also being prescribed off-label to treat macular degeneration, diabetic retinopathy, and retinal vascular occlusion, all of which have been reported in the literature.
However, there are some off-label indications for the drug that haven't been reported in the literature and have no scientific backing whatsoever, LePendu said.
He said his team plans to apply the same approach it used for the Vioxx study to look for "co-occurrences of drugs and diseases." In this case, instead of looking for adverse events, "we are going to look for [where] the drug occurs just after the disease was found, or very close ... so the association between the drug and the disease is so strong that it’s a high likelihood that the drug was given to treat the disease."
Then, "we can take databases that contain known indications [for the drug] and compare them to what we find in patient records," thus confirming possible off-label use, LePendu said.
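One way to picture this two-step idea is the Python sketch below, which counts patients whose records mention a drug shortly after the first mention of a disease and then filters out known indications. It is an illustrative sketch, not the team's method: the 30-day window, the minimum patient count, the function names, and the toy records are all assumptions, and every non-drug concept is treated as a disease for simplicity.

```python
from collections import Counter
from datetime import date, timedelta

def candidate_indications(mentions, drug, window_days=30):
    """mentions: iterable of (patient_id, date, concept).
    Counts, per disease, the patients with a drug mention on or within
    window_days after the first mention of that disease."""
    by_patient = {}
    for pid, day, concept in mentions:
        by_patient.setdefault(pid, []).append((day, concept))
    window = timedelta(days=window_days)
    counts = Counter()
    for events in by_patient.values():
        drug_days = [d for d, c in events if c == drug]
        first_seen = {}
        for d, c in events:
            if c != drug and (c not in first_seen or d < first_seen[c]):
                first_seen[c] = d
        for disease, start in first_seen.items():
            if any(start <= d <= start + window for d in drug_days):
                counts[disease] += 1
    return counts

def off_label_candidates(counts, known_indications, min_patients=2):
    """Diseases seen often enough that are not in the known-indication list."""
    return sorted(d for d, n in counts.items()
                  if n >= min_patients and d not in known_indications)

# Toy records (invented for illustration).
mentions = [
    ("p1", date(2010, 1, 1), "macular degeneration"),
    ("p1", date(2010, 1, 6), "bevacizumab"),
    ("p2", date(2010, 3, 2), "macular degeneration"),
    ("p2", date(2010, 3, 9), "bevacizumab"),
    ("p3", date(2010, 5, 1), "colorectal cancer"),
    ("p3", date(2010, 5, 3), "bevacizumab"),
    ("p4", date(2010, 7, 1), "bevacizumab"),
    ("p4", date(2010, 9, 1), "macular degeneration"),  # drug came first: not counted
]
counts = candidate_indications(mentions, "bevacizumab")
print(off_label_candidates(counts, known_indications={"colorectal cancer"}))
```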
Yet even as ontologies are gaining traction in biomedical applications, this broader use has identified issues that were not taken into account when these resources were designed originally.
For example, some ontologies, such as the Foundational Model of Anatomy ontology from the University of Washington, are designed for their “logical cleanliness,” Shah explained.
However, “when you use ontologies that are designed to be good logical structures for text mining, you run into two problems,” he added. “One is you have all of these classes whose names never appear in text … and second you end up missing a lot of things that do appear in text but are not listed in the ontology.”
That’s because ontologists focus on “designing the logical structure so that it accurately represents the phenomenon … they are trying to represent” rather than all possible synonyms for the phenomenon.
For example, the FMA ontology makes a distinction between parts of the body that are solid and those that are space and assigns specific names to each region.
"These are distinctions that are really important to make when you are trying to organize human anatomy, but the particular name 'cavitated organ' [for instance] is not a string of text you want to use in your text mining," Shah said.
One solution that has been proposed by groups such as Rebecca Crowley's lab at the University of Pittsburgh is "text mining frequency analysis," which identifies patterns in which ontology terms are used in order to suggest terms that should be added to an ontology, he explained.
For example, because diseases are referred to differently in clinical trials, existing disease ontologies could be missing up to 20 percent of the terms that would be needed for thorough analyses, he said.
"What you do is ... take an existing list of diseases and find them in clinical trial descriptions," he said. "In sentences where you find the disease ... look at what words appear both before and after diseases and make a table of these patterns ... then take those patterns back to your corpus and see what other terms would fill in the blanks ... and those are candidate disease terms that you might want to include in your disease ontology."
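The pattern-based procedure Shah describes for finding candidate disease terms could be sketched roughly as follows in Python. This is a simplified illustration rather than any group's actual implementation: the function names, the use of a single word of context on each side, the three-word cap on candidate length, and the example sentences are all assumptions.

```python
import re
from collections import Counter

def context_patterns(sentences, known_terms):
    """Count (word before, word after) pairs around known disease terms.
    known_terms must be lowercase."""
    patterns = Counter()
    for sentence in sentences:
        low = sentence.lower()
        for term in known_terms:
            regex = r"(\w+)\s+" + re.escape(term) + r"\s+(\w+)"
            for m in re.finditer(regex, low):
                patterns[(m.group(1), m.group(2))] += 1
    return patterns

def candidate_terms(sentences, patterns, known_terms, max_words=3):
    """Apply each pattern back to the corpus and collect unknown fillers."""
    candidates = Counter()
    for sentence in sentences:
        low = sentence.lower()
        for before, after in patterns:
            regex = (rf"\b{re.escape(before)}\s+"
                     rf"((?:\w+\s+){{0,{max_words - 1}}}\w+)"
                     rf"\s+{re.escape(after)}\b")
            for m in re.finditer(regex, low):
                filler = m.group(1)
                if filler not in known_terms:
                    candidates[filler] += 1
    return candidates

# Invented trial-description sentences, for illustration only.
sentences = [
    "Patients diagnosed with diabetes were enrolled.",
    "Adults diagnosed with asthma were excluded.",
    "Subjects diagnosed with chronic kidney disease were enrolled.",
]
known = {"diabetes", "asthma"}
patterns = context_patterns(sentences, known)
print(candidate_terms(sentences, patterns, known).most_common())
```

In practice the candidates would still need manual review before being added to an ontology, since a pattern can also capture phrases that are not diseases.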