Stanford Team's Text Mining Approach Extracts Useful Information from Clinicians' Notes


Stanford University researchers have developed a method of extracting useful information from unstructured clinical notes in electronic health records.

The approach uses ontologies such as the National Library of Medicine's Unified Medical Language System to annotate medical concepts in the notes so that they can be mined and analyzed for applications such as drug safety studies, hypothesis testing, and profiling off-label drug use.

According to Nigam Shah, an assistant professor of medicine and the team's leader, the method provides a way for researchers to exploit what to date has been a largely untapped source of valuable medical data.

"If you ask any audience related to health care how much of the clinical knowledge is bundled up in text, you won't get an answer below 70 percent," he said in a statement. "If 70 to 80 percent of the data is locked up in text notes, we asked ourselves, 'What would be a good way to unlock it?'"

A detailed description of the approach was published earlier this month in Nature Clinical Pharmacology and Therapeutics.

In that paper, the researchers explored how the method could be used in drug safety studies by mining roughly 10 million physicians' notes on about 1.8 million patients in the Stanford Translational Research Integrated Database Environment, or STRIDE, over a 15-year period to try to identify harmful drug reactions.

According to the researchers, they first annotated drugs, diseases, procedures, and devices in the STRIDE clinical notes using ontologies from UMLS, the National Center for Biomedical Ontology's BioPortal, and others. This part of the process also includes a filtering step that sifts out "uninformative phrases" and "ambiguous terms," as well as data-normalizing steps. The concepts from this cleaned-up lexicon are then used to fill out a so-called "patient-feature matrix," which serves as the "data-mining substrate."
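The annotation and filtering step can be illustrated with a toy dictionary lookup. This is only a minimal sketch and not the Stanford pipeline: the lexicon entries, the concept identifiers, the `FILTERED_TERMS` set, and the `normalize` and `annotate` helpers are all hypothetical, and a real lexicon drawn from UMLS or BioPortal would be vastly larger.

```python
import re

# Hypothetical mini-lexicon mapping surface terms to (concept id, semantic type).
# The identifiers are made up for illustration; they are not real UMLS codes.
LEXICON = {
    "rofecoxib": ("C_ROFECOXIB", "drug"),
    "vioxx": ("C_ROFECOXIB", "drug"),
    "myocardial infarction": ("C_MI", "disease"),
    "heart attack": ("C_MI", "disease"),
    "cold": ("C_COMMON_COLD", "disease"),
}

# Assumed output of the filtering step: terms judged too ambiguous to keep.
FILTERED_TERMS = {"cold"}


def normalize(text):
    """Lowercase the text and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def annotate(note):
    """Return the set of (concept_id, semantic_type) pairs mentioned in a note."""
    padded = f" {normalize(note)} "
    found = set()
    for term, concept in LEXICON.items():
        if term in FILTERED_TERMS:
            continue  # skip ambiguous or uninformative terms
        if f" {term} " in padded:  # whole-word (or whole-phrase) match
            found.add(concept)
    return found


note = "Pt started Vioxx in March. Admitted with heart attack. Hands cold."
print(sorted(annotate(note)))
```

In this sketch the brand name and the generic name map to the same concept, which is one simple form of normalization, while "cold" is dropped because it could describe either an illness or a temperature.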

The basic idea of the matrix, Shah explained to BioInform this week, is that "for every patient, we want to look at every medical concept in a given time interval [to see] if for that patient, for that particular medical concept, there was a present positive mention of that concept in their clinical documents." This information is used to fill in the rows and columns of the matrix, he said.
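A rough sketch of how such a matrix could be filled in is shown below. It assumes the annotation step has already produced a list of mentions, each flagged as negated or not; the `MENTIONS` records, the concept identifiers, and the `build_matrix` helper are hypothetical and stand in for whatever data structures the researchers actually use.

```python
from collections import defaultdict

# Hypothetical annotated mentions: (patient_id, year, concept_id, is_negated).
MENTIONS = [
    ("p1", 2004, "C_ROFECOXIB", False),
    ("p1", 2005, "C_MI", False),
    ("p2", 2004, "C_ROFECOXIB", False),
    ("p2", 2004, "C_MI", True),   # negated mention, so it is not counted
    ("p3", 2006, "C_MI", False),  # outside the interval used below
]


def build_matrix(mentions, start_year, end_year):
    """Build a binary patient-by-concept matrix for one time interval.

    A cell is 1 only if the patient has at least one positive (non-negated)
    mention of the concept between start_year and end_year, inclusive.
    """
    present = defaultdict(set)
    patients = set()
    for patient, year, concept, negated in mentions:
        patients.add(patient)
        if not negated and start_year <= year <= end_year:
            present[patient].add(concept)
    patients = sorted(patients)
    concepts = sorted({c for found in present.values() for c in found})
    rows = [[1 if c in present[p] else 0 for c in concepts] for p in patients]
    return patients, concepts, rows


patients, concepts, rows = build_matrix(MENTIONS, 2004, 2005)
for patient, row in zip(patients, rows):
    print(patient, dict(zip(concepts, row)))
```

Keeping every patient as a row, even one with no positive mentions in the interval, matters for later analysis because those patients still count toward the denominators.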

Once they've created the matrix, researchers can then use statistical analysis tools to explore associations in the data. In the Nature Clinical Pharmacology and Therapeutics paper, the researchers report, for example, that when they used the method to analyze a reference dataset composed of 78 drugs including rofecoxib and celecoxib, and 12 different drug-related adverse events such as myocardial infarctions and acute renal failure, they were able to correctly identify all 28 known positive associations with confidence intervals of around 95 percent depending on the cutoff used and the false discovery rate tolerated.
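The article does not spell out which statistics the team applied, so the following is only an illustration of one common way to test a drug-event association from such a matrix: an odds ratio with a Wald (Woolf) confidence interval computed from a two-by-two table. The `odds_ratio_ci` helper and the counts passed to it are invented for the example and are not taken from the paper.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald (Woolf) confidence interval for a 2x2 table.

    a: exposed patients with the event     b: exposed patients without it
    c: unexposed patients with the event   d: unexposed patients without it
    z: normal quantile (1.96 gives an approximate 95 percent interval)
    """
    if min(a, b, c, d) == 0:
        # Haldane-Anscombe correction avoids division by zero in sparse tables.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Made-up counts: 1,000 patients with a drug mention, 1,000 without.
odds_ratio, lower, upper = odds_ratio_ci(40, 960, 20, 980)
print(f"OR={odds_ratio:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Under this kind of test, a drug-event pair would typically be flagged when the lower bound of the interval stays above 1, with the cutoff adjusted to control the false discovery rate across the many pairs tested.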

The team has also used the method in a related but separate study, accepted for publication in PLOS ONE, that profiled the safety of cilostazol. The drug, manufactured by Otsuka Pharmaceutical under the brand name Pletal, is used to treat intermittent claudication (muscle pain) in individuals with peripheral arterial disease (PAD). It currently carries a warning label because of concerns about increased risk of mortality in patients with congestive heart failure.

According to the researchers, using their pipeline to analyze EMR data from 232 PAD patients who were taking cilostazol and a control group of 1,160 PAD patients who were not taking the drug did not yield any association between cilostazol and cardiovascular events. This suggests, they concluded, that further studies may be needed to determine whether the safety warning on cilostazol is warranted.

Another benefit of the method, at least in terms of drug safety, is that it can flag adverse drug events much earlier than the US Food and Drug Administration's Adverse Event Reporting System, which compiles reports of medication side effects from patients, physicians, and pharmaceutical manufacturers.

In fact, Shah told BioInform that based on their calculations their method can flag drug-related adverse events on average about two years before the FDA issues an official alert.

Proof of this point is provided in the form of several graphs in the Nature Clinical Pharmacology and Therapeutics paper that show that when the method was used to check for adverse event signals for nine drugs known to be problematic, the Stanford approach flagged signals earlier than official FDA alerts in six of the nine cases.

There are some limits to the method. The researchers concede that it requires a large database in order to extract accurate trends. They also note that for drug-related studies, the FDA's reporting system is probably better at catching rare events, which wouldn't occur in high enough volume at any single institution.

Furthermore, the method doesn’t account for dose-dependent adverse reactions, nor does it distinguish between new and existing users of drugs, the researchers wrote.

However, they are working on refinements that they say will make it possible to extract other kinds of information from the clinical notes, such as reports of reactions caused by drug combinations, opportunities for drug repurposing, or finding medical profiles of patients that fit a certain scenario.
