New Method Uses Unstructured EMR Text and Genetic Data to Link Diseases, Cluster Patients


By Uduak Grace Thomas

A Danish research team has published a method that integrates information mined from free text in electronic patient records with protein and genetic information in order to uncover patterns of disease co-occurrence and help with patient stratification.

The team used text-mining techniques to extract clinically relevant terms from hospital staff notes in patient records and then mapped them to disease codes in the World Health Organization's International Classification of Diseases.

Team leader Søren Brunak, a professor of bioinformatics and disease systems biology at the Technical University of Denmark and the University of Copenhagen, respectively, said in a statement that when he and his colleagues applied their approach to electronic patient records at a local hospital, they were able to identify ten times more medical terms that characterized each patient than were manually entered by the hospital staff.

This additional detail is important, he added, because the terms used by healthcare providers in medical records "are heavily biased by local practice and billing purposes," which limits the ability to choose personalized treatment options.

Brunak said that the project began with his team's interest in characterizing phenotypes. "We actually started by looking at clinical disease descriptions from OMIM ... but there is no individuality to it ... there is nothing that can be used to classify individuals," he explained to BioInform. "We were interested in a more fine-grained characterization of patients and [that’s] why we turned to the patient record because there you have completely individualized information."

With the additional information, the researchers not only achieved the "fine-grained clinical characterization of each patient" they hoped for, but they were also able to find links between diseases and genes, and to stratify patients based on similar profiles.

In a paper describing the method that was published in the August issue of PLoS Computational Biology, the researchers note that there is an increasing focus on "the research potential of both structured and textual data" in electronic medical records and registries and that data from these resources can be used to improve patient safety, monitor adverse events, and identify potential candidates for clinical trials.

Indeed, there are several projects underway that are attempting to harvest the information contained in these systems. For example, the Electronic Medical Records and Genomics, or eMERGE, Network recently published the first results of an ongoing study that used algorithms developed specially for the project to mine EMR data from multiple systems and use it to identify patients in each of five disease groups (BI 4/22/2011).

Another study, presented at this year's Intelligent Systems for Molecular Biology conference by a team from Stanford University, demonstrated that text-mining patient records could be an effective way to gather information on safety, efficacy, and potential new drug indications (BI 7/27/2011).

Separately, Aurora Health Care said earlier this year that it is partnering with Oracle to develop software to mine data from Aurora's biorepository, which is linked to health information contained in EMRs. The goal of the project is to find samples of interest for use in biomarker discovery or clinical trials (BI 2/4/2011).

As these systems gain in popularity, there is a need for tools and methods to explore the "treasure trove of data for improving healthcare and research," the authors wrote.

"Extracting the data is a first step ... as [electronic patient record] systems in many countries maintain the use of free text to complement structured data, text-mining approaches are necessary for extracting data usable in further analysis," they said.

For text-based information, most groups rely on natural language processing-based tools, such as MedLEE and MetaMap, that are designed to recognize clinical terms and then map them to controlled vocabularies. However, for text in Danish, no "EPR information extraction tools exist," the authors wrote.

Brunak explained that, to address this problem, the team "recreated" tools and components that are typically used for mining English text so that they work for the Danish language.

But their method isn't restricted to a single language. Lars Juhl Jensen, a professor at the University of Copenhagen, explained in a statement that because "terminologies like ICD have been translated word by word between languages" it is "possible in principle to use the same term profiles across language barriers and combine cohorts across countries."

The researchers used a Danish translation of ICD-10 containing 22,261 terms that they supplemented with variations of existing codes. This additional information brought the total number of terms to 53,452.

Then they extracted information from text in the EPRs in units of sentences where each sentence is "tokenized treating spaces as word boundaries." Next, a stepping algorithm creates all possible "candidate strings" by "concatenating from one to 10 adjacent tokens." It then searches for each candidate string in the ICD-10 dictionary looking for one-to-one matches.
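
As a rough illustration of the matching step described above, the following Python sketch splits a sentence on whitespace, builds every candidate string of one to 10 adjacent tokens, and looks each one up in a dictionary for an exact match. The function names, the lowercasing, and the three-entry English dictionary are assumptions made for this example; the team's actual dictionary was in Danish and its implementation is not described here.

```python
def candidate_strings(sentence, max_tokens=10):
    """Yield every run of 1..max_tokens adjacent tokens, joined by single spaces."""
    tokens = sentence.split()  # whitespace is treated as the word boundary
    for start in range(len(tokens)):
        for length in range(1, max_tokens + 1):
            if start + length > len(tokens):
                break
            yield " ".join(tokens[start:start + length])


def match_codes(sentence, dictionary, max_tokens=10):
    """Return (candidate, code) pairs for candidates found verbatim in the dictionary."""
    hits = []
    # Lowercasing is an assumption of this sketch, not something the article states.
    for candidate in candidate_strings(sentence.lower(), max_tokens):
        code = dictionary.get(candidate)
        if code is not None:
            hits.append((candidate, code))
    return hits


# Hypothetical toy dictionary mapping terms to ICD-10 codes.
toy_dictionary = {
    "paranoid schizophrenia": "F20.0",
    "schizophrenia": "F20",
    "alcohol dependence": "F10.2",
}

hits = match_codes(
    "Patient has paranoid schizophrenia and alcohol dependence",
    toy_dictionary,
)
for candidate, code in hits:
    print(candidate, "->", code)
```

Note that overlapping candidates can each match: in this toy run both "paranoid schizophrenia" and the shorter "schizophrenia" are reported. How the real system handled such overlaps is not covered in this article.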

The team used the approach to extract clinically relevant terms from 5,543 patient records from a local psychiatric hospital before mapping the terms to disease codes in their ICD-10 dictionary.

Specifically, the researchers extracted 31,662 ICD-10 codes from structured text fields, corresponding to 2.7 unique codes per patient on average. Mining the free text in the records, they matched 218,963 text strings to codes in their ICD-10 dictionary.

Combining the mined and the previously assigned codes for these patients resulted in about 12.3 codes per patient.

This information was then incorporated into an association matrix where each patient-ICD-10 combination was assigned a "binary value" and a "term frequency-inverse document frequency weighted value." The latter is a statistical measure that evaluates how important a word is to a document in a collection. Together, the two values indicated whether or not a given code was associated with a given patient and how strong the association was.
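
To make the two kinds of matrix entry concrete, here is a minimal Python sketch that treats each patient as a "document" and each ICD-10 code as a "term." It uses one common textbook form of the weighting (relative term frequency multiplied by the natural log of inverse document frequency); the exact variant the researchers used is not given in this article, and the patient IDs and counts are invented.

```python
import math
from collections import Counter


def build_matrices(patient_codes):
    """patient_codes maps patient_id -> list of codes, with repeats.

    Returns (binary, tfidf), two sparse matrices stored as dicts keyed by
    (patient_id, code). Pairs that are absent are implicitly zero.
    """
    n_patients = len(patient_codes)

    # Document frequency: number of patients in whom each code appears.
    doc_freq = Counter()
    for codes in patient_codes.values():
        doc_freq.update(set(codes))

    binary, tfidf = {}, {}
    for patient_id, codes in patient_codes.items():
        counts = Counter(codes)
        total = sum(counts.values())
        for code, count in counts.items():
            tf = count / total                          # share of this patient's hits
            idf = math.log(n_patients / doc_freq[code])  # rarer codes weigh more
            binary[(patient_id, code)] = 1
            tfidf[(patient_id, code)] = tf * idf
    return binary, tfidf


# Hypothetical toy data: three patients and the codes found in their records.
toy_patients = {
    "p1": ["F20", "F20", "F10"],
    "p2": ["F20"],
    "p3": ["F32"],
}
binary, tfidf = build_matrices(toy_patients)
```

In this toy data, F10 appears in only one patient, so it receives a higher weight for p1 than F20 does, even though F20 is mentioned twice in that patient's record.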

To check that their method was accurate, the researchers manually checked 2,724 mined "hits" corresponding to 48 patients and 214 ICD-10 codes.

They reported that 87 percent of the ICD-10 codes mined from the text and added to the profiles were correct. Brunak told BioInform, however, that the accuracy is actually closer to 97 percent: further analysis of the results revealed that the method was picking up possible side effects of treatments, which were recorded in the patients' profiles but which the patients may not have experienced.

Next, the team explored comorbidities and disease correlations by looking for pairs of codes that occurred together in patients more often than expected. They used two statistical measures, a comorbidity score and a false discovery rate, to rank 226,801 potential pairs. They came up with 802 candidate pairs that occurred more than twice as often as would be expected by chance.
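
The idea of a pair occurring "more often than expected" can be sketched as a simple observed-to-expected ratio, where the expected count assumes the two codes occur independently. This is only an illustration: the researchers' actual comorbidity score and their false discovery rate calculation are not specified in this article, and the toy patients below are invented.

```python
from collections import Counter
from itertools import combinations


def comorbidity_ratios(patient_codes):
    """patient_codes maps patient_id -> set of codes.

    Returns a dict mapping each co-occurring (code_a, code_b) pair, in sorted
    order, to observed / expected, where expected is the co-occurrence count
    if the two codes were independent.
    """
    n_patients = len(patient_codes)
    single = Counter()
    pair = Counter()
    for codes in patient_codes.values():
        single.update(codes)
        pair.update(combinations(sorted(codes), 2))

    ratios = {}
    for (a, b), observed in pair.items():
        expected = single[a] * single[b] / n_patients
        ratios[(a, b)] = observed / expected
    return ratios


# Hypothetical toy cohort of eight patients.
toy_cohort = {
    "p1": {"F20", "F10"},
    "p2": {"F20", "F10"},
    "p3": {"F32"},
    "p4": {"F32"},
    "p5": {"F32", "F20"},
    "p6": {"F41"},
    "p7": {"F41"},
    "p8": {"F41"},
}
ratios = comorbidity_ratios(toy_cohort)

# Keep pairs seen more than twice as often as independence would predict.
candidates = sorted(p for p, r in ratios.items() if r > 2)
```

A ratio alone says nothing about statistical significance, particularly for rare codes, which is presumably why the researchers paired their score with a false discovery rate.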

To identify possible molecular causes for candidate pairs, the researchers mapped some of the correlations to associated genes in the Online Mendelian Inheritance in Man database by exploring gene overlaps in protein interaction networks that are already linked to the individual diseases.

They did this by extracting genes from OMIM that were known to be associated with the diseases in the 802 candidate pairs. For each disease, they created a protein-protein interaction network "by determining the first-order interactions of those genes in refined experimental proteomics data," and then searched for shared interactions between the networks of the two diseases in a pair.
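
A minimal Python sketch of this kind of overlap search follows, assuming each disease starts from a set of seed genes and that interactions are available as undirected pairs. The function names and the placeholder protein labels are hypothetical; they do not come from OMIM or from the team's proteomics data.

```python
def first_order_network(seed_genes, interactions):
    """Return the seed genes plus every protein that interacts directly with one."""
    seeds = set(seed_genes)
    network = set(seeds)
    for a, b in interactions:  # each interaction is an undirected pair
        if a in seeds:
            network.add(b)
        if b in seeds:
            network.add(a)
    return network


def shared_proteins(genes_a, genes_b, interactions):
    """Proteins that appear in the first-order networks of both diseases."""
    return (first_order_network(genes_a, interactions)
            & first_order_network(genes_b, interactions))


# Placeholder labels only; none of these are real database identifiers.
toy_interactions = [
    ("RECEPTOR_X", "DISEASE_A_GENE"),
    ("RECEPTOR_X", "DISEASE_B_GENE"),
    ("DISEASE_B_GENE", "UNRELATED_PROTEIN"),
]

overlap = shared_proteins({"DISEASE_A_GENE"}, {"DISEASE_B_GENE"}, toy_interactions)
print(overlap)
```

Here the two diseases share no seed genes, yet their first-order networks meet at a single protein that interacts with both. A shared interactor of this kind is the sort of link that could hint at a common mechanism.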

In one example described in the PLoS paper, the team found that thyroid hormone receptor interacts with a zinc finger transcription factor protein involved in alopecia and estrogen receptor 1, which is associated with migraines. This association suggests that these two diseases might share a "similar molecular mechanism of action."

The researchers also grouped 1,497 patients into 26 clusters based on patient-patient similarity. For example, schizophrenia seemed to be a "strong component" in several clusters, and alcohol and drug use were also characteristic of patients in these groups, leading the researchers to conclude that this type of abuse is a "good sub-stratification" of schizophrenia.
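
One common way to score patient-patient similarity from weighted code profiles is cosine similarity, sketched below in Python. The article does not say which similarity measure or clustering algorithm the researchers used, so both the measure and the toy profiles here are assumptions made for illustration.

```python
import math


def cosine_similarity(profile_u, profile_v):
    """Cosine similarity between two sparse profiles stored as code -> weight dicts."""
    dot = sum(w * profile_v[code] for code, w in profile_u.items() if code in profile_v)
    norm_u = math.sqrt(sum(w * w for w in profile_u.values()))
    norm_v = math.sqrt(sum(w * w for w in profile_v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical weighted profiles for three patients.
profiles = {
    "p1": {"F20": 0.8, "F10": 0.6},
    "p2": {"F20": 0.8, "F10": 0.6},
    "p3": {"F32": 1.0},
}

same = cosine_similarity(profiles["p1"], profiles["p2"])
different = cosine_similarity(profiles["p1"], profiles["p3"])
```

Patients with identical profiles score 1 and patients with no codes in common score 0. A clustering algorithm would then group patients whose pairwise scores are high.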

"What is important about what we do is that we go from the patient profiles ... compute the co-morbidities ... then do patient clustering and stratification. We go from correlated diseases to overlaps in protein interaction networks," Brunak said. "We connect the phenotypic level in the patient records to the molecular level; this is what is special about this work."

Brunak and his colleagues are currently applying their method to a set of diabetes records as well as to data for a study on infertility. The group is also looking into mining data contained in biobank questionnaires, among other projects.
