CHICAGO (GenomeWeb) – Scientists at Columbia University and their colleagues have developed a tool to automatically extract phenotypic information from a patient's electronic health record for improved genetic disease testing.
Four years ago, bioinformatician and computational biologist Kai Wang, now an associate professor of pathology and laboratory medicine at Children's Hospital of Philadelphia and the University of Pennsylvania, developed Phenolyzer (short for Phenotype-Based Gene Analyzer) to find genes that may be involved in a patient's condition based on the patient's phenotypic information. Now, Wang, working with former colleagues at the Columbia University Vagelos College of Physicians and Surgeons, is trying to automate the process by linking the technology to electronic health records.
The result is EHR-Phenolyzer, which adds natural language processing to analyze unstructured data in EHRs and other sources, including laboratory information systems, to map the information to the Human Phenotype Ontology.
"The purpose of our tool is to automate the entire procedure so that the unstructured text from genetic counselors can be organized into standardized ontologies that can be included in the order requisition form," said Wang, who moved to CHOP at the end of 2017. "The phenotypic information in the form of standard ontologies can be used together with genome or exome sequencing data to improve the discovery of disease causes."
In particular, the tool is designed to help diagnostic labs find disease-causing genes more readily by providing them access to better phenotypic data for their patients.
"All of us know that phenotypic information is very important to make a genetic diagnosis and to separate genome sequencing data," Wang said. "But, in reality, in a lot of diagnostic settings, phenotypic information is either not available to the diagnostic lab or available only as an ICD-9 code or just a couple of words like 'epilepsy' or 'childhood neurology.' The extra detail is not provided to the person who is analyzing the genome sequence data."
"It is a major challenge for them to identify the disease-causing gene. Recognizing this particular challenge in the diagnostic setting, we decided to develop this tool so that we can best leverage what is already available in the electronic health record," he said.
Wang said that text from genetic counselors' interviews with patients should be "used in a standard way to help interpret genome or exome sequencing data to help improve the diagnostic yield and shorten the diagnostic turnaround time." That is where EHR-Phenolyzer comes in.
He and his colleagues described EHR-Phenolyzer in an article published online yesterday in the American Journal of Human Genetics.
According to the paper, EHR-Phenolyzer is an "automated EHR-narrative-based phenotyping pipeline, to enable phenotype-based gene prioritization."
The main goal of the study was to prove that "deep phenotyping information" mined from EHRs and other health IT systems can improve the association of genetic variants from whole-exome and whole-genome sequencing data with disease symptoms and presentations.
"Our secondary goal was to perform a comparative analysis of well-tested natural language processing (NLP) systems in parsing EHR narratives for phenotype extraction and normalization and to evaluate the ability of EHR-Phenolyzer to analyze real-world EHR data and prioritize candidate genes from WES of positively diagnosed individuals," the authors wrote.
According to the paper, EHR-Phenolyzer involves two steps. The first step recognizes HPO concepts, aided by NLP, through either MetaMap, a public tool, or the proprietary tool MedLEE. The second step uses Phenolyzer to prioritize genes.
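The two-step flow described in the paper can be illustrated with a minimal Python sketch. This is not the EHR-Phenolyzer code and does not call MetaMap, MedLEE, or Phenolyzer. It assumes a toy phrase lexicon in place of an NLP engine and a made-up term-to-gene table with placeholder gene names (GENE_A, GENE_B, GENE_C); the function names are hypothetical. Only the three HPO identifiers are real ontology terms.

```python
from collections import Counter

# Step 1 stand-in: a toy phrase-to-HPO lexicon (assumption; the real pipeline
# uses an NLP engine such as MetaMap or MedLEE for concept recognition).
PHRASE_TO_HPO = {
    "seizure": "HP:0001250",                     # Seizure
    "global developmental delay": "HP:0001263",  # Global developmental delay
    "microcephaly": "HP:0000252",                # Microcephaly
}

# Step 2 stand-in: hypothetical HPO-to-gene links with placeholder gene names
# (assumption; Phenolyzer draws on curated knowledge sources instead).
HPO_TO_GENES = {
    "HP:0001250": ["GENE_A", "GENE_B"],
    "HP:0001263": ["GENE_A", "GENE_C"],
    "HP:0000252": ["GENE_A"],
}


def extract_hpo_terms(note):
    """Return the sorted HPO IDs whose lexicon phrase occurs in the note."""
    text = note.lower()
    return sorted({hpo for phrase, hpo in PHRASE_TO_HPO.items() if phrase in text})


def prioritize_genes(hpo_terms):
    """Rank genes by how many of the patient's HPO terms they are linked to."""
    scores = Counter()
    for term in hpo_terms:
        for gene in HPO_TO_GENES.get(term, []):
            scores[gene] += 1
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


note = "Patient has a history of seizures and microcephaly."
terms = extract_hpo_terms(note)
ranked = prioritize_genes(terms)
print(terms)
print(ranked)
```

Simple substring matching is used here only to keep the sketch short; it ignores negation, context, and synonyms, which are the problems dedicated clinical NLP engines are meant to handle.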
MedLEE is from Columbia bioinformatician Carol Friedman, while MetaMap is an NLP engine offered by the US National Library of Medicine.
"We tested how both of them can achieve a comparable performance on the same set of notes," said the study's co-leader, Chunhua Weng, a member of the Data Science Institute at Columbia University. She described the comparison as a kind of feasibility test for "accepting meaningful, relevant HPO concepts."
They found that MetaMap generated an average of 17.6 HPO terms per record, while MedLEE produced an average of 19.4. Both are significantly higher than the 11 terms that manual chart extraction turned up on average.
CHOP's Wang called the research a "proof-of-concept study demonstrating the feasibility of using EHR information and integrating phenotype and genotype information to improve diagnosis of patients, and ultimately improve healthcare."
According to the paper, they tested EHR-Phenolyzer on 28 pediatric patients with confirmed diagnoses of single-gene diseases. The tool ranked the genes with disease-causing variants among the top 100 genes in 16 individuals, or about 57 percent.
"The fact that about 50 percent of diagnoses can be narrowed to the top 100 genes on the basis of only phenotype information documented in the EHR is remarkable, especially because this performance can be achieved by completely automated phenotype-concept-recognition methods," the authors wrote.
"We believe that deep phenotypes from EHR data are valuable with the increasing adoption of genomics testing. Improving the prior probability of a diagnosis increases the positive predictive value of a test, although current genomic testing methods tend to forgo this step," they wrote. "Therefore, systematic integration of EHR-phenotype-based gene prioritization before variant interpretation can potentially improve workflow efficiency and help reach clinically valid results while improving diagnostic yield."
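The authors' point about prior probability follows from Bayes' rule: for a test with fixed sensitivity and specificity, the positive predictive value rises as the pre-test probability rises. The short Python sketch below works through that arithmetic. The sensitivity, specificity, and prior values are illustrative assumptions, not figures from the paper, and the function name is hypothetical.

```python
def positive_predictive_value(prior, sensitivity, specificity):
    """PPV = P(disease | positive test), computed with Bayes' rule."""
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


# Illustrative numbers only (assumptions): the same test applied at a low
# prior and at a higher prior, as might follow phenotype-based narrowing.
low_prior_ppv = positive_predictive_value(prior=0.01, sensitivity=0.95, specificity=0.95)
high_prior_ppv = positive_predictive_value(prior=0.10, sensitivity=0.95, specificity=0.95)

print(f"PPV at 1% prior:  {low_prior_ppv:.3f}")
print(f"PPV at 10% prior: {high_prior_ppv:.3f}")
```

With these assumed values, the same test is far more likely to be right about a positive call when the candidate set has already been narrowed, which is the efficiency argument the authors make for prioritizing genes before variant interpretation.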
The researchers verified their results by testing the technology at the Mayo Clinic. This also served to show that EHR-Phenolyzer is compatible with multiple EHRs. Columbia University uses technology from Allscripts Healthcare Solutions, while CHOP and Mayo have Epic Systems EHRs.
Like the baseline study, the verification work required some manual labor.
"Even the same EHR system like Epic may store the genetic counselor's notes with different mechanisms," Wang said. "This is where some customization needs to be made for researchers who want to use the tool. They need to figure out in their own healthcare systems how the information is stored" and how to extract that data.
In building EHR-Phenolyzer, Wang and Weng undertook significant manual chart extraction to prepare and test datasets to train the NLP engines in hopes of someday taking the human element out of the process.
"Our goal is to make all of this fully automated so that one command line can retrieve relevant genetic counselors' reports from the EHR and automatically convert them into a standard set of scientific terms and include those terms in requisition order forms. Then, the patient sample together with phenotypic information can be analyzed by a diagnostic lab to pick out the underlying cause of the disease," Wang said.
Automation, he said, would improve diagnostic yield and shorten the time it takes to diagnose a hereditary disease.
"This EHR-Phenolyzer system leverages existing data resources, so it's much more efficient and cheaper than other studies," added Columbia bioinformatician Weng. "This is a good case of secondary use of existing EHR data. It allows us to derive rich phenotype [data] about a patient for genomic medicine."
"If you can get a lot of the phenotype from an EHR, then you can have better knowledge about disease," she said.
The technology certainly could be useful for diagnostics, according to Benjamin Solomon, managing director of the GeneDx subsidiary of BioReference Laboratories.
"I do think this is an exciting development," Solomon said. "A lot of ways of figuring out a diagnosis is like a game of Telephone," in which each person who passes on relevant information does so in a slightly different manner, inevitably changing the interpretation from start to finish.
An approach like EHR-Phenolyzer reduces some of what Solomon called "human fuzziness" and potential loss of information along the path from patient to clinician to EHR to lab. "I like it because it's close to the source," he said.
EHR-Phenolyzer is not unique, in that others are also trying to solve the problem of mapping EHR data to the HPO. Solomon specifically mentioned a Vanderbilt University Medical Center study published earlier this year. That research, however, relied on billing codes rather than clinical notes.
Both approaches can be useful for reaching the goal of improving diagnosis while saving time. "No one is going to solve this all by themselves," Solomon said.
Long-term plans for EHR-Phenolyzer may include commercialization of the technology as a component of popular EHR systems, according to Wang, but that could be years away. In the meantime, testing and refinement will continue at Columbia and CHOP.
"I think there are multiple components that may be improved in the future, for example, the exact selection of NLP software or whether more customized NLP tools that are specifically designed for HPO analysis may achieve better results," Wang said.
The developers will consider whether EHR-Phenolyzer might be able to incorporate more than just counselors' notes, such as lab test results, medical imaging, and data points like age of disease onset.
"It's also possible that some patients may have self-reported phenotype information or phenotypic presentations that are not in the genetic counselor's notes. Maybe that will be helpful to reach candidate genetic diagnoses as well that can be used in conjunction with genetic data to make a final diagnosis," Wang surmised.