Scientists affiliated with the Electronic Medical Records and Genomics Network, or eMERGE, this week published the first results of an ongoing study to determine whether data from electronic medical records can be used to identify disease phenotypes with sufficient statistical power for use in genome-wide association studies.
The team reported in a paper published in Science Translational Medicine that data captured in EMRs as part of routine clinical care proved "adequate to define five disease phenotypes across five different study sites" with "robust" positive and negative predictive values for use in GWAS.
The researchers also highlighted the importance of natural language processing tools as a "critical" resource for extracting key data from text documents.
Led by Abel Kho, an assistant professor of medicine and associate director of the medical informatics program at Northwestern University, the eMERGE team used algorithms developed specifically for the project to mine EMR data from five institutions participating in the study and used it to identify patients in each of five disease groups: dementia, cataracts, peripheral arterial disease, type 2 diabetes, and cardiac conduction defects.
Launched in 2007, eMERGE is funded by the National Human Genome Research Institute and includes five participating sites — the University of Washington, the Marshfield Clinic, the Mayo Clinic, Northwestern University, and Vanderbilt University — and a coordinating center at Vanderbilt.
As part of the project, each participating institute selected a disease phenotype and developed algorithms to identify it from EMR data. These algorithms included methods for natural language processing, structured data extraction, and free-text searches.
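To give a sense of how structured data extraction and a free-text search can be combined, here is a minimal Python sketch. It is not any eMERGE site's algorithm: the code set, the search pattern, and the `find_evidence` helper are assumptions made for this illustration, and real NLP pipelines also deal with negation and context, which this sketch ignores.

```python
import re

# Illustrative placeholders only, not the criteria used by any eMERGE site.
# 250.00 and 250.02 are ICD-9-CM diagnosis codes associated with type 2 diabetes.
DIAGNOSIS_CODES = {"250.00", "250.02"}
TEXT_PATTERN = re.compile(r"\btype\s*(?:2|ii)\s+diabetes\b", re.IGNORECASE)


def find_evidence(codes, note_text):
    """Return the set of evidence sources that point to the phenotype."""
    evidence = set()
    # Structured data extraction: look for a matching billing/diagnosis code.
    if DIAGNOSIS_CODES & set(codes):
        evidence.add("structured")
    # Free-text search: look for a mention in the clinical note.
    if TEXT_PATTERN.search(note_text):
        evidence.add("free_text")
    return evidence


# Hypothetical patient: 401.9 is the ICD-9-CM code for unspecified hypertension.
patient_codes = ["401.9", "250.00"]
patient_note = "Follow-up for Type 2 diabetes; metformin continued."
print(sorted(find_evidence(patient_codes, patient_note)))
```

Keeping track of which source produced the evidence makes it possible to require agreement between sources, or to measure how much each one contributes.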
For the work described in the Science Translational Medicine paper, the team collected data on diagnoses, medications, procedural codes, laboratory tests, radiology test results, and ECG report results as well as information on demographics, family and smoking history, height, and weight.
To ensure accuracy, the team checked the results from the EMR-derived phenotypes against traditional methods of diagnosis such as doctors' notes and medical charts.
They also set minimum requirements for the data collected. For instance, each patient had to have two documented clinical visits. Furthermore, the researchers built quality-control measures into their algorithms, such as practical limits on values like height and weight, Kho said.
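As a rough illustration of what such minimum requirements and range checks might look like in code, here is a minimal Python sketch. The record fields, the `passes_quality_control` function, and the specific height and weight bounds are hypothetical choices for this example; the thresholds the eMERGE team actually used are not given here.

```python
from dataclasses import dataclass
from typing import Optional

MIN_VISITS = 2  # minimum number of documented clinical visits
# Hypothetical plausibility limits, chosen only for this sketch.
HEIGHT_CM_RANGE = (50.0, 250.0)
WEIGHT_KG_RANGE = (20.0, 350.0)


@dataclass
class PatientRecord:
    patient_id: str
    visit_count: int
    height_cm: Optional[float] = None
    weight_kg: Optional[float] = None


def in_range(value, bounds):
    """Return True if the value is missing or lies within the inclusive bounds."""
    if value is None:
        return True  # assumption: a missing value is not treated as an error
    low, high = bounds
    return low <= value <= high


def passes_quality_control(record):
    """Apply the visit minimum and the plausibility limits to one record."""
    if record.visit_count < MIN_VISITS:
        return False
    return (in_range(record.height_cm, HEIGHT_CM_RANGE)
            and in_range(record.weight_kg, WEIGHT_KG_RANGE))


# Hypothetical records; "C" carries a height that looks like a unit or entry error.
records = [
    PatientRecord("A", visit_count=3, height_cm=170.0, weight_kg=70.0),
    PatientRecord("B", visit_count=1, height_cm=165.0, weight_kg=60.0),
    PatientRecord("C", visit_count=4, height_cm=1700.0, weight_kg=80.0),
    PatientRecord("D", visit_count=2),
]
kept = [r.patient_id for r in records if passes_quality_control(r)]
print(kept)
```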
The researchers reported that they were able to use EMR data alone to correctly identify known disease phenotypes with positive predictive values close to 100 percent for four of the study sites. The exception was the Group Health cohort, in which the positive predictive value for dementia was 73 percent.
Kho explained to BioInform that Group Health's much lower results could be attributed to the complex nature of dementia. Standard methods of diagnosis for the condition often rely on unstructured information in doctors' notes, which is difficult to capture in a more structured format that a computer can consume.
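For readers unfamiliar with the metrics, positive predictive value is the share of algorithm-flagged cases that manual review confirms, and negative predictive value is the share of algorithm-cleared patients who are confirmed not to have the condition. The sketch below uses hypothetical counts, chosen only so that the positive predictive value comes out at the 73 percent figure cited for dementia; they are not the study's actual numbers.

```python
def positive_predictive_value(true_positives, false_positives):
    """Fraction of algorithm-flagged cases confirmed by manual review."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no flagged cases; PPV is undefined")
    return true_positives / flagged


def negative_predictive_value(true_negatives, false_negatives):
    """Fraction of algorithm-cleared patients confirmed not to have the condition."""
    cleared = true_negatives + false_negatives
    if cleared == 0:
        raise ValueError("no cleared patients; NPV is undefined")
    return true_negatives / cleared


# Hypothetical chart review: 100 flagged cases, of which 73 are confirmed.
ppv = positive_predictive_value(true_positives=73, false_positives=27)
print(f"PPV = {ppv:.0%}")
```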
The team also assessed the ability of natural language processing to improve identification rates, comparing cases at Vanderbilt University that were identified in EMR data using structured data alone with cases identified using both structured data and NLP.
The researchers were able to identify 129 percent more cases of QRS duration, a measure of cardiac conduction, using NLP than with structured data and string matching alone, while maintaining a positive predictive value of 97 percent.
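To make the arithmetic behind "129 percent more" concrete: a relative increase of 129 percent means the NLP-assisted approach found about 2.29 times as many cases as the baseline. The counts in this sketch are hypothetical, picked to reproduce that ratio; the underlying case numbers are not given here.

```python
def percent_more(baseline_count, improved_count):
    """Relative increase of improved_count over baseline_count, in percent."""
    if baseline_count <= 0:
        raise ValueError("baseline count must be positive")
    return 100 * (improved_count - baseline_count) / baseline_count


# Hypothetical counts chosen to reproduce the reported relative increase.
structured_only = 100
structured_plus_nlp = 229
increase = percent_more(structured_only, structured_plus_nlp)
print(f"{increase:.0f} percent more cases")
```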
In other results, the eMERGE team reported that the EMR-based approach worked in spite of differences between home-grown and commercial systems. While the proprietary nature of commercial systems made it difficult to pinpoint exactly what those differences were, the similarity of the results across both kinds of system suggests that there is a "potential for broad dissemination of our approach to identify cases and controls for genetic analyses to achieve well-powered studies."
This bodes well for vendors since commercial EMRs are likely to be more widely adopted in routine clinical care than systems built from the ground up, Kho said.
The next steps are to apply the approach to new diseases and make the algorithms widely portable so that other clinics can use them, Kho said, as well as "to see whether genetic data can be pushed back into the clinical setting" and whether it has an effect on clinical care.
The work described in the Science Translational Medicine paper is based on the first phase of the eMERGE consortium, and NHGRI is currently planning for the second phase of the project.
In a funding announcement issued last summer, NHGRI said it plans to award $22 million over four years to eight study investigators and another $3.5 million to a single coordinating center beginning July 1, 2011 (BI 07/23/2010).
Phase II of eMERGE "will begin to incorporate current genomic knowledge combined with available genotyping data and state-of-the-art electronic phenotyping and privacy protection methods into clinical research and ongoing clinical care," NHGRI said, as well as "expand the phenotype library and ensure its transferability outside eMERGE; increase the diversity of patients and settings; and incorporate GWA results in these patients into their EMRs for clinical use."