NEW YORK (GenomeWeb News) – In a paper appearing online last night in the Proceedings of the National Academy of Sciences, a team of researchers from Vanderbilt University described their method for decreasing the risk of individual re-identification when using data from electronic medical records in genome-wide association studies.
The approach, called Utility Guided Anonymization of Clinical Profiles, or UGACLIP, involves generalizing some of the diagnostic information housed in electronic medical records to make it more abstract and anonymous — a method the team validated using data from nearly 3,000 patient electronic medical records housed at the Vanderbilt University Medical Center.
"Our approach automatically extracts potentially linkable clinical features and modifies them in a way that they can no longer be used to link a genomic sequence to a small number of patients, while preserving the associations between genomic sequences and specific sets of clinical features corresponding to GWAS-related diseases," senior author Bradley Malin, a biomedical informatics researcher at Vanderbilt University, and his co-authors wrote.
Electronic medical records have the potential to help scale up GWAS while keeping costs down, lead author Grigorios Loukides, a research fellow in Malin's Vanderbilt lab, told GenomeWeb Daily News. But, he added, when researchers use this data they are faced with the challenge of gleaning as much pertinent information from these records as possible while still protecting individuals' privacy.
Each individual's combination of conditions and clinical features is quite distinct and often turns up in multiple repositories, so it may be possible to link an individual to his or her genetic profile based on the diagnostic codes used in the GWAS, Malin told GWDN. That means the privacy of that genetic data could be compromised when researchers make their data available to other members of the community.
For instance, the team found that for a group of roughly 3,000 patients selected from a pool of more than a million individuals, they could identify individuals based on their combination of diagnostic codes almost 97 percent of the time.
"Basically what we find is that some combination of diagnostic codes can distinguish a patient," Loukides said.
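To illustrate the kind of uniqueness check Loukides describes, here is a minimal Python sketch that counts how many patients in a small table have a combination of diagnostic codes that no other patient shares. The patient IDs, the ICD-9-style codes, and the function name `fraction_unique` are all made up for illustration; this is not the team's code or data.

```python
from collections import Counter

def fraction_unique(records):
    """Return the share of patients whose exact set of codes is shared by no one else.

    `records` maps a (hypothetical) patient ID to a set of diagnostic codes.
    """
    # Count how many patients carry each distinct combination of codes.
    counts = Counter(frozenset(codes) for codes in records.values())
    unique = sum(1 for codes in records.values() if counts[frozenset(codes)] == 1)
    return unique / len(records)

# Toy data, invented for this example.
toy_records = {
    "p1": {"250.01", "401.1"},
    "p2": {"250.02", "401.1"},
    "p3": {"250.01", "401.1"},
    "p4": {"493.20"},
}

print(fraction_unique(toy_records))
```

In this toy table, p1 and p3 share the same combination, while p2 and p4 each stand alone, so half of the patients could be singled out by their codes.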
For the current study, which was funded by the National Human Genome Research Institute and the National Library of Medicine, the researchers developed a way to exploit the clinical coding hierarchies that already exist in diagnostic criteria, Malin explained, generalizing the clinical features so that they no longer point to one individual's medical record. That, in turn, aims to ensure individual genetic privacy, while still maintaining enough information to allow sharing and verification of the data.
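The general idea of using a coding hierarchy can be sketched in a few lines of Python. The example below assumes ICD-9-style codes in which the text before the dot names a broader parent category, and it simply replaces every code with that parent. This is a deliberately crude stand-in, not the UGACLIP algorithm: by the paper's description, the actual method is guided by utility and is meant to preserve the associations needed for GWAS, whereas this sketch generalizes everything. All names and data here are hypothetical.

```python
from collections import Counter

def generalize(code):
    """Map an ICD-9-style code to its parent category (the part before the dot)."""
    return code.split(".")[0]

def generalize_records(records):
    """Replace every code in every record with its parent category."""
    return {pid: {generalize(c) for c in codes} for pid, codes in records.items()}

def smallest_group(records):
    """Size of the smallest group of patients sharing the same set of codes."""
    counts = Counter(frozenset(codes) for codes in records.values())
    return min(counts.values())

# Toy data, invented for this example.
toy_records = {
    "p1": {"250.01", "401.1"},
    "p2": {"250.02", "401.1"},
    "p3": {"250.01", "401.9"},
}

before = smallest_group(toy_records)                      # someone is unique
after = smallest_group(generalize_records(toy_records))   # all share {"250", "401"}
print(before, after)
```

Here every patient starts out with a distinct combination of codes, and after generalization all three share the same pair of parent categories. The trade-off is that the finer distinctions among the original codes are lost, which is the tension between privacy and usefulness that the researchers describe.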
The team then tested this UGACLIP algorithm on two real patient data sets from Vanderbilt University Medical Center's electronic medical records system — one that included 2,762 individuals being studied as part of an ongoing GWAS and another smaller set of 1,335 medical records — to see how much information was retained after applying the algorithm and whether this data would still be useful for GWAS validation studies.
They found that the algorithm did decrease the risk of linking an individual to their GWAS data while maintaining much of the clinical and diagnostic information needed to permit data sharing and follow-up studies, Malin said.
Overall, he and his colleagues wrote, the experiments "verify that our approach generates data that eliminate the threat of individual re-identification, while supporting GWAS validation and clinical case analysis tasks."
Even so, the researchers are still working to improve the algorithm. They also plan to develop software that can be employed by other investigators using electronic medical record data in GWAS.