Family Histories for PheWAS May Already Be in Patient Records


CHICAGO (GenomeWeb) – All the data necessary to create family pedigrees for genetic research may already be contained in electronic health records, according to proof-of-principle research presented at the American Medical Informatics Association annual conference in San Francisco earlier this month.

Bioinformaticians led by graduate student Xiayuan Huang at the University of Wisconsin-Madison have developed a computing method — called Logistic Regression on Familial Relatedness (LRFR) — to mine through millions of records in an EHR system and automate the process of building pedigrees. This automation could make pedigree analysis more like phenome-wide association studies, helping narrow down subject pools for genome-wide association studies and other research, according to Huang.

The goal of the study was to demonstrate that family pedigrees may be rapidly predicted from EHR systems without human intervention, according to a paper and a poster presented by researchers from UW, Case Western Reserve University, and the Marshfield (Wisconsin) Clinic. The peer-reviewed paper was published in the official AMIA conference proceedings.

This new study is an outgrowth of work published in February 2018 in the journal Bioinformatics, in which the researchers described their family pedigree prediction algorithm and the LRFR method.

The researchers make their predictions from demographic data already in the EHR without needing to manually collect additional family histories from patients. "You don't need to enter any family background," Huang told GenomeWeb.

Using this method, the researchers were able to identify with high probability 173,368 family pedigrees, as well as 579,561 individuals linked to these pedigrees.

"With the anticipated widespread application of genomic medicine, in combination with methods capable of predicting family pedigrees linked to extensive and longitudinal phenotypic data, a revolution may occur in how human genetic research is conducted for the advancement of precision medicine," the authors added.

Huang and colleagues designed a decision tree-type algorithm to sort through some 2.6 million records in Marshfield Clinic's homegrown EHR, known as CattailsMD, which is available on the commercial market through the organization's for-profit Marshfield Clinic Information Systems subsidiary. The longitudinal patient records date to 1984, while the associated billing records, coded in ICD-9, date to 1979 and cover multiple locations throughout Wisconsin.

The Bioinformatics article covered a test run of just 29 ICD-9 codes. At AMIA, the researchers presented work involving more than 2,000 diseases that correspond to ICD-9 billing codes.

From the large dataset, researchers extracted surname, address, account information, age, gender, and other demographic data points, then ran the information through their decision tree to predict familial relationships.

The system looks for phenotypic relations between two people that could signify familial relations. "If you have [a certain form of] cancer, then other members of your family have a higher probability of getting that cancer," Huang said. "We can extract that health relatedness within a pair of individuals."

"We can capture some significant signals of heritable phenotypes," Huang added. "Using predicted family pedigree information and our LRFR model, we can discover that some phenotypes are heritable, but they may not [have been] discovered to be heritable before, and we can also separate heritable diseases from non-heritable diseases."

The first pass through the decision tree — filtering for common home addresses and last names — turned up nearly half a million parent-child relationships and more than 100,000 potential sibling relationships.

Some analyses turned up several candidate parents for certain patients. Because each person can have only one biological parent of each sex, another layer of rules helped filter out extraneous matches.
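
The two passes described above can be illustrated with a minimal Python sketch. It is not the researchers' algorithm: the record fields, the age-gap window, and the tie-breaking rule (prefer the candidate whose age gap is closest to a typical value) are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical record layout; field names are illustrative, not the EHR's schema.
@dataclass(frozen=True)
class Patient:
    pid: str
    surname: str
    address: str
    age: int
    sex: str  # "F" or "M"

# Assumed plausible parent-child age gap in years (not taken from the paper).
MIN_GAP, MAX_GAP = 15, 50

def candidate_parents(patients):
    """First pass: pair people who share a surname and a home address."""
    households = defaultdict(list)
    for p in patients:
        households[(p.surname.lower(), p.address.lower())].append(p)
    candidates = defaultdict(list)  # child pid -> possible parents
    for members in households.values():
        for child in members:
            for adult in members:
                if MIN_GAP <= adult.age - child.age <= MAX_GAP:
                    candidates[child.pid].append(adult)
    return candidates

def resolve_parents(candidates, ages, typical_gap=28):
    """Second pass: keep at most one candidate parent of each sex per child.

    Assumed tie-break: the candidate whose age gap is closest to typical_gap.
    """
    resolved = {}
    for child_pid, adults in candidates.items():
        best = {}  # sex -> (score, pid)
        for adult in adults:
            score = abs((adult.age - ages[child_pid]) - typical_gap)
            if adult.sex not in best or score < best[adult.sex][0]:
                best[adult.sex] = (score, adult.pid)
        resolved[child_pid] = {sex: pid for sex, (_, pid) in best.items()}
    return resolved

# Made-up household: one child, two parents, and an aunt at the same address.
patients = [
    Patient("p1", "Olson", "12 Elm St", 8, "F"),
    Patient("p2", "Olson", "12 Elm St", 36, "F"),
    Patient("p3", "Olson", "12 Elm St", 38, "M"),
    Patient("p4", "Olson", "12 Elm St", 30, "F"),
]
candidates = candidate_parents(patients)
ages = {p.pid: p.age for p in patients}
parents = resolve_parents(candidates, ages)
print(parents)
```

In this toy household the first pass proposes three candidate parents for the child, and the second pass keeps one mother and one father. A real system would need further evidence, such as the account information mentioned above, to separate an aunt from a mother.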

"For every patient A with disease D, an instance is created for each other patient B within A's family. In this instance, the independent variable is the relatedness of A and B, and the binary dependent variable indicates whether B also has the disease D," the paper explained.

A phenome-wide application of this algorithm assessed the effectiveness of the logistic regression by comparing areas under the curve and P-values for the entire dataset in search of congenital and non-congenital factors in various diseases.
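
As a rough illustration of the instance construction the paper describes and of the area-under-the-curve comparison, the following Python sketch builds one row per ordered pair (A, B) where A has the disease, fits a one-variable logistic regression by plain gradient ascent, and scores it with AUC. The relatedness coefficients, the toy families, and the fitting procedure are assumptions for illustration; the article does not specify how the paper encodes relatedness or fits the model.

```python
import math

# Hypothetical pairwise relatedness within two families. The coefficients
# (0.5, 0.25, 0.125 for first-, second-, third-degree relatives) are an
# assumed encoding, not one confirmed by the source.
RELATEDNESS = {
    ("a", "b"): 0.5, ("a", "c"): 0.5, ("a", "d"): 0.25, ("a", "e"): 0.125,
    ("f", "g"): 0.5, ("f", "h"): 0.5, ("f", "i"): 0.25, ("f", "j"): 0.125,
}
HAS_DISEASE = {"a", "b", "c", "d", "f", "g"}  # made-up diagnoses for one disease D

def build_instances(relatedness, has_disease):
    """For every patient A with D, emit (relatedness(A, B), B has D) per relative B."""
    rows = []
    for (p, q), r in relatedness.items():
        for a, b in ((p, q), (q, p)):
            if a in has_disease:
                rows.append((r, 1 if b in has_disease else 0))
    return rows

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, lr=0.5, steps=5000):
    """Fit P(y=1 | x) = sigmoid(w0 + w1 * x) by gradient ascent on the log-likelihood."""
    w0 = w1 = 0.0
    n = len(rows)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in rows:
            err = y - sigmoid(w0 + w1 * x)
            g0 += err
            g1 += err * x
        w0 += lr * g0 / n
        w1 += lr * g1 / n
    return w0, w1

def auc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

rows = build_instances(RELATEDNESS, HAS_DISEASE)
w0, w1 = fit_logistic(rows)
scores = [sigmoid(w0 + w1 * x) for x, _ in rows]
labels = [y for _, y in rows]
auc_value = auc(scores, labels)
print(f"slope={w1:.2f}, AUC={auc_value:.4f}")
```

A positive slope means closer relatives of an affected patient are more likely to share the diagnosis. Repeating the fit for each billing code and ranking codes by AUC would be one way to compare diseases across the phenome, though shared household environment can produce the same signal as shared genes.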

"Based on the categories, we can do disease predictions, especially for genetic-related diseases," and identify good candidates for genetic testing, Huang said.

With testing on upward of 2,000 ICD-9 codes, the researchers ranked diseases by probability for each patient. "We can do disease predictions for a PheWAS, then we try to estimate all those probabilities of each phenotype for all ICD-9 codes," he said.

This helps in clinical care as much as in research, according to Huang, since family history can be a key predictor of future diseases, but clinicians don't always get a complete history.

"Capture of clinical family histories can be labor-intensive and often provides only a static snapshot of a patient’s ever-changing family history. Attaining a family history is dependent on a patient’s understanding, memory, and cooperation and is often disease-limited," Huang said. "As such, critical information can often be missed."

For example, a hypothetical 30-year-old female patient might tell her physician that her paternal grandmother died of breast cancer at age 80, while her mother died of unrelated causes at age 40. But this woman might not know that a paternal aunt and uncle also had been diagnosed with breast cancer, Huang noted.

"Unless [her] physician ever becomes aware of the now extensive family history of breast cancer, no additional interventions may be expected," he explained. "However, this information may already be readily available if her family is connected to the same EHR system."

Outside observers had mixed opinions. 

Sarah Pendergrass, a researcher with Geisinger Health System's Biomedical and Translational Informatics Institute, suggested that the LRFR method could turn up false positives. "A big issue here is that diseases are impacted by environment as well as through genetics, and without the genetic data on the family you may be seeing more the impact of shared environment," Pendergrass said via email. 

"Because there is not genetic data, there is less utility for this, unless you found something that seemed to run in a family and then decided to collect genetic data to pinpoint what is contributing genetically to this family," Pendergrass said. "There could potentially be some discoveries, but many of the highly penetrant genetic disorders that very obviously run in families already have gene candidates."

Plus, Pendergrass added, EHRs that are not absolutely complete might miss some polygenic conditions or heterogeneous diseases, including mental health and addiction. "EHR data has a lot of 'missingness,' which will impact discovery," she said.

A better approach, in her opinion, would be to improve collection of screening survey data and family histories, then add that information to EHR data linked to genetic biorepositories. "That data would be directly useful for clinicians and patients as well as researchers," Pendergrass said.

Joshua Denny, associate professor of biomedical informatics and medicine at Vanderbilt University, said that it "makes a lot of sense" to mine EHRs to find familial health relationships. 

"There is huge unleashed potential in EHRs, and doing so can provide a large resource to provide a clinical assessment of heritability to compare and combine with genetic heritability," Denny wrote in an email. "The advantage of the huge populations available in EHRs is it would cross different geographies and ancestries today better than our number of genotyped subjects do. Then, by comparing between genotyped and clinical heritability, it may provide insights into the influence of shared social/environmental factors on heritability."

Denny recently co-authored a commentary in Cell about a similar EHR-mining project led by Columbia University, also detailed in Cell.

Following the AMIA presentation, the University of Wisconsin has begun working on extending the LRFR methodology to GWAS research.

"Since our method can capture some signals of heritable diseases on phenome-wide [studies], then we are thinking to pick those top significant heritable diseases we found based on our model, then associate those heritable diseases to some unknown SNPs," Huang said. "Maybe those heritable diseases are sharing the same piece of genetic information, such as the same SNPs," he surmised.

Meanwhile, Marshfield Clinic is applying the technology in practice to build what has been dubbed the Large Family Pedigree Cohort, according to Huang.

"We are in the process of testing and improving our family-pedigree software," he said. "Our goal is that the final version of this software can automatically predict and keep tracking and storing family history when the EHR system is updated."