Statistical Method Enables Gene-Phenotype Association Analysis in Heterogeneous UK Biobank Data


NEW YORK (GenomeWeb) – Researchers from the University of Oxford and elsewhere have published details of a mathematical framework that they developed for identifying statistically significant associations between genetic variants and human phenotypes in routine healthcare data contained in biobanks.

According to a paper published this week in Nature Genetics, the method, called TreeWAS, provides a way to "interrogate increasingly specific subphenotypes while retaining statistical power to detect genetic associations." The method takes advantage of "the hierarchical structure" of diagnosis classification trees to analyze genetic variants against UK Biobank disease phenotypes culled from patients' reports and hospital episode statistics.

The researchers claim that their method provides "a more than 20 percent increase in power" to detect genetic effects over other approaches including phenome-wide association studies and successfully "identifies new associations between classical human leukocyte antigen alleles and common immune-mediated diseases."

In an interview, Gil McVean, a professor of statistical genetics in the Nuffield Department of Medicine at the University of Oxford and one of the lead authors on the paper, said that TreeWAS' development was motivated by a desire to incorporate diverse data types into gene-phenotype association studies. Integrating genomic data with routine healthcare information can help researchers better understand disease risk and improve disease diagnosis and reporting.

With the release of the UK Biobank, which contains data from half a million individuals, researchers have access not just to genetic information but also to patient records and self-reported data, which they could use to better understand the links between genes and complex diseases. With the UK Biobank data, "we have the potential not to peer into one particular disease that's been defined by a clinician in one particular way," McVean said. "We have access to a huge array of very diverse big data types such as information from self-reporting or hospital records or from primary care or prescription… so the background to this work was trying to ask the question 'how do you analyze how genetics influences the entire spectrum of human traits that you can pick up through this big data?'"

Access to the biobank data makes it possible to explore in greater detail how genetics influences an entire spectrum of human phenotypic traits rather than focusing on very specific questions about how genetics influences human disease and phenotype, McVean explained. It is a question that researchers can try to answer on a gene-by-gene basis, but that approach has very low power in large datasets, he said.
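The loss of power from testing each phenotype separately can be illustrated with a standard Bonferroni correction. The short Python sketch below is not taken from the paper; the function name and the numbers of phenotype codes are hypothetical and chosen only to show how the per-test significance threshold shrinks as the number of tests grows.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Hypothetical numbers of phenotype codes tested against a single variant.
for n_codes in (1, 100, 10_000):
    print(n_codes, bonferroni_threshold(0.05, n_codes))
```

The more codes a variant is tested against, the stronger the evidence each individual code must show, which is especially hard for rare, highly specific diagnoses.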

"Because you are doing a huge number of tests, often by the time you've diced and sliced the data down to the level of the [patient] records, you are dealing with very few alterations," he said. Also, current approaches for performing genetic association studies fail when they are simultaneously applied to large quantities of heterogeneous data.

To get around these issues, "you have to use some of the structure that's present within the data to give you power," McVean explained. "What we did was exploit a very particular feature of that structure which is that a lot of human phenotype data is encoded in these hierarchical or tree-like ontologies… that provide increasingly specific views on a person's disease presentation."

For example, a group of patients might have been diagnosed with nervous system disorders. Within that group, there are patients with a neurodegenerative disease like multiple sclerosis. For a subset of the group, there could be data on first onset of the disease and current status of the patient.
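The tree-like structure described above can be sketched in a few lines of Python. This is only an illustration of the data structure, not the TreeWAS model itself; the node labels, the `PARENT` map, and the patient records are hypothetical and are not real ICD-10 entries. The sketch credits each diagnosis to all of its ancestors, so broad categories accumulate more cases while specific terms stay small.

```python
from collections import Counter

# Hypothetical child -> parent map for a tiny diagnosis tree.
PARENT = {
    "nervous_system_disorders": None,
    "demyelinating_diseases": "nervous_system_disorders",
    "multiple_sclerosis": "demyelinating_diseases",
    "epilepsy": "nervous_system_disorders",
}


def with_ancestors(node):
    """Return the node followed by every ancestor up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def cases_per_node(patient_diagnoses):
    """Count patients at each node, crediting a diagnosis to all its ancestors."""
    counts = Counter()
    for diagnoses in patient_diagnoses.values():
        nodes = set()
        for diagnosis in diagnoses:
            nodes.update(with_ancestors(diagnosis))
        counts.update(nodes)  # each patient counts once per node
    return counts


# Hypothetical patient records keyed by patient ID.
patients = {
    "p1": ["multiple_sclerosis"],
    "p2": ["epilepsy"],
    "p3": ["multiple_sclerosis", "epilepsy"],
    "p4": ["demyelinating_diseases"],
}
counts = cases_per_node(patients)
```

In this toy example the root node covers all four patients, while the most specific term covers only two, which is the trade-off between specificity and case numbers that a tree-aware analysis tries to manage.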

"We developed a way of essentially looking at the entirety of that data in one go … which basically allows me to ask this question of 'is this genetic variant associated to any aspects of human disease and if it is which parts…is it associated to?'" McVean explained. It allows researchers to explore not just the relationship between HLA alleles and the primary phenotypes they are associated with, for example, but also potentially related complications such as various eye diseases that might be tied to the allele and primary phenotype. The method also allows researchers to explore associations between genetic risk scores derived from genome-wide association studies and different phenotypes.

For the paper, McVean and his colleagues focused largely on assessing gene variant-phenotype associations in autoimmune diseases. TreeWAS is able to handle different kinds of genetic variation including SNPs and haplotypes in highly polymorphic regions. It also accommodates single-locus variation and supports joint analysis and quantification of association evidence at each clinical phenotype. Furthermore, TreeWAS can identify independent genetic effects and can model the correlation structure of genetic effects across clinical phenotypes using prior knowledge of phenotype relationships.

"Quite a lot of the paper focused on looking at the uniqueness of genetic risk for autoimmune diseases across the human phenotype and what differences or similarities you get when you look across the hospitalization data or the self-reported data," he said.

In one study described in the paper, the researchers explored associations between genetic risk scores and autoimmune diseases in both hospital records and patient reports. They claim that by analyzing genetic risk scores associated with autoimmune diseases they could show the extent of "genetic sharing" among these diseases as well as "expose differences in disease perception or diagnosis with potential clinical implications."

For example, the researchers found an association between thyroid disorders and increased risk for various autoimmune conditions that was not expected. "It suggests that there is a shared component among all these autoimmune diseases which also manifests, maybe at a much lower level of the problem, as disorders of the thyroid," McVean said.

They also reported differences in some cases between self-reported data and hospitalization data. "We devised a way of saying how close to clinical phenotype is the information that you get from routine hospital data or self-reporting, and typically the accuracy was higher from self-reporting than it was from hospitalization," McVean said. This was true for cases of patients with multiple sclerosis. What these results suggest is that "if you wanted to do a high-throughput way of trying to work out who's got multiple sclerosis, then you are much better off asking people than you are peering into their hospital records."

There were, however, some instances where hospital records proved much more useful. For example, in lupus cases, while the genetic support for the disease was quite weak, hospital records, which include details about diagnosis and hospitalizations for the disease, provided a much stronger signal.

To showcase the benefits of their method over existing strategies, the researchers compared TreeWAS results to those from phenome-wide association studies of the same allele. Specifically, they searched for associations between the HLA-B*27:05 allele and ICD-10 codes used in the UK Biobank dataset using both approaches. The aforementioned allele is associated with ankylosing spondylitis and also confers risk for conditions such as reactive arthritis and psoriatic arthritis.

Using the PheWAS approach, the researchers linked the allele to six ICD-10 codes; however, that approach missed associations with terms that had more detailed clinical descriptions and low prevalence in the data set, such as M45.X6 ankylosing spondylitis with lumbar spine involvement. In contrast, TreeWAS identified associations with 145 ICD-10 codes. That list included both known and several new associations.

For their next steps, McVean and his colleagues are looking to use their method to try to find new pathways for therapeutic interventions. They'll also seek to incorporate longitudinal data into their association analyses. 

"Now that the databank has been released, we are going through every variant in the genome and looking to see what it's associated with," McVean said. This will hopefully help the researchers to identify the underlying network structure that relates variants to genetic disorders and highlight pathways that could serve as therapeutic targets for disease. Furthermore, they want to bring in additional information such as disease onset and progression data, combine it with the genetic information and try to make better choices about which treatments to give patients as well as to improve disease diagnoses.

"I think we'll stay in the autoimmune space but the techniques that we develop are very generic [and] will be useful to people in many different spaces," McVean said. So far, the method is being used on data from US-based biobanks in New York and elsewhere as well as by researchers in China to analyze data contained in Chinese biobanks.