NEW YORK (GenomeWeb) – Researchers at the European Bioinformatics Institute have developed a statistical tool to analyze the effects of multiple genetic variations on groups of phenotypic traits in large-scale genome-wide association studies in a computationally efficient manner.
The so-called multi-trait set test (mtSet), which is described in a brief communication published this week in Nature Methods, makes complex large-scale GWAS analyses involving data from up to 500,000 individuals simpler and more manageable, according to its developers. It builds and improves on existing statistical techniques used to analyze pairs of traits, offering what its creators claim is a more computationally efficient way to analyze more traits and larger cohorts than is possible with existing methods.
The EBI-developed tool provides a "principled approach" to testing for statistical relationships between multiple genetic variants and phenotypic traits, Oliver Stegle, a research group leader at EMBL-EBI and a co-author on the mtSet paper, said in a statement. That's a step up from a number of existing methods which are designed to analyze the effects of multiple variants on a single phenotype or interactions between a single variant and multiple phenotypes. These methods use models that are "too simplistic to uncover the complex dependencies between sets of genetic variants and disease phenotypes," Stegle said. More complex models to jointly analyze pairs of traits such as multi-trait linear mixed models (LMMs) have been used in previous studies, but these become computationally heavy as the number of traits and individuals rises, according to the paper.
MtSet builds on the multi-trait LMM approach. Essentially, the method combines set tests, a type of statistical regression model, with multi-trait modeling to provide a computationally efficient way of analyzing correlations between genetic variants and multiple correlated traits while accounting for population structure and relatedness. Accounting for both is important for reducing false positive associations.
It relies on a statistical trick that makes running otherwise prohibitive computations more manageable, Stegle told GenomeWeb this week. "We leverage the fact that typically the number of variants in a genome that we are interrogating ... is smaller than the number of individuals and that insight can be exploited to scale up methods and make these computations feasible," he said. According to the paper, mtSet can be used to test sets of variants and up to ten traits in structured populations of moderate size — about 20,000 individuals. It can work with even larger cohorts if the constituent individuals are unrelated, the researchers wrote.
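The kind of saving Stegle describes can be illustrated with a standard linear algebra identity. The sketch below is not taken from the mtSet paper or its software; it is a minimal, dependency-free illustration that assumes a hypothetical genotype matrix G with n individuals (rows) and s variants (columns), and a hypothetical noise term delta. When s is much smaller than n, a system involving the n-by-n matrix G Gᵀ + delta·I can be solved through an s-by-s system instead, using the Woodbury identity. The function names are made up for this example.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def solve_direct(G, delta, y):
    """Naive route: build the n x n matrix G G^T + delta*I and solve it."""
    n = len(G)
    K = [[sum(a * b for a, b in zip(G[i], G[j])) + (delta if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, y)


def solve_low_rank(G, delta, y):
    """Woodbury route: only an s x s system is solved (s = number of variants)."""
    n, s = len(G), len(G[0])
    # G^T y, a vector of length s
    Gty = [sum(G[i][k] * y[i] for i in range(n)) for k in range(s)]
    # delta*I + G^T G, an s x s matrix
    small = [[sum(G[i][k] * G[i][l] for i in range(n)) + (delta if k == l else 0.0)
              for l in range(s)] for k in range(s)]
    z = solve(small, Gty)
    # (G G^T + delta*I)^{-1} y = (y - G z) / delta
    return [(y[i] - sum(G[i][k] * z[k] for k in range(s))) / delta
            for i in range(n)]


if __name__ == "__main__":
    rng = random.Random(42)
    n, s = 40, 3  # toy sizes: many individuals, few variants
    G = [[rng.gauss(0, 1) for _ in range(s)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    direct = solve_direct(G, 1.0, y)
    fast = solve_low_rank(G, 1.0, y)
    print("max abs difference:", max(abs(a - b) for a, b in zip(direct, fast)))
```

Both routes return the same vector up to rounding error, but the second never forms or factorizes an n-by-n matrix, so its cost grows roughly linearly with the number of individuals for a fixed number of variants. A real analysis would use an optimized numerical library rather than hand-written elimination.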
The researchers tested mtSet's performance on data from at least two studies and compared its results with those provided by two similar tools. Specifically, they compared it with stSet and two iterations of a single-variant linear mixed model approach. They also compared it with a slightly different variation of itself dubbed mtSet-PC. In one project, they looked at four lipid-related traits (LDL and HDL cholesterol levels, C-reactive protein, and triglycerides) and tried to identify variants in a single gene that could be involved in regulating a particular lipid trait. They also looked at the combined effects of variations across larger sets of lipid levels to learn more about how lipid regulation occurs. The data used for this test case came from more than 5,200 unrelated individuals from the Northern Finland Birth Cohort. The researchers also applied the methods to data from a quantitative trait loci study of more than 1,300 outbred rats, looking specifically at six traits related to basal hematology.
The results of the analysis, according to the developers, demonstrate that their method does improve on some existing approaches and successfully explains a large proportion of traits in terms of their underlying genetics. For example, in the test involving lipid trait data, mtSet identified 14 significant trait loci associations, 13 of which had been previously identified in a separate analysis of the data. In contrast, the single-variant LMM methods missed four associations detected by mtSet, while stSet missed three. The paper notes that mtSet-PC did a little better than mtSet, identifying a total of 16 QTL associations. In the rat study data, mtSet did better than both stSet and the single-variant LMMs, identifying one more QTL than both methods did.
The developers believe that the method could be a useful tool for researchers in their quest to gain new insights into the genetic underpinnings of many biological processes. Ewan Birney, associate director at EMBL-EBI, said in a statement that mtSet offers "a real advance" over existing approaches in terms of its ability to analyze multiple variants and phenotypes and its scalability. The method could be useful, he said, for analyzing the very large cohorts that make up initiatives such as the UK Biobank.
Stegle believes that the method could serve as a complement to existing GWAS analysis software. "There is no replacement to single-SNP, single-phenotype analysis; that is still state of the art in many questions," he told GenomeWeb. But the method can enhance analysis in rare variant association studies, for example, or potentially in cases where there are multiple potential causal variants in closely located genes that make it difficult to pinpoint the exact causal variants, he said. It could also be useful whenever there is a logical prior belief that multiple phenotypes share a common genetic mechanism, he added.
For their next steps, Stegle and his colleagues plan to use mtSet to explore associations between variants and phenotypes in greater depth. "Currently, we can test formally whether a group of variants [is] associated to at least one of the phenotypes that we consider ... [but] what we don't get out of this approach yet is to understand what's going on," he said. "Is it really just one of the traits that is associated with one of the phenotypes, is it all of them, is it a subset? So understanding more about the [genetic] architecture ... that's a natural extension."