NEW YORK (GenomeWeb News) – In a paper set to appear online this week in the Proceedings of the National Academy of Sciences, a team of investigators from Microsoft Research and Pacific Biosciences describes a new statistical method for overcoming confounders that can muddle expression quantitative trait locus (eQTL) analyses.
When looking for associations between SNPs and gene expression, the team explained, both population structure and microarray expression artifacts can lead to the detection of false-positive eQTLs in some cases or cause researchers to overlook authentic associations in others.
To minimize such problems and improve the detection of authentic eQTLs, the researchers developed an algorithm that corrects for population structure and expression artifact confounders together.
"With eQTL analysis where you have both SNPs and gene expression data, you have not only problems with confounders in the expression array data, but you also have confounders related to population structure that are more commonly seen in traditional genome-wide association studies," lead author Jennifer Listgarten, a researcher with Microsoft Research in Los Angeles, told GenomeWeb Daily News.
"You're sort of mixing and matching two different areas of statistics," she added. "So now we're having problems from both of these areas — each of which has known confounders. We're showing how you can tackle both of those jointly in an appropriate way."
With the advent of larger and larger genetic studies, she and her co-authors explained, it's important to recognize — and find strategies for dealing with — confounders in the data used to uncover eQTLs.
"[E]fforts are being ramped up to create much larger datasets, and so confounding factors will play an even larger (negative) role if not properly accounted for," the team explained. "[T]ackling these confounders in a rigorous way will help to pave the way for further discoveries in this burgeoning area."
For the current study, Listgarten and her co-workers developed an approach, known as LMM-EH-PS, to distinguish between authentic eQTLs and bogus associations.
The approach relies on a so-called mixed effects model, which combines the random effects of confounders with the fixed effects of SNPs.
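As a rough illustration of the mixed-model idea, and not the team's LMM-EH-PS implementation, the following Python sketch fits a generic linear mixed model in which the SNP is a fixed effect and a confounder enters as a random effect through a sample-by-sample covariance matrix. Everything in it is an assumption made for illustration: the function names, the two-group simulated population, the covariance matrix built from known group labels, and the fixed variance components (a real analysis would estimate these from the data).

```python
import numpy as np

def gls_snp_effect(y, snp, K, sigma_g2, sigma_e2):
    """Generalized least squares estimate of a fixed SNP effect.

    Assumed model (generic, for illustration only):
        y = b0 + snp * beta + u + e,
        u ~ N(0, sigma_g2 * K),  e ~ N(0, sigma_e2 * I)
    The variance components are treated as known here.
    """
    n = y.shape[0]
    X = np.column_stack([np.ones(n), snp])       # intercept + SNP
    V = sigma_g2 * K + sigma_e2 * np.eye(n)      # total covariance of y
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    XtVinvX = X.T @ Vinv_X
    beta = np.linalg.solve(XtVinvX, X.T @ Vinv_y)
    se = np.sqrt(np.diag(np.linalg.inv(XtVinvX)))
    return beta[1], se[1]                        # SNP effect and its standard error

def simulate_structured_population(n_per_group=200, seed=0):
    """Hypothetical data: two subpopulations that differ in allele
    frequency and in mean expression. The SNP has no true effect."""
    rng = np.random.default_rng(seed)
    group = np.repeat([0, 1], n_per_group)
    freq = np.where(group == 0, 0.1, 0.9)        # allele frequency per individual
    snp = rng.binomial(2, freq).astype(float)    # genotypes coded 0/1/2
    shift = np.where(group == 0, -1.0, 1.0)      # confounding shift in expression
    y = shift + rng.normal(size=group.size)
    Z = np.eye(2)[group]                         # one-hot group membership
    K = Z @ Z.T                                  # 1 if same group, else 0
    return y, snp, K

y, snp, K = simulate_structured_population()

# sigma_g2 = 0 ignores the confounder and reduces to ordinary least squares.
naive_beta, naive_se = gls_snp_effect(y, snp, K, sigma_g2=0.0, sigma_e2=1.0)
# A nonzero random-effect variance lets the model absorb the group shift.
mixed_beta, mixed_se = gls_snp_effect(y, snp, K, sigma_g2=10.0, sigma_e2=1.0)

print(f"naive: beta={naive_beta:.3f}, se={naive_se:.3f}")
print(f"mixed: beta={mixed_beta:.3f}, se={mixed_se:.3f}")
```

In this toy setup the naive fit reports a strong association even though the SNP has no effect, while the mixed model's estimate stays close to zero. The group labels are given here for simplicity; the article describes inferring confounders from the data itself, which this sketch does not attempt.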
"The paradox of high-throughput studies is you have gobs and gobs of data … and people are scratching their heads and saying, 'How do we deal with this?'" Listgarten explained. "But the flip side of having all of this data is that can infer confounders from the data itself."
By assessing real and synthetic datasets, the researchers not only highlighted the need to account for confounders, but also provided evidence that their approach can be used to accurately pick out eQTLs.
The real datasets, provided by co-author Eric Schadt, chief scientific officer at Pacific Biosciences, included human data from a homogeneous Caucasian population, which contained expression artifacts but no population structure, and data from various mouse strains, which contained both population structure and expression artifacts, Listgarten explained.
Synthetic data, meanwhile, was used to test the power of the new model and compare it with competing models such as the Inter-sample Correlation Emended (ICE) method and the Surrogate Variable Analysis (SVA) method.
Indeed, the new approach seemed to outperform these alternative methods, which are used to address population structure in human data, Listgarten noted, but are not specifically intended to correct for combined population structure and expression artifact confounders.
"There really are no developed techniques that have been applied to correct jointly for both of those," she said. "This is the first piece of work, to my knowledge, that tries to tackle both of those jointly."
The team plans to make its software freely available online. From there, Listgarten said, researchers can plug in their SNP and expression data and get back statistical information about relationships between SNPs and genes.