Name: Bryce Christensen
Title: Statistical geneticist, Golden Helix
Genome-wide association studies, commonly referred to as GWAS, have in recent years arguably taken a backseat to studies that make use of technologies such as next-generation sequencing, pushing arrays further into the realm of applications such as consumer genomics. But that doesn't mean that people aren't doing GWAS. According to Bryce Christensen, director of services and a statistical geneticist at Golden Helix, the approach is still quite popular, with growth driven by large consortia, new adopters, and new customer segments, such as agricultural research.
Christensen discussed GWAS data analysis in two recent webcasts. In the first, held in December, he proclaimed that "GWAS is not dead" but "alive and well" and said the approach "remains a viable technology for genetic discovery." In that webcast, he discussed data formats, imputation, quality assurance, public databases, and other relevant topics. A second webcast, held last month, focused with greater depth on using data from public sources.
BioArray News spoke recently with Christensen about current GWAS data analysis needs. Below is an edited transcript of that interview.
In one of your talks, you claimed that GWAS is not dead. Was that because you think that most people see it as being dead?
From my perspective, working at a software company that sells tools for GWAS, we continue to get new customers and clients doing that kind of analysis, so it’s clearly not dead. From a broader perspective, you can talk to Illumina, Affymetrix, and others, and they will tell you that their GWAS chip sales are still quite robust. The difference is that we see more and more of the chips going into large consortia efforts. You hear about organizations such as the Psychiatric Genomics Consortium, which is doing a new schizophrenia GWAS with over 100,000 patients. So the profiles of the experiments being done are changing, but the technology is still being used quite broadly.
But haven't GWAS always been conducted by large consortia or projects?
From my perspective, it seems that the consortia are getting even bigger than they were. But also, as a software vendor, we continue to see a lot of people come in who are making their first foray into GWAS. They may be considered the late adopters in terms of the technology curve. There are a lot of people coming in who are either clinicians with some samples [that] they have been banking up over the years, or academic researchers with smaller sample sizes and unique phenotypes. Frankly, GWAS still makes sense for some of them. It's more cost-effective than sequencing and can be just as powerful, depending on the study design. So, it's not dead. It's definitely not looked at as being the first line of action for most gene-finding activities, but there are still a lot of people doing it.
I should add that we see continued growth for GWAS in non-human markets. In agrigenomics, in food crops, and in cattle, sheep, and swine operations, GWAS has seen quite a bit of growth.
Have you done anything differently to address the needs of these agriculture-oriented clients?
A little bit. Last year we incorporated a new mixed model regression algorithm for GWAS that was requested by a large number of our plant and animal clients. We have also had to make a few modifications in the software to allow for non-human genomes in other respects. But for the most part, a lot of the same methods that were developed over the past decade for human GWAS are very applicable to other species. The biggest issue in plant and animal research is the population structure that researchers deal with. A basic assumption of most GWAS methods is that the study samples are drawn from a random-mating population, which is rarely the case in agricultural applications. You might have hundreds of cattle that were all sired by one or a few bulls, which makes it difficult to properly assess the statistical significance of results, and mixed model regression is a nice way to account for that structure.
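The structure correction described above can be illustrated with a toy mixed-model association test. This is only a sketch, not Golden Helix's implementation: the heritability (and hence the variance components) is assumed known rather than estimated by REML as real tools do, and all function names are invented for illustration.

```python
import numpy as np

def kinship(G):
    # G: n x m genotype matrix coded 0/1/2 alternate-allele copies.
    # Center columns and form a simplified realized-relationship matrix.
    Gc = G - G.mean(axis=0)
    return Gc @ Gc.T / G.shape[1]

def mixed_model_assoc(y, g, K, h2=0.5):
    # Generalized least squares test of one SNP g against trait y,
    # modeling relatedness through covariance V = h2*K + (1-h2)*I.
    # h2 (heritability) is assumed known here for simplicity.
    n = len(y)
    V = h2 * K + (1.0 - h2) * np.eye(n)
    Vinv = np.linalg.inv(V)
    X = np.column_stack([np.ones(n), g])          # intercept + SNP dosage
    XtVi = X.T @ Vinv
    beta = np.linalg.solve(XtVi @ X, XtVi @ y)    # GLS estimates
    resid = y - X @ beta
    sigma2 = (resid @ Vinv @ resid) / (n - 2)
    se = np.sqrt(sigma2 * np.linalg.inv(XtVi @ X)[1, 1])
    return beta[1], beta[1] / se                  # SNP effect, Wald statistic
```

With half-sib cattle, K would show large blocks of related animals, and the GLS weighting downweights the apparent evidence those correlated samples contribute, which is the correction Christensen refers to.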
In the earlier waves of GWAS, researchers tended to use catalog arrays, like the Affymetrix 500K Mapping Set, or the Illumina HumanHap550 BeadChip. Now there are many more options, consortium-designed arrays, exome arrays, population-optimized arrays. How has that impacted Golden Helix?
For our company, it definitely introduces interesting questions that come to tech support when customers find out that the workflows they used for the Affy 6 don't seem to hold up when they are working with the Illumina exome chip, for example. For us, it hasn't required any major adjustment other than educating our customer base about the proper use of the newer products and making sure they understand that the chip design is different and, as such, the study design needs to reflect that. They should recognize that with the exome chip, for example, you are going to have a lot of really rare variants occurring below 1 percent frequency in your GWAS population. If you are following a typical common-variant analysis workflow, that is not necessarily going to work very well with the exome chip, and you may have to take a slightly different approach. The tools are in our software; we just have to make sure our customers know how to use them.
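The common-versus-rare split described above comes down to minor allele frequency. A minimal sketch of that triage step, with invented function names and a 1 percent threshold as in the interview (rare variants would then go to burden or collapsing methods rather than single-SNP tests):

```python
import numpy as np

def minor_allele_freq(G):
    # G: n x m genotypes coded 0/1/2 copies of the alternate allele.
    p = G.mean(axis=0) / 2.0          # alternate-allele frequency per SNP
    return np.minimum(p, 1.0 - p)     # fold to minor-allele frequency

def split_by_maf(G, threshold=0.01):
    # Route common variants to a standard single-SNP GWAS workflow and
    # rare variants (MAF below the threshold) to rare-variant methods.
    maf = minor_allele_freq(G)
    common = maf >= threshold
    return G[:, common], G[:, ~common]
```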
Similarly, look at some of the new high-density products, such as the Illumina 5M, with about 5 million markers on the chip. People don't always realize that only about 1 million to 2 million of them will be polymorphic and useful for GWAS-type analysis in a given population. That is something else we have had to educate people about a few times, helping them take advantage of all that additional content, because it is often not obvious how to use it, especially for those late in the adoption curve. The first adopters, the ones who designed these chips, know how to use that content and these tools. But there is a bit of a lag in the user base.
Meta-analyses are increasingly popular, and you discussed how to access public data in one of your webcasts. What has been your experience with public data?
Public data is a vast treasure chest. There is a lot to be learned from it, and it's a relatively cheap way to do some really good research. A lot of people aren’t fully aware of the data that is available in public repositories. We also encounter many people who know what is out there but don't know how to use it. They are either overwhelmed by the bioinformatics expertise required to access and analyze the data, or there are data use restrictions that make access difficult. There are some challenges to using it, but that is also why so many people want to take advantage of it. We shouldn't ignore all of the data that is out there, because there is obviously more to be learned from it.
This leads us to the issue of imputation. If done correctly, imputation can help you achieve what you set out to do, but it also increases the possibility for errors. So what are your recommendations when it comes to imputation?
I have mixed feelings about imputation. In my experience, imputation is especially useful when you have GWAS data from several different platforms and you want to harmonize those platforms to allow for a more thorough meta-analysis. It’s also very helpful when you have a GWAS signal and you want to learn more about what's happening in that region. But there is a common belief that the reason to do imputation is to find something you wouldn't have found in the GWAS data, and, frankly, that doesn't happen very often. When you look at the math and the statistics behind it, it's rare that you will introduce a valid new signal by doing imputation. That is one of the points I made in the GWAS webcast. It’s okay to pursue that path as long as your expectations are realistic.
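To make the harmonization idea concrete, here is a deliberately toy illustration of imputing an untyped SNP from a reference panel: it votes among the nearest reference haplotypes by Hamming distance on the typed markers. Production tools such as IMPUTE2 and minimac use haplotype hidden Markov models instead, and the function name and panel layout here are invented for illustration.

```python
import numpy as np

def impute_dosage(obs, ref, target_idx, k=5):
    # obs: observed 0/1 haplotype alleles at the array's typed markers
    #      (the entry at target_idx is unused).
    # ref: reference haplotype panel (haplotypes x markers) that includes
    #      the untyped target SNP at column target_idx.
    # Estimate the allele dosage at the target SNP by averaging it over
    # the k reference haplotypes closest to obs at the typed markers.
    typed = np.ones(ref.shape[1], dtype=bool)
    typed[target_idx] = False
    dist = (ref[:, typed] != obs[typed]).sum(axis=1)   # Hamming distances
    nearest = np.argsort(dist)[:k]
    return ref[nearest, target_idx].mean()
```

The limitation Christensen raises is visible here: the output can only ever be an average over reference-panel alleles, so variants absent from the (typically healthy) reference population cannot be recovered.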
You also need to seriously consider the content of the reference panels that we use for imputation. This is also a concern that I have about GWAS in general, and something that I see as an important source of missing heritability. The current generation of GWAS chips is based largely on content from the 1000 Genomes Project, and the previous generation was based largely on content from the HapMap project. In both cases, you are testing SNPs that were identified in a relatively small population of presumably healthy people. With imputation, we are usually estimating genotypes based on these same reference panels: genotypes for SNPs that are common in healthy people. To me, that is kind of a disconnect. You get more statistical power if you design the discovery panel using people with the disease. So if you have a GWAS panel based on sick people, instead of one based on a handful of healthy people, you are more likely to find something that is related to the disease. And that's where we see the value in some of these disease-specific chips being developed: the MetaboChip, the ImmunoChip, the PsychChip, and I believe others are in development. We are getting a bit smarter about putting content on the chips that has a prior probability of being related to the disease.
You will be launching a new version of your software, called SNP & Variation Analysis Suite 8, in a few weeks. How does it improve on your previous offering?
The most visible change in version 8 is a major upgrade of the genomic visualization tools. We have fully integrated our GenomeBrowse product, which was originally released about two years ago as a freestanding application for viewing raw NGS data. We have also done a lot of work in SVS8 to streamline and simplify workflows for next-generation sequencing analysis. For GWAS applications, our customers may be excited about the reintroduction of haplotype trend regression, a method for associating haplotypes with quantitative traits that was found in older versions of our software but was absent in SVS7.