A team of researchers at Children's Hospital of Philadelphia and elsewhere has developed a new statistical method that allows scientists to move beyond SNP hits to focus on genetic networks and pathways when doing genome-wide association studies.
The researchers describe the approach, and how it enabled them to discover a key pathway connected to Crohn's disease, in the March 13 issue of the American Journal of Human Genetics.
In the paper, the authors point out that in genome-wide association studies, the most significant SNPs in one study often do not overlap with the most significant SNPs in other studies. In addition, SNPs that are known to be associated with a disease often do not rank at the top of GWAS SNP lists.
The group explored a method that ranks pathways by their statistical significance as a way to tease out more information from a GWAS — an approach they describe as "analogous" to gene set enrichment analysis for gene expression studies.
Using the method with Affymetrix SNP genotyping data from the Wellcome Trust Case Control Consortium, they found a link between Crohn's disease and 20 genes in the IL12/IL23 pathway. Some are known to have a connection to the disease but others do not reach genome-wide significance through single-marker association studies.
The method assigns each SNP to the gene or genes it overlaps or, if there is no overlap, to the closest gene. Each gene is assigned a statistical score, and then a so-called Kolmogorov-Smirnov-like running statistic pulls out the genes overrepresented within a gene set. A correction procedure adjusts for varying gene sizes and linkage disequilibrium between SNPs located in the same gene.
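The steps in that description can be sketched in a few lines of Python. This is a minimal illustration of a GSEA-style running sum, not the authors' software: the function names, the rule of keeping the largest SNP statistic per gene, and the weighting exponent `p` are all assumptions made for the example.

```python
def gene_scores(snp_stats, snp_to_gene):
    """Collapse SNP statistics to one score per gene.

    Assumed rule for this sketch: keep the largest statistic among
    the SNPs mapped to each gene.
    """
    scores = {}
    for snp, stat in snp_stats.items():
        gene = snp_to_gene.get(snp)
        if gene is not None and stat > scores.get(gene, float("-inf")):
            scores[gene] = stat
    return scores


def enrichment_score(scores, gene_set, p=1.0):
    """Kolmogorov-Smirnov-like running statistic over genes ranked by score.

    Walking down the ranked list, the running sum rises at genes in the
    set (weighted by |score|**p) and falls at genes outside it. The
    returned value is the highest point the running sum reaches.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = [g for g in ranked if g in gene_set]
    if not hits or len(hits) == len(ranked):
        raise ValueError("gene set must contain some, but not all, scored genes")

    hit_norm = sum(abs(scores[g]) ** p for g in hits)
    if hit_norm == 0:
        raise ValueError("all genes in the set have a zero score")
    miss_step = 1.0 / (len(ranked) - len(hits))

    running, best = 0.0, 0.0
    for g in ranked:
        if g in gene_set:
            running += abs(scores[g]) ** p / hit_norm
        else:
            running -= miss_step
        best = max(best, running)
    return best
```

A gene set whose members cluster near the top of the ranked list gets a score close to 1, while a set whose members sit at the bottom scores near 0. In practice such a score is usually judged against scores from data with shuffled case and control labels, which is one common way to account for gene size and linkage disequilibrium; the article does not spell out the authors' exact correction procedure.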
BioInform spoke with study co-authors Hákon Hákonarson, director of CHOP's Center for Applied Genomics, and Kai Wang, a computational biostatistician at CHOP who developed the tool, this week. The following is an edited version of that conversation.
What is different about your genome-wide association study?
Hákon Hákonarson: The investigators I have shared the study with so far have been very impressed with the way of teasing out more information from the genome-wide association study than you can under conventional analysis. Here, we basically capitalize on the fact that there is a specific pathway that comes up and there is no hypothesis behind that. It is totally independently driven.
The pathway makes sense since one of the genes that pops up is known to be a key driver [of Crohn's disease]. This method captures multiple members in the pathway, a gene network within that pathway, all of which are associated with the disease.
This gives you much more flexibility, for example, in terms of targeting any of these genes for therapy [and drug development]. Most research has as its focus the receptor, which is known, and [doesn't] necessarily consider these targets. But now we know they are also biologically linked in the pathogenesis of the disease.
Some scientists say that research teams studying complex diseases might, for lack of better tools, tend to pick out favorite genes or just the top-ranked genes. How can your method help with that?
HH: That candidate-gene approach is when people hand-pick a gene because they like it or they think it is a good gene. The algorithm that Kai [Wang] used here doesn't take any hypothesis into account. It totally looks randomly at the genome and asks the question: Are there any clusters of genes anywhere in the genome that associate with this particular disease?
After he identified that pathway, he replicated that in three independent cohorts [a group of 647 patients with pediatric-onset inflammatory bowel disease and 4,250 control subjects; a group of 1,083 pediatric-onset Crohn's disease patients and 2,507 control subjects, both of European ancestry; and a set of 40 African-American Crohn's disease cases and 527 control subjects who were also African-American].
Working without a hypothesis doesn't sound like science.
HH: Pathway analysis is also often done by researchers picking their favorite pathway. This method totally independently comes up with a pathway, independent of hypothesis.
The validation with the additional cohorts perhaps helps see the connection to the disease, but does it also validate the algorithm?
HH: It certainly validates the disease, but I am sure the approach was important to Kai [Wang] to also validate the algorithm.
Kai Wang: We discovered a pathway based on the data of the WTCCC, which is generated on the Affymetrix [GeneChip Human Mapping 500k] platform. For the validation we used several datasets all of which were generated on the Illumina [Human Hap550] platform. So the markers on these two platforms are basically almost 95 percent different.
Normally we wouldn't use different platforms for validating results unless the markers are identical. But for a pathway-based approach, because we are only looking at the genes rather than individual SNP markers, we can validate the results from WTCCC.
The SNPs are different but they target the same genes and pathways. That's why we can replicate the pathway finding with a different platform and with a different ethnic group. The discovery was made using a UK cohort of European ancestry, but in our validation dataset we have a relatively small African-American cohort, allowing us to validate the results in different ethnic groups.
Was it medically relevant to select an African-American group?
HH: The markers used to study Caucasians and African-Americans have been found to have very different prevalence. But most discoveries in the field today have been made in Caucasians and have not been found to play a meaningful role in African-Americans.
But in this case, when looking at the genes in the pathways and not the individual markers per se, we see the same enrichment in the African-Americans. That tells us this pathway is also important in African-Americans but nobody showed that through the individual markers before.
The pathway you have implicated, pathway IL12/IL23, is not an established pathway, right?
HH: You could almost say that the software creates its own pathway, because it tells us: These are the genes that associate with the disease. Then when we looked at the results we found that they belong to these two gene networks, the previously defined IL12 and IL23 pathways.
You caution in your paper that assigning a test statistic for a SNP to its closest neighbor may be incorrect. Can you explain that?
KW: We need to take some caution when interpreting the data. Although in most of the cases the SNPs closest to the gene are targeting this gene, it may not always be the case.
For example, take the SNP between IL23R [a well-known Crohn's disease susceptibility gene] and IL12RB2. Because it is a little closer to IL12[RB2], we think it belongs to that gene. But it is possible the SNP is tagging a common variant within IL23R.
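The assignment rule Wang describes, and the reason for his caution, can be illustrated with a short Python sketch. It is a hypothetical example rather than the tool's actual code: the `Gene` record, the function names, and the coordinates used with it are made up, and all genes are assumed to lie on the same chromosome.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int  # first base of the gene (hypothetical coordinates)
    end: int    # last base of the gene


def distance(pos, gene):
    """Distance in bases from a SNP position to a gene; 0 if the SNP lies inside it."""
    if gene.start <= pos <= gene.end:
        return 0
    return gene.start - pos if pos < gene.start else pos - gene.end


def assign_snp(pos, genes):
    """Assign a SNP to every gene it overlaps or, failing that, to the closest gene.

    Returns (gene names, margin). The margin is how much farther away the
    second-closest gene is; a small margin marks an assignment that deserves
    a second look. It is None when the SNP overlaps a gene or when there is
    only one candidate gene.
    """
    if not genes:
        raise ValueError("need at least one gene")
    ranked = sorted(genes, key=lambda g: distance(pos, g))
    overlapping = [g.name for g in ranked if distance(pos, g) == 0]
    if overlapping:
        return overlapping, None
    nearest = ranked[0]
    if len(ranked) == 1:
        return [nearest.name], None
    margin = distance(pos, ranked[1]) - distance(pos, nearest)
    return [nearest.name], margin
```

With two made-up genes, `GENE_A` at 100 to 200 and `GENE_B` at 300 to 400, a SNP at position 260 is assigned to `GENE_B` with a margin of only 20 bases. That is the kind of borderline call where the SNP could just as well be tagging a variant in the neighboring gene.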
Is this algorithm a tool that can be applied to other genome-wide studies or is it just applicable to Crohn's disease?
KW: It's a general tool and can be used on any disease as long as it is a GWAS study. It's called Pathway Association software. We had a method paper on this software in 2007.
HH: That paper was for the development of the tool and this paper shows the tool that has been slightly improved and also shows its application.
How does this method differ from gene set enrichment analysis developed by Eric Lander and his colleagues?
KW: The original gene set enrichment analysis was developed for microarray data analysis. The approach that we developed is very similar to GSEA but with small modifications to handle GWAS data. In a gene expression study you have one or two probes for each gene or transcript. In a genome-wide association study, each gene may have a few or a dozen or a hundred SNP markers. How to associate SNP markers to a gene is a different issue than in microarray studies.
Another issue is that some gene sets may have a couple hundred genes, another may only have a dozen or 30 genes. How we make it work with sets of different sizes … is another challenge.
In a microarray you get two groups of tissues and can do a simple t-test. For SNP arrays usually we do association tests using genotypic models or trend tests or allelic association tests. The statistics themselves are different, in addition to needing to relate SNPs to genes.
HH: There are a number of significant deviations that have to be built in to be able to use the software on SNP data.
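To make the contrast with a t-test concrete, here is a minimal Python sketch of one of the SNP-level tests Wang mentions, the allelic association test, computed as a chi-square on a 2x2 table of allele counts. The function name and the counts in the usage note are illustrative assumptions, not values from the study.

```python
import math


def allelic_chi_square(case_minor, case_major, control_minor, control_major):
    """Allelic association test on a 2x2 table of allele counts.

    Rows are cases and controls, columns are minor and major alleles.
    Returns (chi-square statistic, p-value) with 1 degree of freedom
    and no continuity correction.
    """
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column must contain at least one allele")
    chi2 = n * (a * d - b * c) ** 2 / margins
    # For 1 degree of freedom, the chi-square tail probability equals
    # erfc(sqrt(chi2 / 2)), so no statistics library is needed.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For example, `allelic_chi_square(30, 70, 20, 80)` compares a minor-allele frequency of 30 percent in cases with 20 percent in controls. A statistic like this, computed for each SNP, is the kind of SNP-level input that would then need to be related to genes before any pathway ranking.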
While this tool has been under development have you helped other research teams use it?
KW: Several scientists have used it to interpret their GWAS results, but I haven't monitored the number of downloads.
HH: Within Children's Hospital, we generate most of the genome-wide association study data. We run samples and analyze the data for them or with them in some sort of collaboration.
The other software Kai developed, called PennCNV, which handles copy number variation [from SNP genotyping arrays], has found users in several hundred academic and industry research centers.
So I think this one is going to catch on the same way. The paper is just out and people are still figuring out what it means. Even though there may not be a huge number of researchers who picked up on the methods paper, this is probably going to be more compelling to them because it has a major disease application.
Do you plan to continue to develop this tool?
KW: I plan to continuously update the software. In addition, since the algorithm has been published, I guess more and more software developers will try to make a similar algorithm and create their own software, with a better user interface, and possibly make it a web-based tool.
What kind of application can this method find in drug discovery?
HH: Let's say a pharmaceutical company is convinced that IL12 or IL23 signaling is important. They would very likely be making an antibody against these receptors.
Now there is an array of 20 targets that open up and all of them show biological association with the disease. It really makes no difference whether the risk contribution from a gene is 10 percent or 50 percent or 100 percent, that individual gene if it is biologically linked to the disease can be as important or more important than a gene that has a risk factor of 500 percent. This opens up the opportunity to target other members of the pathway, which may be better drug targets than the receptor.