Q&A: Dana-Farber's Quackenbush Ties SNP Analysis to Networks


CHICAGO (GenomeWeb) – John Quackenbush, a researcher in computational biology and bioinformatics at Dana-Farber Cancer Institute and Harvard Medical School, keynoted the clinical genomics track at last week's Bio-IT World conference in Boston. In introducing Quackenbush, track chairman Leonard Lipovich of the Wayne State University Center for Molecular Medicine and Genetics called him a "visionary" and a "maverick."

Quackenbush did his best to prove Lipovich right by presenting research — some new, some previously published — that attempted to show that other institutions' methods of understanding the functions of single-nucleotide polymorphisms were not completely up to the task.

In his talk, Quackenbush mentioned a newly published article in the journal Nature Genetics in which European and American researchers had linked 52 specific genes to human intelligence. What the study did not find was definitive proof that these genes determined whether an individual was smart; in fact, the researchers said, there likely are thousands of other yet-undiscovered genetic links to intelligence.

"They all have very, very small effect sizes. Each one only explains a tiny, tiny percentage of heritability," Quackenbush said.

That was exactly what Quackenbush was trying to show with his own work, which largely concentrates on expression quantitative trait locus (eQTL) analyses.

"We assume that phenotypes are defined by networks," he explained. That means there is no single, right network for determining genetic relevance, Quackenbush said in a conversation with GenomeWeb prior to his Bio-IT World presentation.

Below is a transcript of the conversation, which has been edited for clarity.

How have you been able to associate SNPs with function?

We hit on this idea of looking at expression quantitative trait loci [eQTLs]. It fell into a broader paradigm of how we think about networks. What most people do is unprincipled, and therefore wrong. They'll look at things like gene expression or differentially expressed genes. They'll project them onto some sort of protein-protein interaction network and then tell a pretty story. I have lots of doubts about whether this is really relevant. Are things that are differentially expressed connected to each other?

It's a question of correlation or causation?

Yeah. I think it even goes beyond that, though. You see an association between things in one state and you're trying to project that onto other states.

There is a whole group of approaches around looking for correlations, so you build these correlation-based networks. I am not sure what is correlated here should be correlated there, and you don't really model causality. We have been building a whole series of methods around the idea that if we think about phenotypes and define them, those phenotypes should have defined patterns of expression. They should have unique patterns of gene regulation. Really, what you end up doing is building individual networks and then comparing them. You're asking yourself, "What does the structure of the network tell me about the underlying biology?" 

So this is a custom build each time or for each type of analysis?

You do it each time, for each type of analysis. We actually developed a method over the last few years where I could infer gene regulatory networks for a population, but then extend that to infer gene regulatory networks for each individual in the population.

The way I would do that is simple. If we had an auditorium full of people, I'd take samples from everyone including you and then infer a network, then I'd set you aside and I'd infer a network for everyone but you. If I look at that, the network is actually slightly perturbed because I left your data out. That small change in the data is actually going to shift the network.

The way we're conceptualizing that work is using what's called an adjacency matrix. Our networks are built around the principle that we have transcription factors regulating genes. We built this network that we can represent as a matrix of connections. By leaving you out, I shift those connections slightly. We even look at networks at that level of granularity. One of my [postdoctoral students] has shown that you can take these network edge estimates and actually use them as biomarkers to predict things like outcome of disease.
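The leave-one-out idea Quackenbush describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain gene-by-gene Pearson correlation matrix stands in for the network inference step (his group's networks link transcription factors to genes, which is not modeled here), the expression data are synthetic, and the function names are hypothetical.

```python
import numpy as np

def correlation_network(expr):
    """Gene-by-gene Pearson correlation matrix; rows of expr are samples.

    Assumption: correlation is a stand-in for whatever network inference
    method is actually used.
    """
    return np.corrcoef(expr, rowvar=False)

def leave_one_out_shift(expr, q):
    """How much each connection changes when sample q is left out."""
    full = correlation_network(expr)
    without_q = correlation_network(np.delete(expr, q, axis=0))
    return full - without_q

rng = np.random.default_rng(0)
expr = rng.normal(size=(20, 5))          # 20 people, 5 genes (synthetic data)
shift = leave_one_out_shift(expr, q=3)   # matrix of per-connection shifts for person 3
```

Each entry of `shift` is the small perturbation in one connection attributable to the left-out individual. Estimates of this kind are what the interview refers to as network edge estimates.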

You are applying genomics to population health? 

Yes, this is genomics for population health but also individual health.

The idea is that the structure of the network itself tells you something about the underlying biology. Our philosophy is there's no single right network.

People always ask, "Well, how do you know the model's right?" I can't answer that question. The question we ask is: "Is it informative, does it tell us something?"

We have a series of methods that we have published or are under review, and what they allow us to do is to infer gene regulatory networks, to infer these single-sample networks. We have a method called ALPACA for teasing out changes in network topology that's much more sensitive than anything that has been published to date.

What is driving this research?

I was motivated in part by the idea that we have lots of SNPs that come from genetic studies that have no known function.

The other thing we see is that there are lots of SNPs discovered in [genome-wide association studies] that have very small effect sizes. If you need 10,000 SNPs to explain 29 percent of height, which we know is heritable, what that tells you is that all these variants are really exerting very small effects.

Are others perhaps giving too much weight to certain genetic profiles?

I think we're trapped in a "Mendel and his peas" [mentality], where we're looking for these single, big effects, even though we're digging deeper and deeper. This study of height had 250,000 individuals. The same group did a study of [body-mass index] with 340,000 people. Ninety-seven SNPs explained less than 3 percent of BMI, while all common SNPs may explain 20 percent of BMI. The answer, people say, is rare variants. There was a study last year in Diabetes. Rare variants explained nothing.

Everybody is looking for these linear effects, and the linear effects just aren't there. 

It's more of a cumulative effect?

It's cumulative, but our questions were, "How do these things work together? Are they additive or are they multiplicative? Is it something else?" 

We started with this idea that we would do eQTLs. It's a simple model. It's a regression analysis. At each SNP position, we have a homozygous common variant, a heterozygous common/rare variant, and a homozygous rare variant. You have three states. Does the gene expression increase or decrease? Or do you need to use mutual information and look for some other association? Do you see an association between genetic variants and gene expression?
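A minimal sketch of the regression described here, assuming the three genotype states are coded as the number of copies of the rare allele (0, 1, or 2) and that expression is modeled as a straight line in that count. The data, effect size, and function name are made up for illustration, and a real analysis would also include covariates and multiple-testing correction.

```python
import numpy as np

def eqtl_fit(genotype, expression):
    """Least-squares fit of expression on rare-allele count (0, 1, 2)."""
    g = np.asarray(genotype, dtype=float)
    y = np.asarray(expression, dtype=float)
    slope, intercept = np.polyfit(g, y, 1)   # degree-1 polynomial = straight line
    r = np.corrcoef(g, y)[0, 1]              # strength of the association
    return slope, intercept, r

rng = np.random.default_rng(1)
genotype = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])   # synthetic genotypes
# Synthetic expression: each rare allele adds 0.8 units, plus a little noise.
expression = 5.0 + 0.8 * genotype + rng.normal(scale=0.1, size=genotype.size)

slope, intercept, r = eqtl_fit(genotype, expression)
```

A positive `slope` means expression rises with each copy of the rare allele, and a negative one means it falls. A SNP-gene pair whose fit passes a significance threshold would count as an eQTL.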

It's a pretty simple model. And we realized we could take the SNPs and the genes and represent them as a bipartite graph.

When we draw graphs, we typically draw balls and sticks. The difference here with the bipartite graph is that we draw squares and circles. The rule is, squares can only be connected to circles and circles can only be connected to squares. In this model, the SNPs don't influence each other and the genes don't influence each other. That's trick number one.

Trick number two is that when we did this, we kept both the cis- and trans-acting eQTLs. A cis eQTL is a SNP that's immediately adjacent to the gene it influences. A trans eQTL is one that operates at a distance. You have these cis- and trans-acting eQTLs and we put them together in these bipartite graphs and then we start to ask, "What does the structure of this graph tell us?"
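To make the two tricks concrete, here is a small sketch of a bipartite SNP-gene graph that keeps both cis and trans edges. Everything in it is hypothetical: the SNP and gene names, the positions, and in particular the 1-megabase window used to decide what counts as cis, which is an assumed cutoff and not one given in the interview.

```python
from collections import Counter

# Hypothetical eQTL hits: (snp, snp_chrom, snp_pos, gene, gene_chrom, gene_start)
hits = [
    ("rs1", "1", 1_000_000, "GENE_A", "1", 1_050_000),
    ("rs1", "1", 1_000_000, "GENE_B", "7", 500_000),
    ("rs2", "2", 2_000_000, "GENE_B", "7", 500_000),
    ("rs3", "7", 600_000, "GENE_B", "7", 500_000),
    ("rs1", "1", 1_000_000, "GENE_C", "1", 90_000_000),
]

CIS_WINDOW = 1_000_000  # assumed cutoff in base pairs, for illustration only

def classify(hit):
    """Label an edge cis if SNP and gene are close on the same chromosome."""
    _, s_chr, s_pos, _, g_chr, g_pos = hit
    near = s_chr == g_chr and abs(s_pos - g_pos) <= CIS_WINDOW
    return "cis" if near else "trans"

# Edges only ever join a SNP to a gene, never SNP to SNP or gene to gene.
edges = {(h[0], h[3]): classify(h) for h in hits}

# Degree = number of edges touching each node, counted per side of the graph.
snp_degree = Counter(snp for snp, _ in edges)
gene_degree = Counter(gene for _, gene in edges)
```

In this toy graph `rs1` touches three genes, so it would sit toward the hub end of the SNP degree distribution.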

When you look at the graph itself, you get a huge hairball, which doesn't tell you a whole lot, but you can start to look at the properties. We focus on the degree distribution for the SNPs. We took all the NIH GWAS in chronic obstructive pulmonary disease and we mapped them and what we found was the hubs are absolutely devoid of GWAS hits. (This study ran in PLoS Computational Biology last year.)

One of the things you see is that the leaves are underrepresented and that there's a general overrepresentation in the middle. Why are the hubs a desert? There are a couple of arguments. One, they're rare so the chance of hitting anything there is small. The other thing we think is that it's survival bias. If you have a variant that's really deleterious that hits here, you just don't see it.

We did this in chronic obstructive pulmonary disease and we saw this over-representation in the middle. We saw this structure of where things were so we asked ourselves, "Does this network have some global structure?"

Communities actually convey a lot of information, and there are methods for identifying communities in networks. We adapted a method for working with bipartite graphs. Basically, you look at the over-representation of links by chance and you maximize a function called the modularity. What we see in this eQTL network in COPD is that it has a very highly modular structure. There are global hubs definitely that are connected to lots of things, but you also have these very, very tight communities.
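The modularity mentioned here can be written down for a bipartite graph in a standard way: for every SNP-gene pair assigned to the same community, compare the observed link with the number of links expected by chance given the two nodes' degrees. The sketch below only scores a given partition and does not search for the best one. It assumes the widely used bipartite form of modularity, which may differ in detail from the adapted method, and the function name and toy matrix are hypothetical.

```python
import numpy as np

def bipartite_modularity(A, snp_comm, gene_comm):
    """Modularity of a partition of a SNP-by-gene 0/1 link matrix A.

    Q = (1/m) * sum over same-community (i, j) of (A_ij - k_i * d_j / m),
    where k_i and d_j are SNP and gene degrees and m is the number of links.
    """
    A = np.asarray(A, dtype=float)
    snp_comm = np.asarray(snp_comm)
    gene_comm = np.asarray(gene_comm)
    m = A.sum()
    k = A.sum(axis=1)                      # SNP degrees
    d = A.sum(axis=0)                      # gene degrees
    expected = np.outer(k, d) / m          # links expected by chance
    same = snp_comm[:, None] == gene_comm[None, :]
    return float(((A - expected) * same).sum() / m)

# Toy network: two blocks of SNPs and genes with no links between the blocks.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])

q_blocks = bipartite_modularity(A, [0, 0, 1, 1], [0, 0, 1, 1])  # matches the blocks
q_single = bipartite_modularity(A, [0, 0, 0, 0], [0, 0, 0, 0])  # one big community
```

The partition that follows the two blocks scores higher than lumping everything together, which is the sense in which maximizing modularity recovers communities.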

Each one of these communities has kind of a local hub. So we started to ask ourselves, "Is there additional information contained in here and what does it mean?" What the model is telling us is that we're organizing genes and variants into functional communities. In fact, this was done in lung tissue in COPD. A lot of these functional communities have very natural interpretations in the context of what we know is happening in the lung.

We moved from the Mendelian single variant, single trait now to a family of variants that influence a process. This is much more consistent with what we see in these genetic studies, where we have thousands of variants, all of which are subtle shifts.

We then looked at these communities and realized that they have local hubs, so we defined a metric we called a core score. It's basically the fraction of the modularity that a single variant carries. We took a meta-analysis of COPD. There are 34 GWAS SNPs in this meta-analysis associated with the disease, and we mapped them to the communities. Thirty-three of these mapped to three communities, with one outlier. Those three communities all have functions that make sense, given what we understand about the disease, but the really interesting thing was that when someone looks at the core scores, the SNPs identified through GWAS have core scores which are 23 times higher than the median for nonsignificant SNPs. 
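One way to read "the fraction of the modularity that a single variant carries" is sketched below: compute each SNP's contribution to the modularity of its own community, then divide by that community's total. The normalization by the community total, the function name, and the toy network are assumptions made for illustration and may not match the published definition.

```python
import numpy as np

def core_scores(A, snp_comm, gene_comm):
    """Share of each community's modularity contributed by each SNP."""
    A = np.asarray(A, dtype=float)
    snp_comm = np.asarray(snp_comm)
    gene_comm = np.asarray(gene_comm)
    m = A.sum()
    k = A.sum(axis=1)                       # SNP degrees
    d = A.sum(axis=0)                       # gene degrees
    B = A - np.outer(k, d) / m              # observed minus expected links
    same = snp_comm[:, None] == gene_comm[None, :]
    contrib = (B * same).sum(axis=1) / m    # per-SNP modularity contribution
    scores = np.empty_like(contrib)
    for h in np.unique(snp_comm):
        members = snp_comm == h
        # Assumes the community's total modularity is nonzero.
        scores[members] = contrib[members] / contrib[members].sum()
    return scores

# Toy network: SNP 0 is a local hub in community 0, SNP 1 is a leaf there,
# and SNP 2 sits alone with one gene in community 1.
A = np.array([[1, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1]])
scores = core_scores(A, snp_comm=[0, 0, 1], gene_comm=[0, 0, 0, 1])
```

In this toy case the local hub carries three quarters of its community's modularity and the leaf one quarter, so the hub gets the higher core score.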

What did this tell you?

If I have a genetic variant, the likelihood that we're going to find it in a genome-wide association study is associated with how central it is to its functional community. That makes sense. If you have a complex disease that's not highly penetrant, you don't fall over and die, but it perturbs functions that lead to development and progression of disease.

We went back and asked how generalizable this is. We spent the last year and a half going back and reanalyzing data from [the Genotype-Tissue Expression project, which has gene expression and quantitative trait loci from 53 human tissues].

We asked what we can do with that gene expression data. We developed a method for normalization and quality control of gene expression across all the different tissues. We found one sample that was misidentified by sex, which is almost the best we've seen in any study. We see as many as 15 percent of the samples misidentified by sex in most studies.

We do a lot of quality control on the data. We compared cell lines to primary tissue. We studied sexual dimorphism in 32 tissues. We looked at tissue-specific gene regulation in 38 tissues. We actually took genetic imaging data because there are histological slides and we showed that genotype can predict imaging features which are related to disease.

We took the GTEx data and we ran our eQTL analysis. We did pretty standard stuff. We imputed the genotypes so we didn't have any missing data. We used standard methods. We looked at our expression quantitative trait loci and then looked at communities. There were 13 tissues that had enough data to actually do this.

What we see is pretty much the same story as what we saw when we looked at COPD. We get these highly modular tissues, highly modular networks that have very intricate structure, and we picked heart left ventricle as an example. What we see across all these tissues is really the same thing.

How did the research community receive this?

We tried to publish this, and one of the referees said, "Is this driven by coexpression?"

The answer is no. The average coexpression in these communities, the R squared, is close to zero. We have some tiny communities where coexpression drives it, but it's really genetics influencing traits.

We looked across the 13 different tissues and we asked, "How much function is shared?" What we found is that a lot of the communities that are enriched for function actually exist across multiple tissues. You actually see communities with [gene ontology] terms that contain genes that overlap significantly, have similar functions, and these appear in every single one of the tissues.

One of the big arguments in the whole eQTL community is how much is shared across tissues. I wrote a grant to try to investigate this and I got two reviews for that application. One said that there's no reason to look at this because everyone knows eQTLs are going to be identical across tissues. The other referee said there's no reason to look at this because everyone knows that eQTLs are different across tissues.

The answer is that most of it is the same, but what's really interesting is we start to see tissue-specific communities. Going back to heart left ventricle, this is designed to show that it's not driven by genes that are all from the same chromosome. We have enrichment in this particular community for functions that are related to the ventricle.

Just like when looking at COPD, when we make that degree distribution and map the genetic variants, what we find is that the hubs would never have disease-associated variation. In every case, we see a significant difference in the average core score between GWAS hits and non-GWAS hits. We're seeing the same story emerge from all of this.

And then we looked at the genetic variants. We started to ask ourselves, "Is there a difference in their functional potential?" It turns out that, yes, the genetic variants at the core of the communities are far more likely than other variants in the network to have a functional association. But even more, when we look at those local hubs that appear in tissue-specific functional communities, they're driven by two things: tissue-specific gene expression, but also, genetic variants that, based on the Roadmap Epigenomics Project, fall into regions of tissue-specific open chromatin.

So what is the significance of your methodology?

What this does is give you a different way of interpreting what we see in genetics. Why I'm so excited about this is that this model shows a structure of a network that is incredibly informative. It tells us that when we look at cis-acting eQTLs, those are shared across tissues. But you also have tissue-specific variants. Those variants tend to be in tissue-specific open chromatin and create their own functional communities that make your liver cells different from your brain cells.

And, when we look at those communities, the likelihood of any genetic variant being disease-associated is tightly linked to how likely it is to perturb function.

We've gone from this mess where we have all these variants and don't know what to do, now to a way to really think about explanatory power of a model for helping to sort out what these variants do. For me, it's just a really different way of understanding this question of regulation, which, to date, I can tell you, we have no thorough understanding of.

How are you going to get that understanding, just with experience?

I think this is already giving us an understanding. This is the first model that points us to something that we can interpret. There are a lot of things we want to do.

It's a new idea and I think it really provides a framework for interpreting biology, and that's something that's missing from so many other models. The reason I get excited about this is that it's really starting to give us insight into the drivers and the functional drivers of genetic variants and is really helping to explain how these small variants together can shift the system.

Since you're doing whole genomes, you ought to be able to find other disease associations without too much extra work.

Exactly. We've done interesting things. We had a collaborator who brought us breast cancer SNPs and couldn't figure out what they do. We took those breast cancer SNPs and we mapped them to the relevant tissue communities, and what we found was that the breast cancer susceptibility variants mapped to two communities. Those communities, not surprisingly, were associated with the cell cycle and with hormone signaling, hormone response. They had been wrestling with trying to put these variants into some kind of context. What we told them is that the variants fall right where you expect them to fall. It points to functional communities that make sense in the context of this disease.

Sometimes the last place you look is the most obvious?

Yeah. We've been looking at a lot of other disease associations, and what we're starting to find is that in the right tissue, the variants are falling into functional communities that tell us a lot about the underlying nature of what we're trying to study.

Can we link this directly to therapies in patients? Not today, but if you think about estimating genetic risk, this at least gives us a framework to start to move in that direction.