Name: Shamil Sunyaev
Title: Assistant professor of medicine, Harvard Medical School, and research geneticist, Brigham and Women's Hospital, since 2002
Experience and Education:
• Postdoctoral fellow, then scientific staff member, European Molecular Biology Laboratory, Heidelberg, 1998-2002
• Research fellow, Engelhardt Institute of Molecular Biology of the Russian Academy of Sciences, 1997-98
• PhD in biophysics, Moscow Institute of Physics and Technology, 1997
• MS in biophysics, Moscow Institute of Physics and Technology, 1994
Shamil Sunyaev's research at Harvard Medical School focuses on analyzing human resequencing data from an evolutionary, functional, and medical genetics perspective, and on developing methods for predicting functional effects of mutations.
Last month, Sunyaev and his colleagues published a paper in PNAS on how resequencing all human exons could enable researchers to discover genes related to human traits.
In Sequence recently spoke with Sunyaev about his study and what it means for planned large resequencing projects.
How did this study come about?
The interesting question is whether sequencing technology will actually revolutionize human genetics — what you can find in many human genomes, compared to genotyping association studies and things like that.
For this to work, three things should be satisfied. One is that most rare variants you see in the genes are not neutral. Because if you see a lot of variants which are not involved in anything, then this is noise, which would eliminate your signal very soon. We thought that population genetic analysis, comparative genomics, and also data on site-directed mutagenesis suggest that this is true, that if you look at protein-coding genes, most new mutations have some function; they are not completely neutral.
The second assumption is that natural selection is not powerful enough to eliminate all of the mutations. Imagine that all new mutations which are functional would be immediately lethal. Then however many people you sequence, you do not see these variants in the population at any frequency. So we argued in that paper that selection is not strong enough, that most of these new mutations are moderately or weakly deleterious mutations, meaning that they can segregate for some time in the human population, and if I sequenced more and more individuals, then I would find them.
And the third [assumption] is that most of the variants would shift phenotype in the same direction, if you think about genes.
Then we were thinking about the theory and feasibility of the design, and the idea that came to us is the following. ... Imagine that you are interested in a specific phenotype, and you have a specific candidate gene, and you can do genotyping, you can find frequent polymorphisms, [which is] what people are doing with association studies. Fundamentally, what changes is that it is likely that your gene doesn't harbor high-frequency functional polymorphisms. Most genes wouldn't. However, if I sequence enough individuals, I will start finding variants. And if the three assumptions are satisfied, many genes would have variants, you would start discovering more and more variants like this. So if there is a gene that is mutated and has an effect on a specific phenotype, if we have sequencing technology, we can make this gene susceptible to genetic analysis, even if it doesn't have high-frequency variants.
The idea was to simulate and to see how many people you would need. Then I started to talk to people around [here at Harvard], ... and the question came up whether you can use that design, not for doing candidate gene studies but doing de novo discovery of new genes and underlying phenotypes. This is what you want to do in human genetics.
Initially I was thinking, ‘This is ridiculous’ because sequencing will never be able to identify new genes and underlying phenotypes, looking for rare variants, because you will have to do a lot of sequencing. And since [then] I [have been] convinced by my colleagues that we should try to see how many people we actually have to sequence to make this possible because the cost of sequencing and of phenotyping will go down, so thinking about thousands of people is not ridiculously silly.
We decided to simulate that, because we had a population genetic model ... Basically, we can model human populations, a big resequencing study, in silico, using computers.
... This is very different from association studies in two respects. The first respect is that we are not asking what genes explain most of heritable variance, but what genes, if mutated, would have an effect on the phenotype. This is like a genetic screen question.
Also, in our simulations, we never assume frequencies of specific genotypes. We model the process of mutation, we model mutation rates, selection, this is a very explicit evolutionary model. So we basically model huge resequencing studies and ask, 'If I have this gene, can I find it de novo in an unbiased fashion from resequencing data, and can I find all genes underlying all phenotypes in sequence data?'
How could the results from your simulation inform future large-scale resequencing studies?
I think for study design, there are two messages here. One message is that it seems it's going to work in principle. So if we have large populations and mature technology, inexpensive sequencing, we are going to find things which we are unable to find using genotyping. I think that this enthusiasm about the development of new sequencing technology has some value for human genetics.
What do you mean by 'large' populations?
There are tons of variables here. But we claim that as soon as you can have roughly 100,000 phenotyped individuals, for some phenotype of interest you can do 5,000 on each tail, [so] you have to sequence 10,000 exomes. This would be sufficient for genes with a half-standard-deviation effect size for new mutations. This is not the fraction of variants explained, this is the effect of new missense mutations, which is very different. This created a lot of confusion in my presentations because the effect sizes I am talking about are not effect sizes people frequently talk about. So for those genes, we can find over 75 percent. And for smaller effect sizes, it's over 40 percent.
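As a rough illustration of the arithmetic in this answer (an editorial sketch, not code from the PNAS study), the snippet below works out what sampling 5,000 people from each tail of 100,000 phenotyped individuals implies. The function name `two_tail_design` is hypothetical, and the cutoff assumes the phenotype is normally distributed, which the interview does not state.

```python
from statistics import NormalDist

def two_tail_design(n_phenotyped, n_per_tail):
    """Summarize an extreme-tail sampling design.

    Returns the total number of individuals sequenced, the fraction of the
    cohort taken from each tail, and the z-score cutoff for that tail,
    ASSUMING a normally distributed phenotype (an illustrative assumption).
    """
    tail_fraction = n_per_tail / n_phenotyped
    # Quantile of the standard normal that leaves `tail_fraction` above it.
    z_cutoff = NormalDist().inv_cdf(1.0 - tail_fraction)
    return 2 * n_per_tail, tail_fraction, z_cutoff

total, fraction, z = two_tail_design(100_000, 5_000)
print(total, fraction, round(z, 3))
```

Under that normality assumption, each tail is the most extreme 5 percent of the cohort, that is, individuals more than roughly 1.64 standard deviations from the mean.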
There are things you can do with smaller samples as well, and we write about what you can do with 1,000 genomes. ... You can find genes of extreme effects, like very large effects about two standard deviations, which probably will be rare variants, those which cannot be found by linkage, but still, there may be some. You can find genes which are extremely long, or you can combine genes by pathways, because the analysis is limited by the amount of variation. So as soon as you have candidate pathways, this may work.
The whole thing works much better for longer genes than for shorter genes. The question is, how many mutations do you see? And this linearly grows with the size of your gene. And since the unit of the association test now is not a SNP, [but] it's a gene, the longer the gene, the [greater] accumulated frequency of mutations you have. This is a very important message there, and this was very visible in our simulations as well.
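The linear scaling with gene length described here can be sketched in a few lines. This is a minimal illustration only: the function `expected_rare_alleles` and the per-site rate used below are hypothetical placeholders, not values from the study, and the sketch ignores selection and differences between sites.

```python
def expected_rare_alleles(gene_length_bp, per_site_rate, n_chromosomes):
    """Expected number of rare-allele observations summed over a whole gene.

    `per_site_rate` is a HYPOTHETICAL average chance that a given site carries
    a rare variant on one sampled chromosome. Because the gene, not the single
    site, is the unit of testing, contributions add up across all sites.
    """
    return gene_length_bp * per_site_rate * n_chromosomes

# Hypothetical numbers: 10,000 sequenced individuals = 20,000 chromosomes.
short_gene = expected_rare_alleles(1_500, 1e-5, 20_000)
long_gene = expected_rare_alleles(4_500, 1e-5, 20_000)
print(short_gene, long_gene, long_gene / short_gene)
```

With everything else held fixed, a gene three times as long accumulates about three times as many rare-variant observations, which is why the gene-based test has more to work with in longer genes.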
[Another] fundamental message is that, for most cases, you are very unlikely to have enough variation for an analysis in a sample of hundreds to make any sense. So these candidate-gene based studies on obesity, blood pressure, lipid level, and so forth, they cannot be repeated from genome-wide data, just because you do not have statistical power to do that.
So the positive message is, ‘Yes, this is going to work, it's going to work in an unbiased fashion.’ I do believe that sequencing will revolutionize human genetics, which is not a trivial question, because if you think in terms of standard association design, you find a lot more rare variants, you have very limited power to analyze rare variants, because if I sequence a population of 1,000 people but I see my variant twice, there is nothing I can do. And also, there is multiple test correction, I have so many variants in the sequence data that if I do millions of independent tests, in order for my test to be significant, it has to have enormous power. So generally, whether sequencing data would change the world was a very difficult question for me. I think what we are trying to argue is that it will, but ... underpowered studies on small samples would not be feasible.
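Two points in this answer, the variant seen only twice and the multiple-test correction, can be made concrete with a short editorial sketch (not taken from the study). It assumes a hypothetical 500/500 case-control split of the 1,000 people, a Bonferroni correction at a 0.05 significance level, one million single-variant tests, and roughly 20,000 gene-level tests. All of those numbers, and both function names, are illustrative assumptions.

```python
from math import comb

def best_case_doubleton_p(n_cases, n_controls):
    """Chance that both carriers of a variant seen twice are cases, under no effect.

    This is the one-sided hypergeometric probability for the most extreme
    possible outcome, so it is the smallest p-value such a variant can reach.
    """
    return comb(n_cases, 2) / comb(n_cases + n_controls, 2)

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

p_doubleton = best_case_doubleton_p(500, 500)        # ASSUMED 500 cases, 500 controls
per_variant = bonferroni_threshold(0.05, 1_000_000)  # ASSUMED one million variant tests
per_gene = bonferroni_threshold(0.05, 20_000)        # ASSUMED ~20,000 gene-level tests
print(p_doubleton, per_variant, per_gene)
```

Even in the best case, a variant seen twice gives a p-value of about 0.25, far from any corrected threshold, whereas testing genes rather than individual variants reduces the number of tests and so relaxes the threshold each test must meet.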
Are you aware of any specific initiatives that are trying to use these results now?
I know NHLBI-funded exon-sequencing projects [plan to]. One of the people behind this initiative spoke to me about using these results for designing these studies. I know some people have plans to sequence very large populations, even for some genes, and see whether this holds. Because what we are claiming is, if you have a gene where you know there is the effect, and you sequence very many people, your p-value becomes 10^-6, meaning that you can find it from a genome-wide dataset, so this would be sort of a proof-of-principle experiment.
How about mutations in non-coding regions?
That's a very important discussion. We have several papers on the variation of coding regions, and I don't want to be misinterpreted. I do believe, actually, that most of the action in the genome is in non-coding sequence. The problem is that the effect of mutations in non-coding sequences, even in sequence at the same level of conservation as protein-coding genes, seems to be weaker. So you lose power of this analysis because you look at weaker mutations.
Second, if I am staying within coding regions, I can combine mutations together because I know they belong to one gene, and this is a functional class. I have very little idea what I do in non-coding regions with this. So I don't think this will be as successful as the coding regions.
We have two different questions in genetics: One is, find genes underlying phenotypes. Another one is to explain genetic variation – so can we explain variation of blood pressure? And if you follow this debate about genome-wide association studies, ‘Where are the remaining variants?’ I think we will not be able to answer this question with this study design, and I don't know how to use sequencing data to answer this question of explaining genetic variation, this Gattaca kind of thing.
So the claim is, it's not that we believe everything is in coding regions, but I think that we can use coding regions, because it's feasible to do, to pinpoint genes.
There are two outcomes, I think, where this works in terms of the design. If I have a fixed pot of money, do I want to sequence few individuals with complete genomes, or many individuals with part of their genome? The second question is whether that part of the genome should be some region of the genome, [whether we should] go by conservation, or functional genomes, or take the exons.
And I think for now, for today, the best use of the money would be to sequence exons in large populations, rather than sequence small populations of the whole thing, or some specific noncoding regions. That comes out of our analysis, at least for this particular application.
In the future, I hope we'll find some way to analyze complete genomes. I don't think people doubt that we will have hundreds or thousands of complete genomes in the foreseeable future. Maybe not next year or the year after next, but within a horizon of 10 years, for example, that is very feasible, and we will probably find some ways to analyze them.