
Q&A: Oxford's Mark McCarthy on Using Exome Arrays and Exome Sequencing to Study Diabetes


Name: Mark McCarthy

Title: Robert Turner Professor of Diabetes, University of Oxford

Exome arrays, exome sequencing, and whole genome sequencing are three of the tools researchers are using to uncover the rare variants underlying common diseases. In the case of type 2 diabetes, two large international consortia are using the tools to better define the genetic underpinnings of the disease.

The first, called the "Type 2 Diabetes Genetic Exploration by Next-generation sequencing in multi-Ethnic Samples," or T2D-GENES consortium, is performing exome sequencing in five populations, a deep genome sequencing study of very large multigenerational families, and a fine-mapping analysis of known T2D loci.

Meantime, the "Genetics of T2D," or GoT2D, consortium is looking at low-coverage genome sequence data, deep exome sequences, and genotyping information for nearly 3,000 individuals from four well-characterized European cohorts.

According to Mark McCarthy, a member of both consortia, these large-scale efforts have yet to identify many low-frequency alleles that have a significant effect on type 2 diabetes, results that could lead researchers to even more focused experiments using whole-genome sequencing to uncover very rare variation.

McCarthy, who is a professor of diabetes at the University of Oxford, discussed these results and their implications for future studies during the joint Human Genome Meeting and International Congress of Genetics, held last month in Singapore. BioArray News spoke with him following his presentation at the conference. Below is an edited transcript of that interview.


According to your presentation, these large-scale sequencing efforts have not resulted in the discovery of variants with significant effects on developing the disease.

So far we have done about 4,000 genomes of various kinds, and we have also done about 13,000 exomes across a couple of projects. You immediately face an issue with rarer variants anyway. You start off with less power, because they are rarer. If it was the case that these rare variants would have a much larger effect size, then you would be in a position to see a lot of strong signals in a few thousand individuals. Now, we have seen some signals, and we have seen plenty of evidence of overall signal enrichment and so on, but the bottom line does seem to be that there are not a lot of low-frequency alleles that have a big effect on type 2 diabetes, which is one thing that had been widely hypothesized.

Others have believed, for example, that a complex phenotype like type 2 diabetes would fracture into many different monogenic diseases. Therefore, you might see it happening in the coding variation, which is one of the reasons we have focused on exomes and exome chips, as well as the genomes. Obviously, there are other financial and biological arguments for doing the exomes, but again, we are starting to see a few signals emerging, but it doesn't look as though coding variation is playing an over-dominant role when you get down to low-frequency variants.

We have also done deep sequencing in a number of Hispanic families from Texas to have a go at another hypothesis that is quite popular in the field at the moment, [which is] the idea that a lot of disease-causing variation might be very private and very rare, a reflection of the fact that there has been an explosion in the human population over the last few generations, meaning that we all have a lot of rare private variation. The argument goes that perhaps that is the kind of variation that determines our risk of disease because it is so recent, it hasn't been buffered, it hasn't been selected out. Perhaps common variants have less of a role in disease risk because the very fact that they are common means that they have been around for a long time and tolerated and therefore are less likely to have a large phenotypic effect.

We have been able to test that to some extent in about a thousand people from 20 large families that have been sequenced at depth. And again, we can refute strong versions of that hypothesis. It's always the case that you have to be very careful about what you can exclude, but I think that in all of those areas for which we have initiated sequence-based studies, we can start to exclude strong versions of the particular hypotheses that are bound to that study design.

You also mentioned how these studies are predicated on the model of genome-wide association studies.

A lot of what has been done with sequencing of common diseases follows the GWAS model, because that's what we know how to do, that's what we have got tools to do. That doesn't work badly for low-frequency variants, but when you get down to the rare stuff, particularly protein-truncating variants that might be very interesting from a biological perspective, they also tend to be very rare. And if we cling to the idea that we always have to get those to genome-wide significance, you are going to find very little, because it is very difficult to get a very rare variant to extreme levels of statistical significance. You may have to sequence hundreds of thousands of people to do that. At the same time, we don't want to step back into the bad old days of candidate gene studies, so I think the field does need to think a little bit about how it moves back to some principled fusion of the statistical evidence that a variant or gene matters, and the biological and functional evidence that a gene matters, and it may often require a combination of these two lines of evidence to provide a conclusive story.

Following the big GWAS studies of the past decade, the field seemed to embrace the common disease-rare variant hypothesis and has been using tools like exome arrays and sequencing to find those variants. But based on what you are saying, these tools have not yielded the results people have hoped for.

It's not going to be quite the revolution that some people had foreseen. But first, I would say there was a lot of bad press about GWAS and common variation. There was an underestimate of how much of genetic variation they explained, which for many traits is not too bad, getting up to 30 percent for some traits. That's a pretty sizable part. It's clearly been tougher to translate that through to biological inference because the variants that come out of GWAS tend to be non-coding, of modest effect. But that is slowly being chipped away at, and it has motivated efforts to think of high-throughput ways of doing functional analysis and so on, so I think that will bear fruit. The number of studies that have been done on low-frequency or rare variants still tends to be on the small side, and we are very much in the same stage there as we were in the very early days of GWAS when there were a few hits coming out and it's really taken five or six years to get a more rounded picture of exactly what variants we are dealing with here and how to develop biological inference from those.

It is definitely true that if we can find reliable ways of pulling out rare functional alleles, particularly in coding sequence, but increasingly from noncoding sequences … there is every chance that we can go a bit faster in terms of the inference. You can learn an awful lot about a disease by looking at protein-truncating variants where there is little doubt as to whether the variant is functional and involves knocking out one or two copies of the gene. And then you can make a pretty quick step from the genetics to the biology. The challenge is that when you do those studies on a genome-wide scale, you will see large numbers of apparently functional alleles distributed across the genome, many of them with hints of association. And, going back to what I was saying earlier, it's not trivial then to go about picking out the wheat from the chaff in a principled way. We are putting a lot of energy now into thinking about how we should go about doing that. We don't want to drop our thresholds to a point where we just invite a catalog of spurious claims but, equally, one doesn't want to get stuck with some really unachievable threshold that makes it very hard to move forward on what look to be very strong biological stories. It's about evolving standards for new types of data.

What is your opinion of the exome arrays that are commercially available? Are the ones on the market the best that could be offered?

They are not necessarily the best that could be made now and they are not the best for any given study. People should look carefully at what is on them. I don't think there has been a problem with the design per se, if you accept that most of the genomes and the exomes that were out there at the time of the first designs were derived from Europeans, and so, inevitably, the array picked variants that were dominant in Europeans. Most of the content is pretty darn rare, and it will tend to be ethnic-specific, so it's no great surprise that if you try to type a bunch of Chinese or African-American samples with the array, you will have an awful lot of monomorphic content.

It is also inevitable that this is never going to be a fruitful approach for private variation: that's an experiment that should be performed with sequencing, not an array. We've now done about 55,000 UK samples [using the chips], and I think about two-thirds of the exome content is polymorphic somewhere in the UK. And it's in line with what you would expect given the distribution of exomes and genomes that went into the original study. So, I think the array is fine for what it is. But people should think about the content and if it is suitable for what they would like to do before buying it.

The fact that a million and a half samples of the catalog exome array were sold shows that there is an appetite for them. People were seeing how tough sequencing is, outside of a few specific areas — Mendelian diseases, cancer biology, et cetera. Time will tell what comes out of the exome array. If it turns out that in Northern Europeans it doesn't provide a lot of insight, there are two possible explanations: either coding variants are not important, in which case neither exome sequencing nor exome array is the way forward; or it may turn out that it's really the rare coding stuff that is useful, in which case the sequencing route is the only way you are going to harvest those. And I think that is what will play out in the next few years. Nobody knows the answer to those questions. Anybody who does tell you is just expressing an opinion because the amount of empirical data is pretty limited. Luckily, that is what is being generated now.

Is it sometimes better to run an exome array versus exome sequencing?

Whether it's better to do exome sequencing or exome array totally depends on what you are looking for. The exome array is going to be better for standing variation. You want to put things on there that are reasonably polymorphic, though most exonic variation is pretty rare. You have to accept that in a given sample, quite a lot of those sites will be monomorphic. If you are really after private variation, then you should be doing exome resequencing or targeted resequencing. And the trade off there is that it is a lot more expensive than arrays. You can obviously do more samples with the array than you can do with exome sequencing.

Plus some researchers still say there is a need for statistical power, that sequencing is too expensive, and that they need to run a lot of samples on arrays to obtain that kind of power.

That sums up the discussions that led UK Biobank to invest in genotyping at this stage. This will likely include a redesigned version of a whole-genome genotyping array plus exome content so that we can use the UK data to target the exome in a very efficient way. For example, out of the content that was already on the early versions of the exome array, we could put 100,000 of those SNPs to one side because we haven't really seen them in 55,000 UK [samples]. The value is that, particularly applied on that kind of scale, you can see hundreds of copies of individuals with a rare allele, and that is something that you will need to do to get convincing levels of statistical significance and enough copies to see clarity in terms of adverse effects.

UKBB has partnered with Affymetrix to genotype its 500,000-sample collection. What do you think will be the benefit of that?

These are 500,000 people who have all been phenotyped quite richly in a variety of ways, and on whom there will be genotyping done on the same platform. As a researcher, I'd rather have those 500,000 samples than the other hundreds of thousands we have imputed from different studies as part of various GWAS meta-analyses, simply because I think we are reaching the point where the effect size we are picking up is getting dwarfed by some of the technical heterogeneity in those studies. There is something to be said for doing this in a unified platform where you don't have to worry about the peculiarities of each data set. Here, you have a monolithic dataset.

In the case of UK Biobank, there is already quite a lot of data on these participants, and they will accrete more data in two main ways. First, there is a lot of active analysis of their biosamples and new imaging samples going on. All 500,000, for example, are getting a basic biochemical profile at the moment, and 100,000 of them will likely be MRI'd in coming years. There is a wealth of phenotypic data, and of course there is all the record linkage that will eventually tell us about their medical events and cause of death and so on. That means that, over time, the clinical phenotypes for these individuals will become enriched.

At the same time, if we can lay down a very good set of genotypes, we benefit not only from those genotypes and being able to look across all of those phenotypes, but we also get the benefit that we can impute from the growing amount of UK sequence data … and get an additional tranche of genotypes and that will get better and better. It's the gift that keeps on giving. Not only do we get better genotypic information because of our ability to impute from larger reference panels, we get a more accurate set of genotyping data for those individuals, and they will also become better from a phenotypic point of view.

I think we will have a very rich resource there that will allow us to test various hypotheses about the role of variants. If we see a variant of interest coming out of any of these disease-focused studies, provided it's captured from the array, we can immediately go to a group of people who are well characterized, see if we can confirm the association we can see, but importantly, ask questions about what else that allele can do. That is of particular interest for pharma, which is looking for human validated targets, but is also very keen to understand what the potential long-term effects are of manipulation of a given gene or pathway. What better way to do that? Find the people that have got the allele that does something that you would want to reproduce with a drug and ask, "What else do these people have? Is it a clean phenotype or a dirty one? Are they at risk of other diseases?" That will be of real help for pharma in tailoring targets they want to go after, so there is less chance of creating adverse effects.

How can these tools make an impact on your own research?

Let's start from the point that we know precious little about type 2 diabetes except for the fact that everybody accepts that it is a pretty serious problem. This is partly borne out by the GWAS data where, perhaps in contrast with many other diseases, many of our hits don't seem to map or chart into other bits of known biology. There is clearly plenty to be known about this disease and that hopefully will motivate efforts to treat and prevent it.

We are doing poorly at both. The escalating rates of diabetes around the planet tell us that we can't really prevent it. There was a very nice paper in BMJ recently that showed how rates of diabetes went down in Cuba after the fall of the Soviet Union. Cuba essentially went through an economic collapse: the average weight dropped by several kilograms, people had to walk everywhere and didn't have a lot of food, and diabetes rates plummeted. In the last decade, things have improved from an economic perspective, they have put on weight and their diabetes rates have gone up again. It shows that you could stop diabetes in its tracks, but you might have to foment economic and social crises to do it. If we could make some progress towards better targets and a more rational principled approach toward thinking about new therapies, then I think that would be a really good benefit from this research.
