Investigator, Genome Dynamics and Evolution Group
Wellcome Trust Sanger Institute
Name: Matt Hurles
— Investigator, Genome Dynamics and Evolution Group, Wellcome Trust Sanger Institute, since 2003
— Co-leader, Structural Variation Project, Wellcome Trust Sanger Institute, since 2005
Experience and Education:
— Research fellow in population genetics, McDonald Institute for Archaeological Research, University of Cambridge, 1999-2003
— PhD in genetics, University of Leicester, 1999
— MA in biochemistry, University of Oxford, 1996
Matt Hurles has been studying structural variation in the human genome, in particular copy-number variations, using several methods, including microarrays and, more recently, new sequencing technologies.
In Sequence visited him at the Wellcome Trust Sanger Institute in June and talked to him about his latest projects, and what role second-generation sequencing plays in them.
What is your role here at the Wellcome Trust Sanger Institute?
I run a faculty group here, and I am also one of the principal investigators on the Copy Number Variation Project, together with Nigel Carter and Chris Tyler-Smith. We are changing the name of that project to the Structural Variation Project, though, largely because of the influence of the new sequencing technologies and our ability to detect things like inversions that we could not previously detect using the arrays, which is what we focused much of our work on.
Tell me about that project. When did it start, what is its goal, and how has it changed over time with the new sequencing technologies?
[The] Copy Number Variation [project] started in early 2005, and the goal was, really, to discover [and] characterize structural variation — copy-number variation — in the human genome. To characterize the genomic landscape of it: How much is there, what size is it, how does it break down to different types of copy-number variants?
The aim was to work out the genomic properties, but an explicit aim was also to identify all common copy-number variants to enable us to do genome-wide association studies … in the same way that we can do [using] SNPs.
As the technology has advanced perhaps more rapidly than we predicted, our aims have become somewhat grander. [The goals], at the outset, were very much to identify common copy-number variants and then to demonstrate that they have functional impact, and to do the work that would enable us to integrate them within human genetic studies, disease-association studies, or evolutionary studies. I think now, we are doing far more of that, embarking on association studies, in collaboration with large consortia like the Wellcome Trust Case Control Consortium. We are also doing evolutionary studies as part of that.
It’s really clear from our first-generation map of copy-number variants [in HapMap samples] that we published in 2006 [in Nature], that studying those HapMap samples does allow you to do some interesting population genetics. That is one way of demonstrating functional impact. If you can show that there are copy-number variants under selection, then that suggests that they are having an important functional impact.
What technologies did you use to generate that first map?
That first map was based on an in-house developed clone-based array of large-insert clones, somewhere around 27,000 clones, as well as an early-access version of the Affymetrix 500K array, in collaboration with Affymetrix. We used those two platforms across the initial 270 individuals of the HapMap. We went to a lot of trouble to show that these were not somatic artifacts, that they were really germline variants, and to persuade people that it wasn’t some technical artifact that we were observing, and that they are therefore worth doing genetics on.
Increasingly, we have been trying to annotate that initial map with the copy-number variants that are functionally important, for example, ones that influence gene expression. Together with Manolis Dermitzakis’ group here, we did an association study that was published in early 2007 [in Science], showing that those copy-number variants can have a substantial impact on gene expression.
There are really three ways in which we have been annotating what we think are the functionally important copy-number variants. One is the effect on cellular phenotype, which is gene expression; one is the effect on organisms, which is generally disease traits; and the third is the effect on populations, which is the selection analysis. So [there are] three different levels of biology, but we think there are ways of investigating that biology which are genome-wide, so we can scan across our map and identify those [copy-number variants] that we think are important.
Can you give an example of where copy-number variations play a role in human disease?
There is a really nice example, published in Nature Genetics relatively recently by John Armour and Ed Hollox, where they demonstrate that psoriasis is associated with an increased copy number of beta-defensins, which are a cluster of genes near the end of chromosome 8. That’s one of my favorite associations.
One of the important points is that it’s not a recent observation that copy-number variants can play a role in human disease. For example, it was in the early 1950s when Down syndrome was shown to be [associated with three copies of chromosome 21], but equally, in terms of common diseases, it was in the mid-80s when alpha-globin deletions were shown to protect against malaria. And that’s still possibly one of the strongest associations that we have in the field, and the most robust.
So it’s been about the technological changes, rather than any kind of intellectual leaps to find them and to assay them.
When did sequencing come into play for this project?
We were sequencing to look at the breakpoints of some specific CNVs, but this is old-style sequencing. In terms of new sequencing … Evan Eichler’s paper [in Nature Genetics] in 2005, using the fosmid end-pairs, was an excellent paper and revealed to us the value of taking that kind of approach.
But with the new sequencing technologies, it really started with the collaboration with Illumina, and the analysis of the flow-sorted X-chromosome, which has been presented at a number of conferences, in collaboration with Richard Durbin’s group [at the Sanger Institute] and David Bentley [chief scientist at Illumina] and his analysts at Illumina. So that’s really the first time that we have used the new sequencing technologies. And that’s proved pretty fruitful, and I think it has proved to us that there are a number of different ways you can analyze those data to get information on structural variation.
One is the type of approach taken in the Tuzun et al. paper [in Nature Genetics in 2005], where you look for read pairs that are anomalous relative to the majority, either in terms of their size or their orientation, when they map to the genome. But you can also look at read depth, [which] allows you to pick up things that are hard to pick up with the pairs. For example, if you look at regions that are VNTRs [variable number tandem repeats], you can see an increase or decrease in read depth, but the read pairs are very difficult to analyze because they get confused by the repetitive structure, so you don’t get high-confidence mappings. We can see that the read-depth data is actually very similar to CGH data that we look at on arrays, so we think there is definitely complementary information that you can get from those.
But there is also a third way of getting information, and that’s from using sequence-assembly type approaches, [for example] using sequence assembly across breakpoints to identify the precise location of a structural variant. So we are seeing three strands to the analysis: read depth, read pairs, and assembly.
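As an editorial illustration of the first two strands, here is a minimal Python sketch, not the Sanger group's actual pipeline. It assumes read pairs have already been mapped and reduced to an insert size and an orientation string, and that read start positions are available for a single region. The function names, the `"FR"` convention for a normally oriented pair, the median-absolute-deviation cutoff, and the window size are all assumptions chosen for the example.

```python
from statistics import median


def flag_anomalous_pairs(pairs, max_dev=3.0):
    """Flag read pairs whose orientation or insert size departs from the majority.

    pairs: list of dicts with 'insert_size' (int) and 'orientation' (str).
    'FR' is assumed here to mean a normally oriented pair.
    Returns a list of (pair, reason) tuples.
    """
    sizes = [p["insert_size"] for p in pairs]
    med = median(sizes)
    # Median absolute deviation as a robust spread estimate; fall back to 1.0 if zero.
    mad = median([abs(s - med) for s in sizes]) or 1.0
    flagged = []
    for p in pairs:
        if p["orientation"] != "FR":
            flagged.append((p, "orientation"))  # hints at an inversion or similar
        elif abs(p["insert_size"] - med) > max_dev * mad:
            flagged.append((p, "size"))  # hints at a deletion or insertion
    return flagged


def depth_ratio_by_window(read_starts, region_length, window=1000):
    """Count reads per fixed-size window and scale by the median window count.

    Ratios well above 1 suggest a gain in copy number; ratios well below 1 a loss.
    """
    n_windows = (region_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < region_length:
            counts[pos // window] += 1
    baseline = median(counts) or 1  # avoid dividing by zero in empty regions
    return [c / baseline for c in counts]
```

The read-depth function needs only start positions, which is why it can still give a signal in repetitive regions where pairs do not map with confidence. The third strand, assembly across breakpoints, is not sketched here.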
How does the sequencing technology compare to arrays? How are the two complementary?
[Sequencing] can pick up balanced structural changes, such as inversions and translocations. It can also pick up Alu and LINE retrotransposition events. The other thing that it does, [is] it gives us positional information on where a copy-number variant has gone.
The arrays only tell you that something has changed in copy number in the genome; they don’t give you any location. And that’s [what] sequencing, especially the read pairs, can give you. So you get more information about variants than you could see before, and also information about variants you could not see before. I think there is also the prospect that for a given resolution, they may be cheaper in the near future. But it’s difficult to judge that at the moment. Certainly, there are a number of projects currently in the design phase where we are having an internal discussion about whether we should use arrays or new sequencing technologies.
Are there any advantages that arrays still have right now?
I think at the moment, probably the main advantage is cost, and maybe also throughput. It depends on what level of resolution you want to look at. I don’t think we will be using sequencing for association studies in the very near future because of the need to go across thousands of samples.
What are you planning to do with the new sequencing technologies? Can you mention any projects?
It’s probably a little bit too early. You are obviously aware of the 1000 Genomes Project, and I think that’s going to be an excellent project. There is explicitly a structural variation analysis group in that, which I am one of the co-chairs of. I think that’s going to be an excellent opportunity for many of the leaders in the field to really combine forces to work out how best to analyze these data. I think we will all learn from that experience and take it off into our own projects that we might be doing. So it’s great to see the field coming together on a common dataset like that.
One thing we are not currently planning is any more large-scale discovery projects for structural variants, because we see the potential for the 1000 Genomes Project, and projects like it, to supersede that in terms of discovery in apparently healthy individuals.
We have ongoing discovery projects in disease genomes, and we have quite a large number of different association projects for structural variants going on, again in disease genomes. We see the 1000 Genomes [Project] as superseding a lot of the work that was done previously on characterizing [structural variants] because we are getting a more and more complete catalog of structural variants. And a kind of community effort like that is probably the best way to fill in the gaps that remain in our knowledge.
Are these catalogs of structural variations going to be in public databases?
Yes, they will. That’s something we are very keen on, getting the information into public databases as soon as possible. But also, giving people access to the data that we use to generate the maps as soon as possible.
So for example, with our Nature paper, we put the data on which it is based on our website before we even had that paper accepted, so people could analyze that data. Because it’s such a new field, it’s important that as many smart people look at the data as possible. As we have presented at a number of conferences, we have done a second round of array-based CNV discovery that we think captures the vast majority of common copy-number variants, and in the same way as before, we are going to release those data well in advance of publication, essentially as soon as we are happy with the quality.
But clearly, there are people who are not going to want to download huge data files, such as clinicians, and we are also using much more digested forms of the data, just tracks on genome browsers that show where copy-number variants are. And that’s of immediate utility for clinicians who are interested in, for example, looking at developmental disorders, where they will screen an individual who they think may have some chromosome aberration on a microarray, and what they find is a whole bunch of variants. Of course many of them, indeed most of them, will be common copy-number variants. And using these kinds of maps, they can discount those and then focus on the ones that remain, and try to understand whether they could be disease-causing. So in a diagnostic sense, these maps of copy-number variation in normal individuals are immediately used in the clinic.
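The discounting step described above can be pictured as an interval comparison. The following Python sketch is an editorial illustration only, not a clinical tool or the method used by any particular laboratory. It assumes variants are given as `(chromosome, start, end)` tuples with `end > start`; the function names, the 50 percent overlap threshold, and the coordinates in the usage example are hypothetical.

```python
def overlap_fraction(a_start, a_end, b_start, b_end):
    """Fraction of interval A (half-open, a_end > a_start) covered by interval B."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    return max(0, overlap) / (a_end - a_start)


def discount_common_cnvs(patient_calls, common_map, min_overlap=0.5):
    """Return the patient's calls that are not explained by a common-variant map.

    patient_calls, common_map: lists of (chrom, start, end) tuples.
    A call is discounted when a single common variant on the same chromosome
    covers at least min_overlap of it (an assumed, adjustable threshold).
    """
    remaining = []
    for chrom, start, end in patient_calls:
        known = any(
            c == chrom and overlap_fraction(start, end, s, e) >= min_overlap
            for c, s, e in common_map
        )
        if not known:
            remaining.append((chrom, start, end))
    return remaining


# Usage with made-up coordinates: the chr8 call sits inside a common variant
# and is discounted; the chr15 call remains for follow-up.
calls = [("chr8", 7_000_000, 7_500_000), ("chr15", 20_000_000, 20_400_000)]
common = [("chr8", 6_900_000, 7_600_000)]
remaining_calls = discount_common_cnvs(calls, common)
```

A real comparison would also need to account for the direction of the change (gain or loss) and for the imprecise boundaries that array-based calls tend to have.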