Wellcome Trust Sanger Institute
Name: Richard Durbin
Position: Principal investigator, Wellcome Trust Sanger Institute, since 1992
Experience and Education:
— Research Fellow at King's College, Cambridge; postdoc at Stanford University; staff member at MRC Laboratory of Molecular Biology, Cambridge, 1987-1996
— PhD in biology (John White’s group), MRC Laboratory of Molecular Biology, University of Cambridge, 1987
— BA in mathematics, University of Cambridge, 1982
Richard Durbin has been a principal investigator at the Wellcome Trust Sanger Institute since 1992, when the institute was founded. Last year, he resigned from his post as head of informatics in order to focus on human genome resequencing studies using new sequencing technologies.
Together with David Altshuler of the Broad Institute, Durbin chairs the steering committee for the 1000 Genomes Project (see In Sequence 1/22/2008). Two weeks ago, In Sequence visited Durbin at the Sanger Institute in Hinxton, UK, and talked to him about how the project is progressing.
What are your research interests?
What I am interested in now is genetic variation at the sequence level. I have been interested in various things through my career. When I got involved in sequence analysis, I progressively realized how important evolution is, and how important it is to take an evolutionary framework for the analysis of data. For quite a long time, that was comparative: looking at sequences that evolve over a long timeframe, and using that to get at function in protein sequences.
But human variation, variation within species, is intimately connected with evolution on a smaller timeframe. I am quite interested in a picture of recent human evolution, how all the various people in the world are related, and what has led to that. But also, what sorts of things happened, and how that reveals how the genome works and develops.
I am also quite interested in somatic variation and cancer, and how selection works there, to take an evolutionary and developmental view of that.
For now, it seems an obvious thing to me — and actually has done for some time — that we are at a time where human genetics is strong, and is becoming stronger. We can capture the data, we can do experiments, we can sequence more cheaply, genotype more and more accurately and cheaply. The current 1000 Genomes Project is a no-brainer project, as far as I am concerned.
Can you give a quick update on where the project stands, and how the three pilot projects are progressing?
I think they are going well. I think people believe we are on track to complete the data collection of the pilot projects on schedule, or thereabouts. We will certainly finish it before the end of the year, and I think the bulk of the primary data collection will be done well before the end of the year. We are submitting data through to NCBI and EBI, which are the joint data coordinating centers, and they are putting it into the Short Read Archive.
There was a primary data freeze just before the Cold Spring Harbor [Biology of Genomes] meeting in May, at which time 230-odd gigabases were available. The next data freeze has happened, and the data will be released in the Short Read Archive very shortly. That will bring the total amount of data to over 500 gigabases.
Is this data from all three pilot projects?
There is data from the first two, the trios and the low coverage [pilot projects]. And this is just the primary data.
What about the analysis?
We are expecting a release of analysis of at least one of the trios in something like a couple of months. That will be an initial release, which will contain information on the SNPs and genotypes and some other information, probably.
How are you going to annotate 1,000 genomes — are genome browsers going to be able to display them all?
The genome browsers are certainly thinking about how to display this information. Obviously, a lot of things are shared, transferable, and really, the point of the 1000 Genomes Project is not to have specific sequences of a thousand people, it’s to capture what sorts of positions and sequences are variable between people.
I think whereas it may be possible to drill down to see a particular version of something, I doubt that we are going to want to see the 732nd genome. I think it’s more likely that you are going to want to look at a gene or a region of the genome and say, ‘in 10 percent of people, there is an extra piece here, and surprisingly, it contains an extra exon, or there is a deletion which functionally changes a region.’ And I think that at the top level, that’s the way that people are going to want to look at things. There is going to be a reference structure, and you are going to want to see what sort of variation there is in the population.
So far, what have been the challenges with the 1000 Genomes Project? Have they been technical, or have they been coordinating what many different centers are doing?
We are really on the cusp of the technologies becoming available for work on this scale. The technologies have been around for a bit, but to deliver the thousand genomes, we need production scale, with tens of machines running, and it’s only this year that that’s really happening. The systems are still in rapid development, and we are still learning what good formats are and how to transfer the information into them. So scaling up, I suppose, and putting everything together is a challenge.
We heard at Cold Spring Harbor this May of whole individual genomes being sequenced using these technologies. Many people who have been involved in these projects are involved in the 1000 Genomes Project. But scaling up to hundreds of people, and probably thousands of people, is another big step, so the challenges are in developing our ability to do that, and doing it consistently.
So it’s mainly the amount of data that’s challenging?
Yes, and you need to handle things in different ways. One goal of the 1000 Genomes Project is to combine information across multiple people. And that’s something that we haven’t really done at the sequence level on this scale.
We had a yeast sequencing pilot project here that went on for several years. There, we sequenced 70 strains across two species, and we were sharing information between them. That was done with capillary sequencing, so it was also a low-coverage project. But the ways we did that are not going to scale to human — yeast is over 100 times smaller.
We need to learn how to bring together and scale up data, and analyze it and combine it, and how to use paired-end technologies optimally. We are still in an era where almost every few months, there is a whole new way of collecting data becoming available.
I think we made the right decision to have a pilot year. If we look back to last September, many of the things that we are doing now were not actually available in people’s labs, but there was talk of them. Some of the things that we will be doing in six months’ time, again, there is talk of them, and people are sort of playing with them, but they are not in production. Longer 454 reads are a good example, long-insert Illumina libraries, longer reads on the SOLiD.
Has it been difficult to combine data from three different platforms?
I don’t think any of the projects that have been talked about in public yet have done that. The Venter, Watson, and Chinese genomes, and the genome that Illumina has sequenced, which we have been involved in, were each done on one platform.
I think that the 1000 Genomes Project will be novel in that regard. And we have ways, we think, of doing this, but they will need evaluation. I think many people, including the Sanger Institute, have been exploring combined strategies. I think it’s quite achievable. I don’t think it’s a stumbling block, but I think it requires work.
Another aspect that we really haven’t talked that much about that is going to be slightly more challenging than combining data across platforms is assembly of novel sequence. I think there is going to be novel sequence discovered, which requires assembly. I am quite optimistic about being able to do a lot of that with the sequences we obtain from the various types of new technologies. But as a kind of secure fallback position, we can pull out clones which cover novel sequence and sequence them by traditional means if necessary. I think that will be quite challenging, and will take quite a long time.
Do you have a guess for how much novel sequence of the human genome we can expect to find?
There are two sorts of missing sequence: one is gaps in the reference, the other is insertions with respect to the reference, the pieces which were not there in the reference sequence but which are there in some other sample. I don’t think it’s a vast amount in any one sample; it’s probably on the order of a percent, or a few percent.
Besides getting a catalog of variation between humans, what can we learn from the 1000 Genomes Project down the road, and what other sequencing projects will follow it?
One thing that is certainly beyond a catalog is, we are going to learn about the linkage structure of all these variants. Which ones tend to come with each other, the haplotype context. That’s something that HapMap has given us quite a lot of insight into, and we know that those things are important for interpreting genotyping data. I think the 1000 Genomes Project is specifically aimed at and designed to deliver that sort of information on a comprehensive scale.
I think it’s also a substrate for finding out more about sequences being under selection, functional variation in humans, including medical variation. Developing alongside this project is focused resequencing, either in exons or other functional regions of the genome, using these pull-down type strategies. That will be applied, on a large scale, to thousands of medical samples. It is happening already, and will become the predominant thing in the next phase.
For example, the Wellcome Trust Case Control Consortium has a follow-up component [see feature article, in this issue], which we are involved in, where we are doing some of this around the hits from the project that was reported last year. I think there are other such projects.
Also, the third pilot of the 1000 Genomes Project involves this type of sequencing.
Then, I think, we will enter a period, as costs drop even further, of whole-genome resequencing of samples with mostly medical phenotypes.
How are the sequencing technologies going to get there?
I think they will get better and better. I think, realistically, we are at about $100,000 a genome at the moment. I can see a way down to $10,000 in the next few years. I can’t actually see the way to $1,000 a genome right now, but I believe that that will be achieved.
The raw data rate is quite challenging. I think that people are going to have to really work on the image capture — I think image capture will become limiting. The way that the data is captured is through images, and I think it’s really information. The reason why sequencing can get twice as good every year — it’s moving even faster than that right now — is for the same sorts of reasons that Moore’s law has applied in computing. It’s because what you are dealing with is really information, it’s not anything physical.
You can’t make cars twice as rapidly and cheaply every year; you have to end up with a piece of metal, and somebody has to get the metal out of the ground and you just can’t scale those things. Whereas you can scale things where all you are dealing with is information. You can shrink the amount of energy and space and time taken to handle a piece of information. But that means that we are going to have to focus more and more on the information handling side of it for the primary data capture.
I can sort of see the way down to the $10,000 genome, using current CCD cameras attached to these microscope-like devices that we have. I think the off-the-shelf components that we have now will get us there. And you could say, just wait another four years, or however long it takes for CCD technology to develop, and maybe that’s what will happen. But I think we will become more and more demanding in primary image capture. There are all these companies there, and clever people thinking up ways to collect the data, and that’s why I am kind of confident we will get there.
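[Editor's illustration, not part of the interview.] The cost figures above can be related to the "twice as good every year" rate with a short back-of-the-envelope calculation. The sketch below assumes that cost per genome simply halves each year, starting from the $100,000 estimate; the function name `years_to_reach` and the constant-rate model are illustrative assumptions, not something Durbin stated.

```python
import math


def years_to_reach(start_cost, target_cost, improvement_per_year=2.0):
    """Years for cost to fall from start_cost to target_cost.

    Assumes (hypothetically) that cost is divided by
    improvement_per_year once every year, at a constant rate.
    """
    if start_cost <= 0 or target_cost <= 0:
        raise ValueError("costs must be positive")
    if improvement_per_year <= 1:
        raise ValueError("improvement_per_year must be greater than 1")
    # Solve start_cost / improvement_per_year**t == target_cost for t.
    return math.log(start_cost / target_cost) / math.log(improvement_per_year)


if __name__ == "__main__":
    start = 100_000  # per-genome estimate quoted in the interview
    for target in (10_000, 1_000):
        years = years_to_reach(start, target)
        print(f"${start:,} -> ${target:,}: about {years:.1f} years")
```

Under this simplified model, reaching $10,000 takes roughly 3.3 years and reaching $1,000 roughly 6.6 years. A faster improvement rate, which the interview suggests is the case at present, would shorten both figures.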