Research assistant professor
Washington University School of Medicine
Name: Michael Wendl
Position: Research assistant professor, Genome Center and Department of Genetics, Washington University School of Medicine
Experience and Education:
— PhD in engineering and applied Science, Washington University, 1994
— MS in mechanical engineering, Washington University, 1990
— BS in mechanical engineering, Washington University, 1989
An increasing number of research projects, including the 1000 Genomes Project and the International Cancer Genome Consortium, are sequencing, or planning to sequence, human genomes using new short-read, high-throughput technologies.
Last month, Michael Wendl and Rick Wilson from the Genome Center at Washington University School of Medicine published a paper online in BMC Bioinformatics in which they assessed how much redundancy the new technologies require in order to comprehensively detect sequence variations, such as somatic mutations in tumor cells.
In Sequence spoke with Wendl last week about their model, and how it can inform large-scale sequencing projects.
Can you briefly explain what you set out to do in your paper?
The short summary of this is that now people are interested in sequencing individual diploid genomes to find variation and SNPs, and in particular, to characterize somatic mutations. The fact that you really have to do diploid sequencing in order to find heterozygous variation means that the previous models that people have relied on for a long time, in particular the Lander-Waterman model, really are not applicable to these new scenarios.
What is driving most of this are these new technologies, like the new generation Illumina machines, and the ABI SOLiD, and all the other platforms that are coming out now. So there really was sort of a lag in the theory to predict what sorts of coverages would be needed for these new types of applications.
From the point of view of coverage, is there a fundamental difference between short-read and long-read technologies for these projects?
The coverage behavior is a little bit different; long reads will generally cover more efficiently than shorter reads. The assembly problem is different as well, because now with the human genome reference, with the new platforms, assembly really means more alignment against a reference rather than de novo assembly. So there are quite a few differences. And what this paper looked at was just the coverage behavior.
What were your main findings?
The main finding, which I think was a little bit surprising to a lot of people, is that the depth that you need to sequence is much higher than what people had thought intuitively. For example, we were used to sequencing, in the old days of BAC clone sequencing and whole-genome shotgun sequencing, to a redundancy of between 8X and 10X.
With the newer platforms, for diploid sequencing, what we found was that the depth required is on the order of about 25X, which is much, much more than the typical 8X to 10X of BAC sequencing. There is a paper in review now about the sequence of the first [acute myeloid leukemia] genome (see In Sequence 5/13/2008). That work relied on the predictions in this particular paper.
Why is so much more redundancy required with short-read technologies?
It has to do with two things: First, with the fact that you are sequencing a diploid genome, so you are covering a particular position on each chromosome.
The other part is that people are now interested in covering a particular position more than one time. In, for example, the Human Genome Project, coverage meant that if there was at least one read spanning a position, that position was considered to be covered.
Now, with this new hardware being capable of giving you much more redundancy, the standard seems to be developing that each position is covered at least twice, rather than just once.
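The difference between the two coverage standards can be illustrated with a short sketch. It is not the model from the paper: it assumes read depth at a position follows a Poisson distribution (the usual idealization behind Lander-Waterman-style estimates) and that reads split evenly between the two chromosome copies. The function name `prob_covered_at_least` is hypothetical.

```python
import math

def prob_covered_at_least(k, redundancy):
    """Probability that a position is spanned by at least k reads.

    Assumption: depth at a position is Poisson-distributed with mean
    equal to the redundancy (an idealized, bias-free coverage process).
    """
    below_k = sum(
        math.exp(-redundancy) * redundancy**i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below_k

for depth in (8, 10, 25):
    # Older standard: one haploid target, at least one read per position.
    once = prob_covered_at_least(1, depth)
    # Newer standard (assumed form): each of the two chromosome copies
    # sees depth/2 and must be covered at least twice.
    twice_on_both = prob_covered_at_least(2, depth / 2) ** 2
    print(f"{depth}X  haploid>=1: {once:.6f}  diploid>=2 on both: {twice_on_both:.6f}")
```

Under these assumptions, the same nominal redundancy gives a noticeably lower probability of meeting the stricter diploid, twice-covered standard than of meeting the older haploid, once-covered one.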
Is that because they have higher error rates than traditional sequencing methods?
It’s partly related to error, and partly related to the fact that the short reads are sometimes harder to align uniquely, because one particular read may align well to several parts of the genome. And I guess trying to cover each position twice, you could think of it, maybe, as a proxy for controlling things like error and difficulty to align.
Do your results apply to all short-read technologies?
This paper, really, was just a very high-level study of the probability of coverage, given only a few parameters, like read length and genome length, and things of that sort. It doesn’t consider any machine-specific aspects, any software-specific aspects, any sorts of bias in the sequence. All of these things would tend to make the coverage process less efficient. So you could think of this paper as describing the ideal coverage process.
So do you assume 100 percent accuracy, or is there an error assumption in there?
Well, it doesn’t really have an error assumption built in. We considered that by, basically, using the theory that we derived as a calibration tool.
For example, in BAC sequencing, people considered 8X to 10X to be enough to compensate for error and bias and all these different sorts of things. So what we did is, we took this model, and we said, in the ideal case, if we were to want to cover a diploid human genome, if we wanted the coverage process to behave as if we were covering a BAC at 10X haploid redundancy, what would we need? So, in a very implicit way, it considers things like error and bias, in the sense that we were relying on the empirical knowledge of having sequenced hundreds of thousands of BACs, and through that, we found that 8X to 10X is about right for haploid coverage.
So if we translate that sort of empirical knowledge into this new diploid sequencing, using the model as a calibration tool, we come up with about 25X to 30X.
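The calibration idea described above can be sketched numerically. This is a simplified stand-in, not the derivation in the paper: it assumes Poisson-distributed depth, an even split of reads between the two chromosome copies, and a "covered at least twice on both copies" criterion for the diploid case. All function names are hypothetical.

```python
import math

def haploid_once(redundancy):
    """P(position covered >= 1 time) for a haploid target, Poisson depth assumed."""
    return 1.0 - math.exp(-redundancy)

def diploid_twice(redundancy):
    """P(position covered >= 2 times on BOTH chromosome copies).

    Assumption: each copy receives half of the total redundancy.
    """
    lam = redundancy / 2.0
    per_copy = 1.0 - math.exp(-lam) * (1.0 + lam)
    return per_copy ** 2

def matching_diploid_redundancy(haploid_redundancy, lo=0.0, hi=200.0, tol=1e-6):
    """Bisect for the diploid redundancy whose coverage probability
    matches that of the given haploid redundancy."""
    target = haploid_once(haploid_redundancy)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if diploid_twice(mid) < target:
            lo = mid   # not enough depth yet
        else:
            hi = mid   # target reached, try less depth
    return (lo + hi) / 2.0

for haploid in (8, 10):
    print(f"{haploid}X haploid ~ {matching_diploid_redundancy(haploid):.1f}X diploid")
```

With these assumptions, matching a 10X haploid target comes out at roughly 27X for the diploid case, in the same range as the 25X to 30X quoted in the interview. The agreement should be read as illustrative only, since the published model is more detailed.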
What are your recommendations for researchers considering diploid human genome sequencing projects?
According to this model, it looks like the redundancy should be in the range of about 25X, which is more than some of the first [human genome sequencing] papers that came out have. For example, the Craig Venter sequence, which I believe was done with traditional 3730 hardware, was about 7.5X redundancy (see In Sequence 9/4/2007). And of course there were a lot of things that were missing. And then, of course, the Jim Watson sequence paper, which just came out not too long ago (see In Sequence 4/22/2008), where they used 454 technology. That, also, was only about a 7.5X sequence.
We kind of feel like the diploid genomes that will be done in the future will be much higher redundancy. I believe a lot of the future projects will cover genomes to a much deeper extent than what the traditional numbers have been. I guess our expectation is that people will take this model and use it in the same context as people have used the Lander-Waterman model for haploid sequencing.
Would it also have applications for the 1000 Genomes Project?
We are now thinking about this in the context of the 1000 Genomes Project. [That project] actually opens up some new considerations, because you are sequencing many, many genomes in the hopes of finding variants that are extremely rare in the population. We are actually now using this model as a stepping stone for a more sophisticated model to try and figure out how to most efficiently find rare variants in a population.
This really boils down to an optimization problem. At one extreme, you can think of sequencing just a few genomes very deeply; at the other extreme, sequencing many, many genomes to a very low redundancy. In either of those cases, you are not likely to find the very rare variants you are looking for. If you only pick a few genomes and sequence them heavily, it’s unlikely that the variant was in any of those to begin with. On the other hand, if you sequence a whole bunch of genomes, but each one at a very low redundancy, you are likely to have the variant in your sample set, but you are just not likely to cover it with sequence.
So somewhere in between, there is an optimum, where you get the highest probability of finding variants. This is the sort of thing where we are going to use this particular model in this paper as a stepping stone, to look at this more difficult problem of finding variants.
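The trade-off described here can be shown with a toy calculation. It is only a sketch under loud assumptions, not the more sophisticated model mentioned in the interview: a fixed total sequencing budget is split evenly across genomes, depth is Poisson-distributed, each chromosome copy gets half of its genome's redundancy, a variant counts as found if the copy carrying it is covered at least twice, and chromosomes are independent draws from the population. The names and the example numbers (a 1000X total budget, a 1 percent allele frequency) are hypothetical.

```python
import math

def prob_find_variant(num_genomes, total_redundancy, allele_freq):
    """Probability that a variant is found in at least one sampled chromosome."""
    depth_per_genome = total_redundancy / num_genomes
    lam = depth_per_genome / 2.0                     # depth per chromosome copy
    p_detect = 1.0 - math.exp(-lam) * (1.0 + lam)    # covered >= 2 times
    p_hit = allele_freq * p_detect                   # present AND covered
    return 1.0 - (1.0 - p_hit) ** (2 * num_genomes)  # 2 copies per genome

def best_genome_count(total_redundancy, allele_freq, max_genomes):
    """Scan all sample sizes and return the one with the highest probability."""
    return max(
        range(1, max_genomes + 1),
        key=lambda n: prob_find_variant(n, total_redundancy, allele_freq),
    )

budget, freq, limit = 1000.0, 0.01, 1000
best = best_genome_count(budget, freq, limit)
for n in (1, best, limit):
    print(f"{n:5d} genomes at {budget / n:7.1f}X each: "
          f"P(find) = {prob_find_variant(n, budget, freq):.4f}")
```

In this toy setting, both extremes (one genome sequenced very deeply, or the maximum number of genomes at 1X each) do worse than an intermediate sample size, which is the qualitative behavior described above.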