Professor of Computer Science
University of California, San Diego
Name: Pavel Pevzner
— Professor of computer science, department of computer science and engineering, University of California, San Diego, since 2000
— Howard Hughes Medical Institute Professor, since 2006
Experience and Education:
— Professor, departments of mathematics, computer science, and molecular biology, University of Southern California, 1995-2000
— PhD in physics and mathematics, Moscow Institute of Physics and Technology, 1988
— Undergraduate degree in applied mathematics, Moscow Institute of Railway Engineering, 1979
Pavel Pevzner, a professor of computer science at the University of California, San Diego, has been thinking about new methods for the assembly of DNA sequence data since at least 1989. Earlier this month, he and his colleagues described a new algorithm, Euler-USR, for assembling mate-paired short reads, in a paper published online in Genome Research.
In that publication, they also explored whether read length matters for assembling genomes de novo from paired-end reads, and how to reduce errors at the ends of reads computationally. In Sequence spoke with Pevzner last week about his research on genome assembly.
Can you briefly describe what Euler-USR can do, and how it differs from other short read assemblers?
As you know, there was really an explosion of papers on the de novo assembly of short reads in the last year. But the kind of algorithmic developments that would allow us to address this problem actually started 250 years ago. They were started by a great mathematician, Leonhard Euler. As he walked through the town of Königsberg, at some point, he asked the question whether he can arrange a trip in such a way that he visits all seven bridges in the city exactly once. And that gave rise to the so-called Eulerian cycle problem. There are thousands of papers — in computer science, not in biology — written on Eulerian cycles.
This problem has a very close relationship to fragment assembly, and in fact, the first paper covering this relationship was published almost 20 years ago — I published it in 1989 for application to sequencing by hybridization. This technique for fragment assembly did not materialize because experimentally, it was not as accurate as developers of sequencing by hybridization wanted it to be.
We realized that Eulerian cycles and this different type of assembly should also work for short reads. But we realized it well before short reads actually came up: We published a paper in PNAS in 2001 where we argued that even Sanger reads should be assembled with this Eulerian technique. But the Eulerian technique is quite different from the techniques that ruled the world at the time, like the Celera assembly [algorithm], for example. So we basically looked like lunatics at that time because people did not understand that it has advantages. It was very foreign for most people — I don’t think anybody took it seriously.
So we were very pleased to see that this year, all newly developed assemblers — Euler-SR, Velvet, Allpaths, Newbler — are Eulerian assemblers. And the traditional Celera-like approach seems to be forgotten now; there are no new developments along the lines of that.
And our new paper, written with Mark Chaisson, a great postdoctoral student in my lab who pioneered the algorithmic study of short-read assembly, basically follows this paradigm. He published the first paper on this in Bioinformatics in 2004, before short reads came out, in preparation for short-read arrival. By that time, it was clear that they will arrive.
If you look at these papers published this year, they are quite similar, ideologically. I think it became completely clear that the Eulerian approach works for assembly; it was not quite clear how to deal with these mate-paired [reads]. And what [our] paper that just came out shows is that once again, everything in this world is not new, that the approach that we proposed in 2001 and published in Bioinformatics in 2004 actually works very well for mate-paired information. And the way the Eulerian approach processes mate pairs actually turned out to be extremely useful for assembly. And it is so good that it even raises the question: ‘Does the read length matter?’
Eulerian setup for fragment assembly does the following trick with these mate pairs: it takes a mate pair — a read on the left, a read on the right, and nothing in between except for the known distance between them, but you don’t know what’s exactly between them. The Eulerian approach actually fills in the gap — it tells you what is inside. Now, if you can fill in inside, then instead of short reads you immediately have a long read. Then, you just transform every mate pair, if the distance between reads is, let’s say, 1,000 nucleotides, into a virtual read of length 1,000 nucleotides. And then you are free to assemble these long reads. If you can do this transformation, then the read length does not matter. The only thing that matters is the span between mate pairs.
But this argument only holds if you really can do this transformation for every mate pair. And that’s where the intricacy of this algorithm comes in. You indeed can do this transformation for the overwhelming percentage of mate pairs, but at some point, some mate pairs become very difficult to generate this transformation for. We basically describe an algorithm to efficiently transform mate pairs and investigate in which cases this transformation is possible, and in which cases it is impossible. And what we found is that there is a certain thing that we call ‘read length barrier.’ After this length barrier is reached, further increasing the length of the read doesn’t really help, because the efficiency of this transformation is not improving. It’s already reached the maximum of what is possible theoretically.
So for practical applications, what is the read length barrier for different kinds of genome sizes or genome complexities?
The surprising thing that came out is that assembly of bacterial genomes with short reads results, actually, in very good assemblies, even with 35-nucleotide Illumina reads. Ilumina already has 75-nucleotide reads, but this progress from 35 to 75 nucleotides, with the currently most popular distance between mates — which is around 200 to 250 nucleotides — actually won’t lead to a significant increase of the assembly. The assembly is already very good.
So when you ask the question, ‘What is the read length barrier?,’ then this read length barrier is also dependent on the distance between mate pairs. For every distance between mate pairs, there is a read length when increasing the length doesn’t matter. But of course, the read length barrier, as we mention in the paper, depends both on the distance of the mate pair somebody chooses for the experiment and the length and complexity of the genome.
And this remains to be further investigated. For example, a very intriguing question is, ‘What is the read length barrier for mammalian genomes?’ If we prove that the read length barrier, let’s say, for a mate pair length of 5,000 nucleotides, would be 100 nucleotides — and I don’t rule out that it may be close to 100 nucleotides — imagine then an Illumina assembly with short reads of length 100 nucleotides will be nearly as good as an assembly with superlong reads of 5,000 nucleotides, and it starts to become a computational question, to some extent.
What is your best guess today for what the read length barrier is for the human genome?
That’s the problem we are currently working on, and it is a difficult problem for the following reason: Currently, all assemblers described so far need to be further improved to start assembling mammalian genomes. Therefore, we cannot simply run our assembler until then. … But many people are very interested in this, and of course, companies are very interested in this. I got enormous interest from companies developing short-read technology in answering this question, and that will be quite interesting.
And that can explain why I’m not convinced that as soon as long-read technologies like Pacific Biosciences arrive, they will immediately bury short-read technologies. I don’t think it will be absolutely clear that it will be overnight. Now, people somehow feel that the main advantage of the technology of Pacific Biosciences is the read length, and it is indeed a great advantage, but this advantage is mainly for assembly or for haplotypes.
Let’s focus on assembly, because for haplotypes, this technology indeed has an edge over short-read technologies. The key question is, what is the cost for, let’s say, short-read technologies and long-read technologies? If long-read technologies have lower cost, then they win anyway, because people will simply start to do mapping and resequencing with long reads.
But I suspect there will be a period of time, and maybe a long period of time, when the cost of long-read technology will be comparable to the cost of short-read technologies. And during this time, it is not clear to me who will be winning, because during this time, I am sure this idea of read length barrier will materialize and people will start assembling complex genomes with short reads, maybe using various libraries. Computational aspects will be worked out very quickly within a year, maybe. And then people will see that assembling large genomes with short reads is not an impossible goal. And then the only thing that matters is cost. It doesn’t matter that short reads are short, because we can transform them into reads that are arguably longer than the long reads produced by third-generation technologies. That’s my view — I don’t think that this introduction of long reads will immediately make a difference.
And long-read companies, I think, will have some homework to do to really demonstrate that the technology has great advantages in de novo sequencing. And also, it is not entirely clear how to deal with the error rate of this technology. There should be proofreading, let’s say through circularization as Pacific Biosciences performs on paper, but the original reads are quite inaccurate.
You mention in your paper that your approach could help those technologies that have poor error rates by helping to correct the reads. Can you explain that?
This paper, actually, has two messages that are completely different. The first message is how to deal with these mate pairs. The second message is, we tried to push the benefits of short-read technologies to the limit, and we asked the question, ‘Suppose technologies generate short reads that are too short — below the read length barrier — can you do anything?’
And what we saw is that, actually, there is a possibility to extend the read length computationally, rather than experimentally, by more intelligent processing of sequencing data. If you look at Illumina reads, the ends of Illumina reads are very inaccurate, with a 20-percent error rate. And we demonstrated that we actually can fix errors, computationally, in the ends of the reads, and make them virtually error-free.
For example, suppose that you work with Illumina reads of length 35, and your read length barrier is 50, so it looks like you cannot do anything. But Illumina reads are actually not 35 — the reaction can be extended. The problem is that the error rate became so high that no tool today can work with such huge error rates. Our tool, for the first time, allows you to work with these reads that have an enormous error rate in some portion, let’s say 20 percent, but it may be even larger than 20 percent. So this is a solution for how to make extremely error-prone sequencing work. And it’s also maybe useful for long-read technologies, indeed, because their original reads are not as accurate.
What is this error correction based on?
Once again, we have to go back in history. The error correction procedure is not new. It was proposed in our 2001 PNAS paper. Unlike the Eulerian idea, which was not immediately picked up, the error correction idea was immediately picked up. Now, essentially every assembler uses the error correction idea that we proposed in 2001. And this idea essentially does the following: Suppose you have an error in the read. In the past, people tried to use excellent programs like Phrap to get rid of this error through quality values, and it was very useful for Sanger reads. But arguably, even more useful would be to use other reads to correct errors in this read because other reads provide proofreading of the same position many times. So back in 2001, we demonstrated that this proofreading can dramatically reduce errors in original reads, by a factor of 50.
So it requires a certain amount of coverage, basically?
It does, but you have all the coverage in the world when you run short-read sequencing projects. And once again, all the tools for short-read sequencing also use error correction.
So for the users of the Ilumina or SOLiD systems today, do you recommend they should extend the reads beyond what the companies recommend?
Our tool will allow them to extend the reads, but extension of the reads only makes sense for people who do de novo assembly.
Have you, or has anybody, applied your algorithm yet to assemble a large mammalian genome?
No. They are currently working on plant genomes. There were no published reports yet on real de novo assembly of mammalian genomes. But there is very good progress in this direction, for example at the University of British Columbia (see In Sequence 8/5/2008), but we are still waiting for the papers to come out on this.
But when I talk about real assembly, I mean taking the genome of the elephant, and I say that will be a real de novo assembly of a mammalian genome. With human, it will be always not exactly de novo; it will be just filling in the gaps, which will also be very useful. But in this sense, de novo assembly of the human genome, which means sequencing of variable regions of the human genome, is definitely a possibility even today. It requires, basically, technical developments related to memory required for this application, and other things.
What is your prediction for next year? What’s going to happen with the assembly of large genomes?
I think next year, people will really start assembling large genomes with short reads. And it will be either experimentation [with] these as a distance between mate pairs, or of these two-level libraries of different sizes, but I see no reason why short-read assembly with these carefully designed paired-reads should not be as powerful as Sanger assembly. And as you know, people can assemble genomes with Sanger reads.