Name: Steven Salzberg
Title: Director of the Center for Bioinformatics and Computational Biology and professor of computer science at the University of Maryland, since 2005
Experience and Education: Senior director of bioinformatics at the Institute for Genomic Research, 1997–2005
Assistant and Associate Professor of computer science, Johns Hopkins University, 1989–1998
PhD in computer science, Harvard University, 1989
Master's in computer science, Yale University, 1982
BA in English literature, Yale University, 1980
Steven Salzberg developed one of the first computational gene-finding programs for the human genome, Morgan. He has also developed numerous other gene-prediction programs that have been used to analyze hundreds of bacterial, viral, plant, and animal genomes. More recently, Salzberg and his research team have also developed a suite of algorithms compatible with short-read sequencing technology for genome assembly, alignment, and analysis, all of which are open source and freely available.
In a perspective published in Genome Research last month, Salzberg highlighted the challenges of assembling large genomes with next-generation sequencing data. In Sequence spoke with him recently about these challenges, the quality of genome assemblies, and the algorithms and tools used to assemble genomes.
What are some of the problems with short-read assembly?
The problem of genome assembly itself hasn't fundamentally changed. You still sequence the genome by doing whole-genome shotgun sequencing: you break it into lots of tiny fragments and you sequence those fragments. What has changed is that the sequence read lengths are shorter than they used to be. The other thing that's changed is that the ability to capture paired ends has declined. There are paired-end protocols from the major sequencing companies but they don't work very well.
To compensate for that, we are collecting deeper coverage. Sequencing is so cheap that you can actually get 50x coverage for much lower cost than you could get 8x coverage using the older technologies.
But one problem is, the deeper coverage only partially compensates for the other deficiencies in read length and paired-end information. Paired ends are the bigger problem. So these two things — read length and paired-end information — are critical in producing a good genome assembly, no matter what software you're using.
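The coverage figures mentioned above come from simple arithmetic: total sequenced bases divided by genome size. The sketch below only illustrates that calculation; the read counts, read lengths, and genome size are hypothetical numbers chosen to land on 50x and 8x, not figures from the interview.

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


# Hypothetical inputs, chosen only to reproduce the 50x and 8x depths.
genome_size = 3_000_000_000  # roughly mammalian-sized genome (assumption)

# Many short reads (assumed 75 bp each).
short_read_depth = coverage(2_000_000_000, 75, genome_size)

# Far fewer long reads (assumed 800 bp each, Sanger-like).
sanger_depth = coverage(30_000_000, 800, genome_size)

print(f"short reads: {short_read_depth:.0f}x, long reads: {sanger_depth:.0f}x")
```

Depth alone says nothing about whether any single read or pair spans a repeat, which is why deeper coverage only partially compensates.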
What's wrong with the paired-end protocols?
Well, all the vendors offer it, but then when you see what they actually produce, it's not that good. When you assemble a genome, anywhere there is a repetitive sequence, the assembler is likely to create a gap, especially with short reads. All genomes have a substantial amount of repetitive DNA. Everywhere there is a repeat that is longer than the read length, you get a break between contigs in the assembly. The shorter the read length, the more breaks.
With Sanger reads we were getting read lengths of around 800 base pairs. The most common repeat [in the human genome] is about 300 bases long and is called an Alu repeat. They're everywhere, but because the read lengths were longer, you would just read right through them, so you could place them, because on either end of the Alu you'd have unique sequence and it would all be in one read.
With paired ends, you have a longer fragment of DNA, and you sequence both ends. Then, you keep track of what the length of the fragment was, which tells you how far apart the two reads should be when you finally produce the assembly.
So, if you have paired ends, then you can say, 'OK, I have a read and it's repetitive, but its mate is not repetitive, so I can place that one uniquely. Once that one's been placed in the assembly, I can use that information to place the repetitive end.'
All modern assemblers use this information extensively in trying to put the genome together. So in general, we want paired ends that are as far apart as possible because the further apart they are, the bigger the repeat they can span.
In mammalian genomes, there are repeats on the order of 1,000 to 3,000 base pairs long, and there are lots of them. You need a mate pair library that is several thousand base pairs long at least, and the only protocols that make paired ends that are 3,000 base pairs apart involve the circularization of DNA.
The problem is that DNA doesn't want to circularize when it gets that long. It's hard to do. So you don't get very many circles, and then there's an amplification step. So you end up sequencing the same circle more than once. There's very high redundancy in the library.
If you read the panda genome paper [published in Nature in December by BGI] and what they say about their long-range paired ends, they say only 25 percent or less of their paired ends were unique. And we've seen cases where you may have much less than that. You may have one tenth or one twentieth.
So, the longer fragments are better for assembly, but the longer the fragment, the more difficult it is to circularize. Someone needs to come up with a better protocol.
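The redundancy described above can be expressed as the fraction of distinct pairs in a library. The following is a minimal sketch under a strong assumption: two pairs count as copies of the same circle when both mates map to identical coordinates. The function name and the example coordinates are hypothetical.

```python
def unique_fraction(pairs):
    """Fraction of mate pairs that are distinct.

    Each pair is (scaffold, left_mate_pos, right_mate_pos). Pairs with
    identical coordinates are treated as amplification duplicates, which
    is a simplifying assumption.
    """
    if not pairs:
        return 0.0
    return len(set(pairs)) / len(pairs)


# Hypothetical library: 8 sequenced pairs, only 2 distinct fragments.
library = [
    ("scaf1", 100, 3100),
    ("scaf1", 100, 3100),
    ("scaf1", 100, 3100),
    ("scaf1", 100, 3100),
    ("scaf1", 100, 3100),
    ("scaf2", 500, 3400),
    ("scaf2", 500, 3400),
    ("scaf2", 500, 3400),
]

print(unique_fraction(library))  # 2 distinct pairs out of 8
```

A low value means most of the sequencing effort went into re-reading the same fragments, so the effective long-range coverage is much smaller than the raw pair count suggests.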
So the problem with assembly is less a problem of the assembler algorithms and more a problem with the paired-end protocols?
Well, it's a problem for the assembly algorithms. We're still being asked to assemble genomes with this data and we're doing the best we can, but there's an inherent limitation in the data and we can't magically fix that. If you have a repetitive sequence and you don't have any paired end information that crosses that repeat, then you simply don't know what goes on either side of it. Despite that, we're assembling genomes left and right.
Are the genomes that are being assembled of a good quality?
We're getting some assemblies that are surprisingly good compared to how short the reads are. They're not good compared to the Sanger assemblies, but they're good enough to be useful for the biologists working on the problems.
We're at a point right now where none of the assemblies are comparable to, say, the dog genome from five to six years ago. The dog genome was done with Sanger sequencing. Of the genomes done using short reads, the panda is probably the best, and really the first mammalian genome we've done entirely with short reads. It's a useful assembly in that you can find most of the panda genes. But it's in an awful lot of pieces, something like 200,000 contigs.
It's highly fragmented, but it's not clear you're going to do a lot better than that with short reads. With better mate pair information, more mate pair information, and longer reads, it will get a little better.
When evaluating an assembled genome, how do you determine its quality?
We look at N50 lengths as our first metric — the N50 [for] contigs and scaffolds. The contigs tend to be really short with these next-gen short-read assemblies, so those lengths aren't very impressive. If you have enough paired-end information, the scaffolds can be quite large. You might have contigs that are 10,000 base pairs or less, which is pretty short, but scaffolds that are hundreds of thousands of base pairs long. If you have enough mate pair info and deep enough coverage with mate pairs, then you can make pretty big scaffolds.
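For readers unfamiliar with the metric, N50 is the length such that contigs (or scaffolds) of that length or longer contain at least half of the total assembled bases. The sketch below shows one common way to compute it; the contig lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths.

    Walk the lengths from longest to shortest and return the first
    length at which the running total reaches half of all bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Hypothetical contig lengths in base pairs.
contigs = [10000, 9000, 8000, 7000, 6000, 5000, 4000, 3000, 2000]
print(n50(contigs))
```

The same function applies to scaffold lengths, which is how one assembly can report a short contig N50 alongside a much larger scaffold N50.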
The other thing we look at, which is sometimes not given sufficient weight or attention, is correctness. It's harder to measure, which is why it's sometimes glossed over, but it's incredibly important that the assembly be correct. Otherwise, what's the point?
With a draft assembly it's very hard to know if an assembly is correct. Assemblers are very complicated programs, and they work differently in different people's hands. It's like if you're flying an airplane, say. It's a very complicated machine; there [are] a lot of things you can do to fly it differently.
Assemblers are like that. They are huge programs with hundreds of modules and many parameters you can adjust. You can make them more aggressive at putting together bigger contigs and scaffolds, and it seems like that's what you want. But as you do this it starts to make errors. It puts together contigs that aren't representative of the true genome. And if you don't have a reference genome, you don't have any way to check it. So you have to be careful.
We look for independent sequence information or marker information to validate an assembly. For example, you might do RNA-seq. The transcripts ought to map across the exons, and align in a consistent way. All the exons should be in the right order and orientation. If you find that they're inverted or split into completely different scaffolds that can't possibly fit together, then you know you've got a problem. In big genomes there will always be some cases where the transcripts indicate assembly problems. RNA-seq is a useful technique for doing validation of the assembly.
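The transcript check described above can be sketched as a simple consistency test. This is an illustration only, not the group's actual pipeline: it assumes each exon of a transcript has already been aligned, and that each hit is recorded as a hypothetical (scaffold, start, strand) tuple listed in transcript order.

```python
def exons_consistent(hits):
    """Check that a transcript's exons align in a consistent way.

    hits: list of (scaffold, start, strand) tuples in transcript order.
    Returns False when exons land on more than one scaffold, on mixed
    strands, or out of order for their strand.
    """
    if len(hits) < 2:
        return True
    scaffolds = {scaffold for scaffold, _, _ in hits}
    strands = {strand for _, _, strand in hits}
    if len(scaffolds) != 1 or len(strands) != 1:
        return False
    starts = [start for _, start, _ in hits]
    # On the minus strand, transcript order runs from high to low coordinates.
    expected = sorted(starts, reverse=(strands == {"-"}))
    return starts == expected


good = [("scaf7", 1000, "+"), ("scaf7", 2500, "+"), ("scaf7", 4000, "+")]
shuffled = [("scaf7", 1000, "+"), ("scaf7", 4000, "+"), ("scaf7", 2500, "+")]
split = [("scaf7", 1000, "+"), ("scaf9", 2500, "+")]

print(exons_consistent(good), exons_consistent(shuffled), exons_consistent(split))
```

A False result is a flag for manual review rather than proof of a misassembly, since a transcript can legitimately run off the end of one scaffold and onto another.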
How do the different assembly algorithms compare?
The algorithms are not comparable. They're quite different. One thing people should be aware of is that some assemblers will only handle one type of data. Newbler will only handle 454 data. SOAPdenovo only handles Illumina, and Velvet also only handles Illumina.
The Celera [assembler], which we use, will handle all data types. It's currently the only one I know of that will handle a mixture. It's not the fastest assembler, but it's a good assembler. If you're using SOLiD data, none of the assemblers work. Applied Biosystems claims to have an assembler of their own that will assemble genomes with SOLiD data, but there isn't an open source or freely available assembler that will assemble SOLiD data.
The assemblers available are different depending on what data you choose to use and on the size of the genome. For bacteria, in our experience, Velvet is better than SOAPdenovo. For bigger genomes, Velvet just doesn't work. It runs out of memory and crashes.
There's also an assembler called ABySS. It's in the same category as SOAPdenovo: it works on big genomes from Illumina reads. We haven't run ABySS ourselves. But compared to what's been published about ABySS, SOAPdenovo is a little better.
If you really want the best assembly, you can run multiple assemblers. If we have a 454 data set, we use both Celera and Newbler, then compare, and maybe try to merge them. We also do that with Illumina data. We run the Celera assembler and SOAPdenovo. If it looks like one is better on some aspects of the assembly, and the other is better on other aspects, we can try to merge the two assemblies.
Also, we don't just run the assembler once. We run it multiple times for a genome because it has many parameters that you can adjust. One of the things we do is adjust the parameters to make the assembler more or less aggressive about putting together contigs and scaffolds.
We also find that in a large majority of projects, there are problems with the data. We discover, for example, that the 8-kilobase library is actually only 3 kilobases apart. So the assembler messes up because you told it that it was 8 kilobases apart, and it's not. So, we go back and figure out what is the real mate-pair distance, and set that to be the correct distance. Or we find reads that are oriented the wrong way, or chimeric reads, so we delete those reads, or fix them. You can start these assemblers in the middle and go forward, so we can rewind to the place where we have to, and then restart. I wish it were real simple, that you just pressed the "go" button, but that has never happened.
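One way to catch a mislabelled library like the one described above is to measure the mate-pair distance directly from pairs whose two reads land in the same contig. The sketch below is a rough illustration under that assumption; the function name, the 25 percent tolerance, and the coordinates are all hypothetical.

```python
from statistics import median


def estimate_mate_distance(mapped_pairs):
    """Estimate the real mate-pair distance of a library.

    mapped_pairs: list of (left_start, right_end) coordinates for pairs
    whose two reads map to the same contig. The median is used because it
    is less sensitive to chimeric or misplaced pairs than the mean.
    """
    spans = [right - left for left, right in mapped_pairs if right > left]
    return median(spans) if spans else None


nominal = 8000  # what the library was reported to be
pairs = [(100, 3050), (5000, 8100), (12000, 14950), (20000, 23000), (40000, 43020)]

observed = estimate_mate_distance(pairs)
# Flag the library if it is off by more than an assumed 25 percent tolerance.
library_mislabelled = abs(observed - nominal) / nominal > 0.25

print(observed, library_mislabelled)
```

The estimate is only as good as the contigs: pairs can be measured only when both reads fall in one contig, so short contigs bias the observed distance downward.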