As next-generation sequencing instruments gain traction in the market, a number of research groups are busy developing new assembly algorithms and analysis tools to handle the unique challenges of the data coming off these machines.
One such effort, at the Department of Energy’s Joint Genome Institute, is using a whole-genome shotgun assembly program called Forge to assemble reads from 454 Life Sciences’ Genome Sequencer.
Darren Platt, head of the informatics department of JGI, developed Forge, which was used to assemble the Neanderthal genome published by researchers at JGI and the Max Planck Institute for Evolutionary Anthropology in Science last November.
BioInform recently caught up with Platt to discuss Forge, as well as his general outlook on the state of bioinformatics for next-generation sequencing. A transcript of the discussion, edited for length, follows.
Can you tell me about the Forge assembly algorithm, and some background on how and why you developed it?
I developed [Forge] probably in early 2000. It was originally designed while I was working at Exelixis and it was initially used on the Ustilago maydis genome, which is a fungal genome. And there we looked at all the available assemblers, and we didn’t have a single machine with a very, very large amount of memory. We had a very traditional cluster with a lot of smaller machines with a little bit of memory, so it was written to take advantage of that environment so that it could use many, many computers and combine the memory. We used that initially on this small fungal genome, which was too big to use Phrap on, which was the next-best alternative at the time.
Fast-forward many years, and I’m working for a genome center now and JGI has a very respectable in-house assembler, Jazz, that’s been used for the majority of the production projects here for years, but then there’s this question of what do you do with 454 data, because it has a very different error model.
When I wrote Forge, I made the alignment process very quantitative. A lot of assemblers have fixed alignment penalties for gaps, but [Forge] actually uses the quality of the underlying data to do the alignments. And this made it relatively easy to adapt and to handle 454 data. And there were a couple of very specific algorithmic changes that were needed to handle the 454 information, because it’s well known that it has these very characteristic errors in homopolymers. … And basically, we were able to demonstrate that it worked pretty nicely on 454 data.
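[Editor's illustration: the idea of letting base quality drive the alignment penalty, rather than using a fixed gap cost, can be sketched roughly as follows. This is not Forge's scoring scheme, which the interview does not describe. The function names, the linear scaling, and the max_penalty value are all assumptions made for the example. Only the Phred definition, where an error probability p corresponds to Q = -10 log10(p), is standard.]

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def gap_penalty(q: float, max_penalty: float = 10.0) -> float:
    """Hypothetical quality-scaled gap penalty (an assumption, not Forge's formula).

    A gap next to a confidently called base costs close to max_penalty.
    A gap next to a doubtful base is cheap, because the base call itself
    may be the error.
    """
    return max_penalty * (1.0 - phred_to_error_prob(q))


# A gap beside a Q40 base is penalized far more than one beside a Q3 base.
for q in (3, 10, 20, 40):
    print(q, round(gap_penalty(q), 3))
```

[Under a scheme like this, the characteristic 454 homopolymer errors could be absorbed by assigning low quality to the uncertain bases in a run, but whether Forge does that is not stated in the interview.]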
It’s also pretty well known that the 454 assembler itself was done in what is called flowspace, which is a slightly different representation for the traces, where if you have a string of one base in a row it’s represented as a single peak rather than as a separate set of bases. I took a much more traditional approach to this and just treated the 454 data exactly the way you would a Sanger sequence, so when there are four As in a row, there would just be four As in the sequence, and you’ve just got to be careful with the alignment.
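[Editor's illustration: a simplified way to picture the difference between the two representations is run-length encoding, where each homopolymer run becomes one (base, length) pair. Real flowspace data is tied to the instrument's nucleotide flow order and signal intensities, so this sketch is only an analogy, and the function names are made up for the example.]

```python
from itertools import groupby


def to_runs(seq: str) -> list[tuple[str, int]]:
    """Collapse each homopolymer run into a (base, run_length) pair."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


def from_runs(runs: list[tuple[str, int]]) -> str:
    """Expand (base, run_length) pairs back into an ordinary base sequence."""
    return "".join(base * n for base, n in runs)


true_read = "AAAATTG"
miscalled = "AAATTG"  # one A dropped from the homopolymer

print(to_runs(true_read))   # run-length view
print(to_runs(miscalled))   # same bases, only the first run length differs
print(from_runs(to_runs(true_read)) == true_read)
```

[In the run-length view a homopolymer miscall changes a single count. In base space, the representation Platt describes keeping, the same error appears as an inserted or deleted base that the aligner has to tolerate.]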
One of the nice side effects of that was you can also just throw Sanger sequence in as well, and you can combine Sanger and 454 data. And we very quickly found ourselves here at the JGI with a lot of these types of projects on our hands, where there’s a lot of 454 data and Sanger, and it’s been good having Forge because we can combine all of those data types simultaneously into one assembly.
If it was possible for you to modify an assembler that was originally developed for Sanger data, then why are some other developers writing new assemblers for next-generation sequencing data?
If you’re starting from scratch and you’re only working with 454 data, flowspace is very elegant. The problem is, once you start mixing in Sanger data, it gets very complicated. … Everybody’s tackling different problems, and I guess the space that I’ve carved out with Forge at the moment is the hybrid assembly.
The most recent project we’ve completed — and we’re writing this up at the moment — is an oomycete pathogen called Phytophthora capsici. This is in collaboration with the National Center for Genome Resources in New Mexico, and this is a pretty big assembly by 454 standards. The genome is around 60 megabases and there were 15 million reads, which were a combination of a lot of 454 data — about 20X coverage — about 5X coverage of Sanger data, and also 454 pairs.
These 454 pairs are really, really short. You have only 18 to 20 bases at each end, so in some ways it’s very similar to what the next-next-generation platforms are producing. It’s so tiny, you’re really just looking at little pinpricks in the grand scheme of things, so again we have to make some small modifications to allow all this data to play nicely together so that Sanger reads are these giants and the 454 reads are in the middle and then the pairs are the little tiny guys.
Are you working with any other next-generation sequencing vendors besides 454?
The vendors have been wonderful. They’ve all been really forthcoming and open about data and data formats and giving us early access to data before the machines exist. So we’ve been working with Solexa [now Illumina] on their platform and we have a Solexa machine here, and we’ve been working with [Applied Biosystems], and they’ve been very good about providing us with test data sets from the SOLiD platform. Basically anybody who has data, we’re interested in looking at it.
I think a couple of challenges with what we’re calling the next-gen 2 platforms — Solexa, ABI, and Helicos — are that the reads are really short and there are very, very many of them, so it’s a challenge for the alignment algorithms to resolve the genome structure from pieces that small. And you need really, really good pairing for that to happen, so the thing we’ve really been waiting for is for paired data sets to come up, and … those data sets don’t really exist yet.
The second problem with the next-generation platforms is that even if you have an algorithm that really works, the scale of the data is just terrifying. If you do 100X coverage of something like E. coli, you can be playing around with 10 [million] to 20 million reads, and that’s about the same number that you would normally put into a whole mammal, for example. We’re thinking of doing plant genomics in the future, so we’ve calculated that you can end up with a billion reads. And we don’t have computer systems that are large enough.
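[Editor's illustration: the read counts quoted here follow from simple arithmetic, reads = coverage x genome size / read length. The sketch below assumes an E. coli genome of roughly 4.6 megabases and 25-base reads, the read length mentioned elsewhere in the interview. Both figures are assumptions for the example, not numbers Platt gave for this calculation.]

```python
import math


def reads_needed(genome_size_bp: int, coverage: float, read_length_bp: int) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


ECOLI_GENOME_BP = 4_600_000  # approximate size, assumed for illustration
READ_LENGTH_BP = 25          # short-read length assumed for illustration

print(reads_needed(ECOLI_GENOME_BP, 100, READ_LENGTH_BP))
```

[With these assumptions the result is 18.4 million reads, which falls inside the 10 million to 20 million range quoted above.]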
I think one of the reasons I’m interested in pushing Forge in this direction is going right back to its origins. It was designed to use many, many computers together and not all of the memory in one machine, so in theory it might stand a better chance of scaling up.
But that’s conjecture, really. We’re still playing around with bacteria and thinking about what problems we’re going to run into as we scale up. I’d be thrilled right now if we could do a bacterium happily with just 25-base-pair reads, and we’re not quite there yet.
So the increase in reads requires more assembly calculations, which is why you would require more computational power?
Right. It’s like a jigsaw puzzle. The picture of the little Swiss cottage stays the same, but if you have a 100,000-piece jigsaw puzzle, that’s different than doing a thousand-piece jigsaw puzzle. There are more pieces that you’ve got to compare with one another, more possibilities for laying them out, and also just storing that amount of information.
The reads are smaller, so there are more of them, but we also think we’ll need more coverage. Typically, we would sequence a genome to about 8-fold coverage. But with these new reads, you might need 50-fold to 100-fold coverage, so you’ve gone up probably 12 times deeper and you have 10 times more reads lengthwise, per unit distance, so it all just sort of pushes the jigsaw puzzle through the roof … and suddenly you need a large disk just to hold the input dataset, let alone to move it around and compare it.
One thing you mentioned was visualizing the reads on these different scales. Do you find that visualization is particularly challenging with next-gen sequencing data?
Yes. The original version of Forge was written on a Dell 300 MHz laptop and I could assemble a whole genome and look at it and play with it quite happily, and that was a modest-sized fungal genome. I struggle now to hold an E. coli dataset on my much newer, bigger, fatter laptop, just because 30 million reads … [becomes] a high-performance computing problem just to retrieve the information, and then you can’t show [users] all the information and have them make sense of it.
I’d imagine, for example, that for finishing in the future, people aren’t going to want to examine every single read and compare it to the consensus. They’re going to want a pretty high-level representation that takes them to problem areas and gives them options for resolving it. So the database infrastructure to retrieve that and organize it and support it is non-trivial.
The traditional bioinformatics Perl, CGI, GIF certainly doesn’t scale up to the level required to display it. You can’t fit it on the screen even if you draw every read as a single pixel. There are just not enough pixels on the screen to get all the data on there, so you’ve really got to think about simplification.
What are the alternatives that you’re considering?
For now, my strategy has been to just use bacteria mainly for testing, and then just walk away from the laptop for a little while and come back when it’s loaded, or enable a tool that can do peephole visualization so that you can just go in and look at one specific part of the assembly.
I think that if you want to start looking at larger assemblies, you need to throw [some of] the information away. So what you might do on the back end is summarize it and say, ‘Look, for a lot of these scaffolds … I’m going to flag that whole region as good and just draw it as a line on the screen.’ Nobody’s ever going to want to look in there and really even inspect it. And then there may be some areas where there is a lot of disagreement and breakage and mispairs, so you might want to visualize just those regions and get a human involved, but the question is whether even a human can resolve something with 50 million to 100 million reads in it.
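[Editor's illustration: the back-end summarization described here could look something like the sketch below, which merges adjacent windows of a scaffold that share the same status so a viewer can draw each clean stretch as one line and expand only the problem regions. The window tuples, the status labels, and the function name are hypothetical. The interview does not say how such a summary would actually be built.]

```python
def merge_windows(windows):
    """Merge adjacent (start, end, status) windows that share the same status.

    Windows are assumed to be sorted by start and to tile the scaffold
    without overlaps.
    """
    merged = []
    for start, end, status in windows:
        if merged and merged[-1][2] == status and merged[-1][1] == start:
            # Extend the previous region instead of starting a new one.
            merged[-1] = (merged[-1][0], end, status)
        else:
            merged.append((start, end, status))
    return merged


windows = [
    (0, 100, "good"),
    (100, 200, "good"),
    (200, 300, "problem"),
    (300, 400, "good"),
]

for region in merge_windows(windows):
    print(region)
```

[A viewer built on a summary like this would only need to load the underlying reads for the regions flagged as problems.]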
I think the good news is that the algorithms will get smarter, and there’s so much data that you won’t have that much ambiguity any more.
I have sort of a toy viewer that I built in my spare time using an experimental programming language and the graphics chip on my laptop. There has actually been a lot of research in the gaming community into really high-performance graphics, and we traditionally don’t use them a lot in bioinformatics visualization, which is more web-based, but you can actually do really nice things using the [graphics processing unit]. You can load the data onto that and it starts rendering fairly large data sets at fairly high speed, but you still can’t get 100 million reads onto your desktop.
I’m aware of people using GPUs for speeding up algorithms like Smith-Waterman, but not for actual graphics rendering in bioinformatics.
I think there’s actually a lot of opportunity to do more exciting graphics on the desktop side again, because it really is a high-performance computing problem. If I can just open a small segment of an assembly at 50X, then I can be rendering 10 million, 20 million polygons pretty quickly to draw it all, and that sounds like a lot of work, but for your average GPU, that’s just one scene from Half-Life, so it doesn’t break a sweat.
I think there are cultural reasons why bioinformaticians don’t make better use of GPUs. I think we have a very open source, platform-neutral world, which is a good thing. It means that Macs work, and Linux works, and Windows desktops work. But it’s very hard to write desktop code in that environment. So my tool works beautifully on my laptop, which is a Windows machine, but it wouldn’t work on a Linux machine or a Mac. The tradeoff is that you’d have to rework it for each of those platforms.
In terms of visualization, it seems like it might be getting difficult to layer more and more of this information on top of the traditional linear representation of the genome.
I also think we don’t give humans enough credit for their sophistication. At the last company where I worked, we did some experiments with Blast visualization, and rather than doing just rectangular blocks for each Blast hit, we actually did a mini-pixelated representation of the alignment: every insertion, deletion, and substitution. And at the first level, it looked like someone had just vomited on the display, with pixels going everywhere. But actually, after a day of using it, [the users] did get very sophisticated at looking through Blast results very quickly, because you’re actually handing them more information. But it’s overwhelming the first time you look at it.
If you go back to just the rectangular, simple bar of color, it’s a very poor substitute after you get used to the richer version. So I think there’s definitely room to push on some of these things a bit more.
The same thing with assembly. We think of that as placing all the reads along a line and then looking at it, but wouldn’t it be nice if you could visualize all the underlying graphs — where one part of a genome is similar to another part, you should see all those connections.
So what’s next for you? What are the next steps in pulling all this together?
We’re training pretty hard with both Jazz and Forge on short reads. The people making the data have got a head-start on us. They’re giving us whole-genome data sets and we need to get those to production level where we can assemble a genome with 25-base-pair reads, maybe with some Sanger data, for bacteria at least.
JGI’s got plans to do some pretty ambitious numbers of microbes in particular over the next five years. The platforms are reaching a capacity where … in a matter of a day or two days you would have eight genomes generated per machine. And then that really puts the pressure on the assembly community to make sure we have fast, reliable, scalable solutions that give you very, very accurate results. And probably shortly thereafter [we’ll] put the pressure on the bacterial community to come up with thousands of DNA samples that are ready for sequencing.
So you don’t want assembly to be the bottleneck.
Bottlenecks move around. They don’t stay in any one place for too long, and the ball is sort of in our court right now. We have data sets to play around with and we know we don’t do a perfect job of assembling them, but then I think once we get the upper hand, my hope is that we’ll push the problem somewhere else. Maybe into annotation, although we’ve been promising them a tidal wave of genomes for years. They’re starting to feel the heat a little bit, but they’ve generally been a little skeptical that we would get out of the production area, particularly with these new platforms. They hear 25 bases and say, ‘I’ll see you in a year.’
Historically, production was limited by financial resources. Anybody could put enough sequencing machines together if they had the dollars. But we’re actually knowledge-limited right now. Even today we could probably completely blow away our current microbial production if we could just reliably do the assembly for 25-mers, with appropriate libraries.
There are even some scenarios where maybe to keep up with the capacity, you would just have to think about doing metagenome mixtures all the time. Maybe the fastest way to get your hands on a couple hundred bacterial genomes is not to isolate 100 individual samples and culture them and grow them up, but just to throw them in a pile and do a joint assembly of the collection. So you can see more production in one of those experiments than the entire history of genomic bacterial sequencing.
You can start to think a lot more creatively once the cost of sequencing drops. And we’re sort of riding on the wave of the much bigger revolution, which is obviously to resequence humans for $1,000. That’s really what’s going to drive the cost down. We’ve got to be a little bit careful to make sure that the goals of the resequencing community don’t diverge so far from the de novo whole-genome shotgun community that we can’t use the platforms for novel genomes. JGI’s mission is primarily novel genomes at the moment. Particularly large genomes.
The hardest thing for us is contemplating doing a plant genome with 25-base-pair reads.