This week, a team of scientists from Washington University in St. Louis, Iowa State University, the US Department of Agriculture, the University of Arizona, and Cold Spring Harbor Laboratory released the draft sequence of the corn genome at the 50th Annual Maize Genetics Conference in Washington, DC.
The three-year project, which cost approximately $30 million, was particularly difficult due to the complexity of corn’s roughly 2 billion base pair genome: It contains 50,000 to 60,000 genes — twice as many as humans — as well as long stretches of repetitive code and tens of thousands of mobile elements.
These features posed a number of challenges for genome assembly, so the consortium used a set of software tools developed at Iowa State by Patrick Schnable, director of the Center for Plant Genomics; Srinivas Aluru, professor of electrical and computer engineering; and PhD student Ananth Kalyanaraman.
One program, called PaCE (for Parallel Clustering of ESTs), can speed assembly, while another, called LTR_par, uses parallel programming to identify long terminal repeat retrotransposons, mobile genetic elements that can cause mutations, gene duplications, and chromosome rearrangements.
Both packages were designed to run on CyBlue, Iowa State's 5.7-teraflop IBM Blue Gene/L system.
BioInform spoke to Aluru this week about how these tools helped make sense of the highly repetitive corn genome sequence. The following is an edited transcript of the conversation.
Can you outline some of the computational challenges in assembling the corn genome and why you needed to develop some new software to do this?
The chief difficulty comes from the abundance of repeats in the maize genome. Estimates put it anywhere between 65 and 80 percent repetitive. And the problem with repeats in assembly is that we can only sequence short fragments, like 700 to 850 [bases]. So when you have a large number of these short fragments, if the genome has many repeats, you will find many alignments or overlaps that look like they come from the same part of the genome, but because they are repeats they come from different parts.
The more repeats, the more difficult it is to do the assembly and the higher the chance of mistakes. So the chief difficulty in assembling the corn genome comes from the abundance of repeats, which is much worse than in the other genomes that have been sequenced so far.
Another thing is that these repeats have a short evolutionary history, which means that they didn’t have as much of a chance to change over time, and therefore they look identical. So it would be hard to separate them by looking at the accumulated mutations over time.
So this problem with repeats isn’t something that could be addressed by deeper coverage in the sequencing process?
Deeper coverage can only go so far. The main problem is that the fragments look like they are from the same place even though they’re all over the place. And deeper coverage by itself can’t address this problem. It’s definitely useful, in particular if the coverage is also obtained by having clone pair information … That is kind of helpful in terms of resolving ambiguities, but coverage by itself doesn’t solve the problem.
The other reason we needed to develop new tools is that if you take a traditional assembler — and there are over a dozen assembly programs developed so far — in principle what they do is look for good alignments between the input sequences, and then use that information to put them together. Now if you have a lot of repeats, what happens is that many of the sequences overlap, so if we’re looking at a genome with 65 [percent] to 80 percent repeats, the number of overlaps would be huge. So in general if I take a genome to be kind of like a random sequence, I would expect that the number of overlaps grows linearly with the data size. But if you have lots of repeats, in the worst case it could grow as high as quadratically.
So you need new algorithms to address that or else it will just overwhelm the assembler.
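The linear-versus-quadratic growth Aluru describes can be illustrated with a toy count. The sketch below is not taken from any assembler. It assumes a deliberately simplified worst case in which every read from a repetitive region appears to overlap every other repetitive read, while each unique read overlaps only a fixed number of neighbors. The function name, the `neighbors_per_read` parameter, and its default value are hypothetical.

```python
def candidate_overlaps(num_reads, repeat_fraction, neighbors_per_read=10):
    """Toy estimate of how many read pairs an assembler must consider.

    Assumptions (illustrative only):
      * reads from unique sequence overlap about `neighbors_per_read`
        other reads each, so their overlaps grow linearly with num_reads;
      * all repetitive reads belong to one repeat family and every pair
        of them appears to overlap, which is the quadratic worst case.
    """
    repeat_reads = round(num_reads * repeat_fraction)
    unique_reads = num_reads - repeat_reads

    # Each unique overlap involves two reads, so halve to count pairs once.
    unique_pairs = unique_reads * neighbors_per_read // 2

    # Every pair of repetitive reads is a candidate overlap: n choose 2.
    repeat_pairs = repeat_reads * (repeat_reads - 1) // 2

    return unique_pairs + repeat_pairs


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        print(n, candidate_overlaps(n, 0.0), candidate_overlaps(n, 0.5))
```

Under these assumptions, doubling the number of reads doubles the overlap count for a repeat-free genome but roughly quadruples it when half the reads are repetitive.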
Can you walk me through the PaCE and LTR_par algorithms you developed?
I should first say that the maize genome sequencing is done using a very old technique because of the difficulty in doing it. We have been doing whole-genome shotgun sequencing for many of the recent genomes — the human, the mouse, the chimpanzee — but for the maize genome, the team at NSF decided to go backwards in time, so we are using a BAC-by-BAC sequencing strategy. This strategy is basically to cut the genome into bacterial artificial chromosomes, and each of them is about 175,000 to 250,000 base pairs in length. So we come up with a large number of these BACs to cover the genome, and then find a minimum tiling path of these BACs that covers the entire genome, and then go and sequence each BAC separately.
So we never actually had to put the whole genome together like whole-genome shotgun sequencing. And this conservative approach was adopted due to the emphasis on the quality of the result. So we’re not looking at taking 30 million or 40 million fragments directly and assembling them, but we are actually sequencing each BAC separately, so that’s a much smaller problem.
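The minimum tiling path mentioned above is, in its simplest form, an interval-cover problem. The sketch below assumes that each BAC already has known start and end coordinates on a physical map, which is an idealization: in practice those positions have to be inferred. It uses the standard greedy strategy of always extending coverage with the BAC that reaches farthest. The function name and the sample coordinates are hypothetical and are not the consortium's actual procedure.

```python
def minimum_tiling_path(bacs, genome_length):
    """Pick a smallest set of BACs covering positions [0, genome_length).

    bacs: list of (name, start, end) tuples with half-open intervals.
    Greedy rule: among BACs starting at or before the covered frontier,
    take the one that ends farthest to the right.
    """
    ordered = sorted(bacs, key=lambda bac: bac[1])
    path = []
    covered_to = 0
    i = 0
    while covered_to < genome_length:
        best = None
        # Scan every BAC that starts inside the region covered so far.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"no BAC covers position {covered_to}")
        path.append(best)
        covered_to = best[2]
    return path


if __name__ == "__main__":
    # Hypothetical BACs with made-up coordinates, in kilobases.
    bacs = [("A", 0, 200), ("B", 150, 400), ("C", 180, 380),
            ("D", 350, 600), ("E", 390, 600)]
    print([name for name, _, _ in minimum_tiling_path(bacs, 600)])
```

With these made-up coordinates the path is A, B, D: C and E are skipped because they add no coverage beyond what the chosen BACs already provide.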
So in that context I can tell you what these new assembly tools do.
PaCE is actually not a program that does a single task, but it’s a parallel framework that can be tweaked in different ways to do different things. What it does in general is that if you give it a large collection of DNA sequences, you can specify certain rules, and based on those rules it will bin those sequences.
For example, you might say that the rule is that if there are alignments between pairs of sequences that have a certain percent identity at a certain length, then we would like to put them in the same bin, and we would also like to do this transitively. So, for example, if A overlaps B, and B overlaps C, then I would like to put A and B in the same bin and B and C in the same bin, but by transitivity A and C also end up being in the same bin.
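The transitive binning rule in this example can be sketched with a union-find (disjoint-set) structure. This is a minimal serial illustration, not PaCE's actual implementation, which is parallel. It assumes the pairs that satisfy the alignment rule have already been computed, and the function and variable names are hypothetical.

```python
def cluster_transitively(sequence_ids, overlapping_pairs):
    """Group sequences into bins so that overlaps are applied transitively.

    sequence_ids: iterable of sequence names.
    overlapping_pairs: iterable of (a, b) pairs that passed the rule
        (for example, a minimum percent identity over a minimum length).
    Returns a sorted list of bins, each a sorted list of names.
    """
    parent = {seq: seq for seq in sequence_ids}

    def find(x):
        # Walk up to the bin representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in overlapping_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two bins

    bins = {}
    for seq in parent:
        bins.setdefault(find(seq), []).append(seq)
    return sorted(sorted(members) for members in bins.values())


if __name__ == "__main__":
    # A overlaps B and B overlaps C, so A and C share a bin; D stays alone.
    print(cluster_transitively(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
```

Here A and C land in the same bin even though no A-C overlap was supplied, which is the transitivity described above.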
There are a couple of ways we used this tool. Before the maize genome sequencing project, NSF funded the pilot sequencing of maize [because] they wanted to try out a couple of sequencing techniques, sample the genome and so on.
These projects finished in early 2003, and they generated over a million sequences. So one thing that we would like to do in this project is not only take advantage of the data that is coming out of Washington University, but also use the sequences that were generated before.
We could use PaCE to do this. What we can do is take these BAC sequences that were sequenced, and say I want each bin to represent one BAC sequence, and now I want to take all of these million-plus sequences that were generated before, and I want to bin them according to the BACs. Because if I do that, not only can I do the assembly using the BAC sequences, but I can also pull in previously sequenced pieces, and then I could enhance the quality of the assembly.
Another example is once we do the assembly of all these BACs, then we want to take all these BACs together and then analyze them. So we started out with the minimum tiling path for these BACs, but it would be nice to independently verify that tiling path to see if it was correct, because it’s possible that some mistakes could have been made.
Another thing we could do is take all of these BACs and assemble them into BAC supercontigs to get long stretches of the genome that are as long as possible.
So there are many ways that you can tweak the software for various applications.
It’s parallel software, and all of our work is done on an IBM Blue Gene. It’s a one-rack system that has about 1,024 nodes and twice as many processors.
Was all the assembly for the maize project done on that machine?
The lion’s share of the work was done by Washington University. They generated all the sequences and they also did initial assemblies themselves, because the BAC is pretty small. You can take those fragments and use a traditional assembler to do the assembly since you’re not doing whole-genome shotgun assembly. So they have done these BAC assemblies themselves as well.
We used PaCE to pull in these sequences that were generated earlier, and then we gave it to them, saying, ‘These sequences can be added while you’re doing the BAC assembly.’
Another thing is that once they have done the BAC assemblies, we verify them and also figure out how they overlap and try to build contigs and supercontigs and so on. So some of the assembly work, and especially the BAC assembly, is done at Washington University.
So you glue it all together?
Our job is to refine the assemblies and help them produce better assemblies by telling them what sequences are going into individual BACs. And also we’re trying to glue them together.
In fact, we have a lot of exciting work that we need to do over the next year, because what is being released now is the draft genome, and we still have to work towards refining that.
Is that what the LTR_par software is being used for?
Yes. The LTR_par software is also useful in terms of improving the assemblies. Genomes have mobile elements called retrotransposons, which are sometimes called jumping genes. They go and reinsert themselves somewhere and they keep proliferating.
In the maize genome, more than 50 percent of the sequence is due to these jumping genes or mobile retrotransposons, and a large class of these mobile elements are called LTR retrotransposons, where LTR stands for long terminal repeat. The characteristic feature of these things is that at either end they have a long sequence that is almost identical. So LTR_par is a parallel program that can scan through a genome and identify potential LTR candidates. And then, once we have an understanding of these LTRs, we can take them and then look for repeats.
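The defining signature described here, two nearly identical stretches flanking an internal region, can be searched for with a brute-force scan. The sketch below is not how LTR_par works. It is a serial toy that assumes a fixed repeat length, allows substitutions only (no insertions or deletions), and uses hypothetical parameter names and a made-up sequence.

```python
def identity(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def find_ltr_candidates(seq, ltr_len, min_gap, max_gap, min_identity):
    """Report window pairs that look like the two ends of an LTR element.

    A candidate is a pair of length-`ltr_len` windows separated by an
    internal region of min_gap..max_gap bases whose identity is at least
    `min_identity`. Returns (left_start, right_start, identity) tuples.
    Overlapping, slightly shifted candidates are not merged.
    """
    candidates = []
    n = len(seq)
    for left in range(n - ltr_len + 1):
        left_window = seq[left:left + ltr_len]
        first_right = left + ltr_len + min_gap
        last_right = min(left + ltr_len + max_gap, n - ltr_len)
        for right in range(first_right, last_right + 1):
            score = identity(left_window, seq[right:right + ltr_len])
            if score >= min_identity:
                candidates.append((left, right, score))
    return candidates


if __name__ == "__main__":
    ltr = "ACGTTGCA"
    diverged = "ACGTTGCT"  # same repeat with one substitution
    toy = "TT" + ltr + "GGGGGG" + diverged + "CC"
    for hit in find_ltr_candidates(toy, 8, 4, 8, 0.85):
        print(hit)
```

In the toy sequence the two copies start at positions 2 and 16 and agree at 7 of 8 bases, so that pair is among the reported candidates. Real LTRs are hundreds to thousands of bases long, so a practical tool would need seeding, proper alignment, and parallelism rather than this exhaustive comparison.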
What we did was develop a database of repeats in maize, and then we turned it around and used it as an assembly tool. So suppose that I have two contigs, and … I find an LTR at the end of one of them and an LTR at the beginning of the other, then that’s an indication that maybe they are close together on the genome, so that way we can scaffold these contigs without actually knowing the sequences in between.
So instead of looking at the repeats as a problem, we said what if we turn them into an advantage by taking our knowledge of known repeats and then using them to do scaffolding. So we were able to do that and improve the scaffolding.
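The scaffolding idea, linking a contig that ends in a known LTR to one that begins with an LTR, might be sketched as follows. This is an illustration under strong assumptions, not the group's method: it treats the repeat database as a small dictionary of exact sequences, requires exact matches at contig ends, and requires both LTRs to come from the same family. All names and sequences are hypothetical.

```python
def propose_scaffold_links(contigs, known_ltrs):
    """Suggest ordered contig pairs that may be adjacent in the genome.

    contigs: dict mapping contig name -> sequence.
    known_ltrs: dict mapping repeat family name -> LTR sequence
        (a stand-in for a repeat database).
    A link (a, b, family) means contig `a` ends with the family's LTR
    and contig `b` starts with it, hinting that `a` precedes `b`.
    """
    ends_with = {}
    starts_with = {}
    for name, seq in contigs.items():
        for family, ltr in known_ltrs.items():
            if seq.endswith(ltr):
                ends_with.setdefault(family, []).append(name)
            if seq.startswith(ltr):
                starts_with.setdefault(family, []).append(name)

    links = []
    for family, upstream in ends_with.items():
        for a in upstream:
            for b in starts_with.get(family, []):
                if a != b:
                    links.append((a, b, family))
    return sorted(links)


if __name__ == "__main__":
    contigs = {
        "c1": "GGGG" + "ACGTTGCA",
        "c2": "ACGTTGCA" + "CCCC",
        "c3": "TTTTTTTT",
    }
    print(propose_scaffold_links(contigs, {"fam1": "ACGTTGCA"}))
```

A link produced this way is only a hint. In a genome with many copies of the same repeat family, several contigs may match, so such links would need to be weighed against other evidence before contigs are actually joined.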
Even with this sequencing, on average we could not completely get the BACs sequenced. There would be on average about 10 contigs per BAC, so anything we can do to scaffold is very useful.
What are you working on now? You mentioned that there is a lot of work ahead now that the draft is in place. Are you refining these tools or building new ones?
We have already done a few things. One of them is that we used the tools we already developed to do an analysis of the sequences that we have so far, and by doing so we discovered about 350 novel genes.
These are genes that we found that are only in maize and not in any other organism known so far. So they are not in rice, they are not in sorghum, they are not in any other cereal crops. And we did biological experiments to validate them and we found that many of them indeed are genes, in the sense that they are being expressed by the maize plant … That was published in the Proceedings of the National Academy of Sciences in 2005.
Maize has another interesting feature. It has multiple copies of the same gene that are almost identical. We call them NIPs, for nearly identical paralogs. We found that there are several genes that are found in nearly identical copies of themselves, and this also complicates the assembly because when you try to run them through the assembler, it looks like they are the same copy and therefore the NIPs often get collapsed into one copy.
So we did some work on identifying the NIPs and trying to separate these copies. So that is the second thing that came out of these tools.
In the future we would like to do a little bit more work on refining the draft genome, and also do comparative genomics — so for example, compare maize with rice, and compare maize with sorghum and through comparative analysis learn about all of these organisms.