Researchers from the University of California, San Diego, University of Massachusetts, and the J. Craig Venter Institute have taken a "meta" approach to assembling short-read sequence data.
In a paper published last month in PLoS ONE, the team demonstrated that the so-called Meta-Assembly approach could be used to assemble a complete bacterial genome de novo without Sanger reads.
According to the team, Meta-Assembly has two key elements. It starts with a combination of data from the Illumina and 454 sequencing platforms and integrates these two data types “early in the process to maximally leverage their complementary characteristics,” the researchers wrote.
“Second, we used the output of different short read assembly programs in such a way as to leverage the complementary nature of their different underlying algorithms,” they said.
The approach has four phases. In the first phase, rather than using Illumina reads to correct errors in an assembly generated from 454 reads, the researchers integrated both types of sequence data at the beginning of the assembly to generate hybrid contigs. This step “reduced the number of degenerate nucleotides in the assembly compared to when they are used just for error correction,” the researchers wrote.
In the second phase, the researchers “maximized the complementary information provided by different assembly algorithms.” In this step, the team used a combination of algorithms, including Euler-SR, Velvet, and Roche’s Newbler, to create an assembly composed of four scaffolds.
In the third phase, the researchers used PCR to order the scaffolds into a circular genome. Finally, the team aligned Illumina reads against the ordered scaffold to account for indels and errors that may have occurred in the second phase.
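The four phases above can be sketched as a simple pipeline. The following is a hypothetical illustration, not the authors' code: each phase is a placeholder function (all names are invented for this sketch), where a real pipeline would shell out to assemblers such as Velvet, Euler-SR, and Newbler and to alignment tools on actual read data.

```python
# Hypothetical sketch of the four-phase Meta-Assembly pipeline.
# Stand-in logic only; real phases would invoke external assemblers.

def hybrid_assembly(illumina_reads, reads_454):
    # Phase 1: integrate both read types up front into hybrid contigs,
    # rather than using Illumina reads only for late error correction.
    # Stand-in: pool the reads and treat them as "contigs".
    return sorted(set(illumina_reads) | set(reads_454))

def combine_assemblers(contigs):
    # Phase 2: run multiple assemblers with complementary algorithms and
    # merge their outputs into scaffolds. Stand-in: partition the contigs
    # into four scaffolds (the paper's assembly produced four).
    n_scaffolds = 4
    return [contigs[i::n_scaffolds] for i in range(n_scaffolds)]

def order_scaffolds(scaffolds):
    # Phase 3: PCR evidence fixes the order and orientation of scaffolds
    # on the circular genome. Stand-in: keep the given order.
    return list(scaffolds)

def polish(ordered_scaffolds, illumina_reads):
    # Phase 4: re-align Illumina reads against the ordered scaffolds to
    # catch indels and errors from phase 2. Stand-in: no-op.
    return ordered_scaffolds

def meta_assembly(illumina_reads, reads_454):
    contigs = hybrid_assembly(illumina_reads, reads_454)
    scaffolds = combine_assemblers(contigs)
    ordered = order_scaffolds(scaffolds)
    return polish(ordered, illumina_reads)
```

The sketch only conveys the data flow between phases; the substance of the method lies in which tools are run at each step and how their outputs are reconciled.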
Last week, BioInform spoke with two of the researchers involved with the project. Christian Barrett is currently studying cancer genomes in the Division of Genome Information Sciences at the UCSD Moores Cancer Center. Harish Nagarajan is a graduate student in bioinformatics and systems biology and in the lab of Bernhard Palsson in the bioengineering department at UCSD.
Below is an edited version of that interview.
Could you give some background on the kinds of challenges researchers face when assembling short reads de novo?
Christian Barrett: Historically, genome sequencing has been done by the Sanger method, which is quite accurate, but it's very low throughput, so you don’t get a lot of DNA sequence for the labor you put in. The second-generation tools that we’re using that have become available in the last few years can generate a lot of sequence data. It’s not quite as accurate as Sanger sequencing but there is a lot more of it.
With a lot more data you can compensate for the slightly higher error rate, but the technologies currently generate only short pieces of DNA — so 36, 50, now upwards of 100 base pairs. Due to the complexity of the sequences in genomes, putting these short pieces together to get a complete genome is very difficult. What we have shown is a way to actually put these very short pieces together to get a complete genome.
Harish Nagarajan: You might ask why we should go for these short reads. Even if it’s labor intensive, why not just go for the long, traditional approach? But these new technologies give you a lot more data and [are] also cheaper, much less labor intensive. The time [it takes to] get one run of these is also very fast compared to the long, drawn-out processes of the traditional technologies.
Over the last year and a half, many people have tried to come up with great algorithms to put these [short reads] together into longer sequences, but this is [one of] the first where you actually get one complete genome.
Tell me more about past attempts to address these challenges.
CB: There is a commonly adopted approach, and we demonstrate in the paper that if we had followed it, we would have ended up with a much lower quality assembly. One standard approach is to combine technologies. There are a couple of different sequencing machines that generate short read data but with slightly different characteristics. One common approach is to take the data from one of these machines, try to create a genome assembly from that data, and [then] use the other data to “clean” that assembly, that is, to do error correction. I think there are a couple of other techniques but they are minor variations of this. There are also more specialized, not widely adopted techniques, but I think I just described the dominant one.
HN: I would probably categorize past attempts into two different phases. The first phase would probably be where people first came out with algorithms to build large contigs, on the order of about 1,000 to 10,000 base pairs, from 100-base-pair reads, and then the second phase where people integrated different technologies to get better assemblies.
Our approach is a major improvement of the second phase.
Describe how the Meta-Assembly approach works.
HN: We call it Meta-Assembly for two main reasons. [One,] because we integrate two different types of sequencing technologies: Illumina and 454. The second integrative aspect of this whole process is that we used not only different technologies but also different assembly algorithms, and exploited the complementary nature of those existing algorithms to maximize the information that we could get.
To give more detail, let’s say broadly our approach is classified into four phases. The first phase is the hybrid assembly where we integrate Illumina and 454 read technology. The difference is basically the timing of the integration of the technologies. We integrate Illumina and 454 reads up front and then generate the draft assembly whereas the traditional approach has been using 454 reads to generate the draft assembly [and] then just [using] Illumina reads on top of it. This [step] gave a significant improvement in the quality of the assembly.
Then the second part of our approach is where we maximize the complementary nature between the two different assembly algorithms.
More importantly, during the second phase, at each stage of the assembly, we look at functionality and then check whether there are any duplications in the genome and look for complex features. This is one additional feature of our approach: we actually look at the assembly in the context of that organism as well, instead of just treating it as a sequence of letters.
In our third phase, basically we use a standard PCR-based approach to order and orient these scaffolds into a circular genome. This is traditionally confused with gap closing but it’s actually not that. You are not adding any sequence information; you are just determining how scaffolds A and B should be ordered and oriented relative to each other. Should they be right next to each other, should one be flipped the other way around — that kind of information.
The last phase is where we do a finishing step.
In summary, the difference would be that traditional approaches adopt just the first phase, in a suboptimal manner, and only an aspect of the last phase. We have improved those two phases and added two extra phases in our approach.
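The order-and-orientation information Nagarajan describes in phase three can be made concrete with a toy sketch. The following is purely illustrative (the function names and the layout representation are invented for this example, not taken from the paper): a layout is a list of (scaffold index, strand) pairs, the kind of information PCR products would pin down, and laying out the scaffolds adds no new sequence.

```python
# Toy illustration of ordering/orienting scaffolds (hypothetical names).

def reverse_complement(seq):
    # Reverse-complement a DNA string, e.g. "ACG" -> "CGT".
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[base] for base in reversed(seq))

def lay_out(scaffolds, layout):
    # layout: list of (index, strand) pairs, e.g. [(1, "+"), (0, "-")].
    # Concatenates the scaffolds in the given order, flipping any scaffold
    # placed on the "-" strand. No sequence is added, only arranged.
    parts = []
    for idx, strand in layout:
        seq = scaffolds[idx]
        parts.append(seq if strand == "+" else reverse_complement(seq))
    return "".join(parts)
```

The point of the toy is that the total sequence content is identical before and after: only the arrangement of the scaffolds changes, which is why this step is distinct from gap closing.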
Please elaborate a bit more on the algorithms that you used.
CB: The main algorithms are basically the most popular short-read assembler algorithms. Typically they are Euler, Velvet, and the Newbler assembler, which is a program that comes with the 454 sequencing technology.
What we have done is use them as components in our approach.
So the main difference between your approach and other approaches has to do with when the sequence data is integrated?
CB: There are two classes of approach. One says I have one tool and my sequence data. I am going to try to do as well as I can with this one tool and this data. [Another approach says] well I have one or two different types of data and I have more than one tool and I am going to try to use those tools with these data in a certain way and see how well I can do.
Our approach fits into the second category but we actually look closer at the algorithms and how they could complement each other because different algorithms have different strengths and weaknesses. We want to say what are the strengths of this one and what are the weaknesses and then we can complement that with another tool.
We also look at the data and say, 'There is information in this data — the information being the complete genome sequence — so is there a way to pull out more information from this data?' We found that by integrating the data earlier and in multiple ways with different applications of these different tools, we could actually extract more information.
Until the day comes when a machine can sequence megabases with little error and then you can just do your assembly in Microsoft Word, you are going to need to combine tools with the data in smart ways to get to the final answer, which is the complete genome.
In your paper, you compare your Meta-Assembly approach to other approaches. Can you tell me a little bit about your findings?
HN: We just compared the three algorithms that we used as components of the Meta-Assembly. Essentially, if you use each one of them as the sole algorithm, you are going to get a much lower quality assembly; you are not going to get a complete genome, whereas if you combine them in the way that we did, you are definitely getting a complete genome.
CB: And we also compared [our approach] to what [would happen] if we had just used the commonly adopted approach and we found that you get a much less complete assembly.
For this study, you used Illumina and 454 data. Have you used any data from other sequencers?
HN: Not really; we just used it on Illumina and 454 reads. These two are the main technologies that are being used predominantly in the community. I think we can adapt our approach to new technologies; we just haven’t.
CB: There is Illumina, 454, and now actually, the SOLiD by Life Technologies is probably the other major sequencing technology. I am not sure if the algorithms we used would have been adapted to the output from that technology.
But this points to the paradigm that this paper embodies: as new technologies come along, and new tools are developed for the existing technologies, those can be put into the general framework that we are describing, but it’s an open problem how best to do it.
What’s demonstrated here is that if you have Illumina and 454 and these tools available, here is a way to maximize your information gain. But as new tools come along, the strategy we’ve outlined will have to be modified, and it’s going to be an evolving assembly approach.
What are next steps for you?
HN: We are trying the same approach on a few other genomes; some of them are biological, some of them are very clinical. We are also contemplating whether we will integrate different complementary data types, like an optical mapping type of approach, into the whole pipeline, and that could be a completely new paradigm in the whole space.