NEW YORK (GenomeWeb) – In a proof-of-principle demonstration that genomes can be assembled from long, noisy nanopore reads alone, researchers at the Ontario Institute for Cancer Research in Toronto and the University of Birmingham in the UK have used data from the Oxford Nanopore MinIon to assemble a bacterial genome into a single contig.
The approach, which uses the Celera assembler to build the 4.6-megabase Escherichia coli genome, is similar in principle to how researchers have assembled genomes using the Pacific Biosciences platform, which also produces long, error-prone reads.
The researchers published their work on the BioRxiv preprint server last week.
Earlier this year, researchers from Cold Spring Harbor Laboratory showed that they can assemble the 12.5-megabase genome of the budding yeast Saccharomyces cerevisiae from MinIon reads, but that project required the nanopore reads to first be error-corrected with the help of short-read Illumina MiSeq data.
The E. coli project, on the other hand, is the first to demonstrate that MinIon nanopore data alone can be used for a de novo genome assembly. The researchers are currently working on improving their method by integrating a consensus calling algorithm that uses the raw nanopore signal rather than reads after base-calling.
According to Jared Simpson, a principal investigator at the OICR and the senior author of the preprint, the workflow for the nanopore assembly is similar to the assembly of PacBio reads using the HGAP assembler. "We really used that HGAP approach as a template," he explained, noting that the workflow is similar but uses different programs at each step.
In short, the researchers first detect overlaps between nanopore reads using the DALIGNER software. Then they error-correct these reads using the partial order aligner or POA software, which applies a directed acyclic graph to compute a multiple alignment to determine a consensus sequence. The corrected reads are then fed into the Celera assembler. The programs used in the workflow already existed, Simpson said, but were applied to nanopore reads in this way for the first time.
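To illustrate why the error-correction step helps, the following is a minimal Python sketch of consensus calling by majority vote over reads that have already been aligned to one another. It is not the POA algorithm, which builds a directed acyclic graph to compute the alignment itself. Here the alignment is simply assumed as input, and the function name `column_consensus` and the example reads are hypothetical.

```python
from collections import Counter


def column_consensus(aligned_reads):
    """Return a majority-vote consensus from pre-aligned reads.

    Each row is one read padded with '-' where it has a gap relative to
    the others. This is a simplified stand-in for graph-based consensus
    (such as POA), not a reimplementation of it.
    """
    if len({len(row) for row in aligned_reads}) != 1:
        raise ValueError("all aligned rows must have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Pick the most frequent symbol in this alignment column.
        symbol, _count = Counter(column).most_common(1)[0]
        # If most reads have a gap here, the column is a likely insertion
        # error in a minority of reads, so it is left out of the consensus.
        if symbol != "-":
            consensus.append(symbol)
    return "".join(consensus)


# Hypothetical overlapping reads, each carrying at most one error.
rows = [
    "ACGT-ACGTTAG",  # error-free
    "ACGTAACG-TAG",  # one inserted A and one deleted T
    "ACCT-ACGTTAG",  # one mismatch (G read as C)
    "ACGT-ACGTTAG",  # error-free
    "ACGT-ATGTTAG",  # one mismatch (C read as T)
]

print(column_consensus(rows))
```

Because the errors in this toy example fall at different positions in different reads, each one is outvoted by the other reads covering the same column.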
Data for the project – 2D reads from four separate MinIon runs using the R7.3 chemistry – were generated by Nick Loman's lab in Birmingham, an early-access user of the MinIon technology. In total, the project generated about 22,300 2D reads, or about 134 megabases of data, representing about 29x coverage of the E. coli genome.
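The coverage figure follows directly from the numbers reported above. The short Python sketch below reproduces the arithmetic using the article's rounded values (134 megabases of data, a 4.6-megabase genome, 22,300 reads), and the average read length it derives is therefore only approximate. The function and variable names are illustrative.

```python
def fold_coverage(total_bases, genome_size):
    """Average number of times each genome position is covered by reads."""
    return total_bases / genome_size


def mean_read_length(total_bases, n_reads):
    """Average read length in bases."""
    return total_bases / n_reads


# Rounded figures as reported in the article.
TOTAL_BASES = 134_000_000  # about 134 megabases of 2D read data
GENOME_SIZE = 4_600_000    # about 4.6 megabases for E. coli
N_READS = 22_300           # about 22,300 2D reads

coverage = fold_coverage(TOTAL_BASES, GENOME_SIZE)
mean_len = mean_read_length(TOTAL_BASES, N_READS)

print(f"coverage: about {coverage:.0f}x")
print(f"mean read length: about {mean_len:,.0f} bases")
```

On these figures, the average 2D read works out to roughly 6 kilobases.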
The largest contig in the final assembly covered the entire E. coli genome and had about 4,000 mismatches and 47,400 insertion or deletion errors, compared to the E. coli K-12 MG1655 reference genome.
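For a sense of scale, the reported error counts can be converted into an approximate per-base error rate. The Python sketch below assumes the rounded 4.6-megabase genome size as the denominator and counts each mismatch and each insertion or deletion as one error; the preprint may define these quantities differently, so the result is an estimate only.

```python
# Rounded figures as reported in the article.
MISMATCHES = 4_000
INDELS = 47_400
GENOME_SIZE = 4_600_000  # assumed denominator: about 4.6 megabases

total_errors = MISMATCHES + INDELS
error_rate = total_errors / GENOME_SIZE
identity = 1.0 - error_rate
indel_share = INDELS / total_errors

print(f"total errors: {total_errors:,}")
print(f"approximate error rate: {error_rate:.2%}")
print(f"approximate identity to reference: {identity:.2%}")
print(f"share of errors that are indels: {indel_share:.1%}")
```

Under these assumptions, the assembly differs from the reference at roughly 1.1 percent of positions, and more than nine in ten of those differences are insertions or deletions.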
Errors were especially common in homopolymer regions of the genome, which the scientists wrote was to be expected because the electric current might not change, or the change might be hard to detect, when several bases of the same kind travel through the nanopore in succession.
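The homopolymer problem can be illustrated with a simplified model in Python. The sketch assumes that the measured current depends only on the short stretch of bases (a k-mer) inside the pore at a given moment, and that consecutive identical k-mers produce no detectable change in signal. The value k = 5 and the function name `distinct_kmer_events` are assumptions made for illustration, not details taken from the preprint.

```python
from itertools import groupby


def distinct_kmer_events(sequence, k=5):
    """List the k-mers seen as a sequence passes through the pore,
    collapsing runs of identical consecutive k-mers into one event.

    Toy model: it assumes the signal changes only when the k-mer in the
    pore changes, and it ignores noise and dwell time entirely.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    # groupby merges adjacent duplicates, mimicking an unchanged current.
    return [kmer for kmer, _run in groupby(kmers)]


# Two hypothetical sequences that differ only in homopolymer length.
seven_a = "GCT" + "A" * 7 + "CGT"
eight_a = "GCT" + "A" * 8 + "CGT"

events_7 = distinct_kmer_events(seven_a)
events_8 = distinct_kmer_events(eight_a)

print(events_7)
print("indistinguishable in this model:", events_7 == events_8)
```

In this model, any homopolymer longer than k bases yields the same sequence of events regardless of its exact length, so the length has to be inferred from other information such as how long the signal stays at one level.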
To improve the assembly further, the researchers are in the process of incorporating into the workflow an algorithm, developed by Simpson's group over the last two months, that calculates consensus sequences directly from the raw electrical nanopore signals. "That's going to be the real computational advance for our paper, rather than this proof of principle that you can get a one-contig assembly out of nanopore data," Simpson said.
He does not know yet how much the assembly will improve with the new algorithm, but said that homopolymer errors persist to some degree. His team plans to update the preprint with the new consensus calling approach within the next two weeks, prior to submitting the work to a scientific journal.
Down the road, he and his colleagues plan to optimize the pipeline to make it more user friendly and to decrease the run time. They also want to apply it to larger genomes, such as S. cerevisiae, and to genomes with more extreme GC-content.
When it comes to genome assembly, nanopore technology might take a similar trajectory to Pacific Biosciences' platform. Just a few years ago, researchers avoided PacBio for genome assemblies because of the technology's high error rate, Simpson said. But that has changed with new computational approaches tailored to the data, and PacBio has become well established for high-quality assemblies, most recently of human-sized genomes. "Once somebody shows that you can get assemblies out of it, other people come in and improve upon those ideas, and the field really takes off," he said. "I expect the same thing here. This is really just a starting point."
At the moment, the accuracy of the MinIon reads is still lower than that of PacBio reads, but both platforms deliver reads long enough to span many repeat regions in the genome and allow for long contiguous assemblies, he said. The two instruments also differ considerably in size and required infrastructure.
It is unclear, though, how good assemblies from nanopore data will eventually be able to get. "Because the [MinIon] platform is so new and the algorithms are just a few months into development, we are not close to reaching the limit of what the platform can provide," Simpson said. "I think it's still too early to say whether there are fundamental limitations, or what those limitations are."