NEW YORK (GenomeWeb) – New graph-based assembly algorithms for long-read sequencing data promise to deliver genomes in less time, reducing the cost of data analysis.
Their development mirrors a similar improvement of short-read assemblers a decade ago, according to Pavel Pevzner, a professor of computer science at the University of California, San Diego, who developed one of the new genome assemblers, called Flye.
Flye and wtdbg2 are both genome assembly algorithms that use a De Bruijn graph-based approach, like the one Pevzner developed for SPAdes, a short-read assembly algorithm released in 2012.
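As a rough illustration of the general De Bruijn graph idea, the sketch below breaks error-free reads into k-mers, links each k-mer's prefix to its suffix, and then walks every edge once to spell out a sequence. It is a textbook toy, not the implementation used by Flye, wtdbg2, or SPAdes; the function names, the reads, and the choice of k are all assumptions made for the example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Link each k-mer's (k-1)-mer prefix to its (k-1)-mer suffix."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_walk(graph):
    """Hierholzer's algorithm: traverse every edge of the graph exactly once."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg[n] == 1),
        min(nodes),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path


def spell(path):
    """Turn a walk over (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical error-free reads drawn from the made-up sequence ATGGCGTGCA.
reads = ["ATGGCGT", "GGCGTGC", "GCGTGCA"]
assembled = spell(eulerian_walk(build_de_bruijn(reads, k=4)))
print(assembled)
```

On these three toy reads the walk recovers ATGGCGTGCA. Real long reads carry errors and genomes contain repeats, so the graph is rarely this clean.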
"For me it was almost like déjà vu," Pevzner told GenomeWeb. "The same computational approach we introduced for short reads can work well for long reads. You start from reads and you can create a network of roads and bridges that answers the question, 'How do you walk this network in a way that spells out the genome?'"
For labs that have compared Flye and wtdbg2 to Canu, another long-read assembler that works by finding overlaps between reads, the results have confirmed that the graph-based algorithms are faster.
Nick Loman, a microbial geneticist and bioinformatics expert at the UK's University of Birmingham, has run both Flye and wtdbg2 on metagenomic samples sequenced with long-read technology from Pacific Biosciences and Oxford Nanopore. "Wtdbg2 and miniasm [an older graph-based assembler for long reads] are probably the fastest," he said. "Flye is in the middle, and Canu, in our hands, is typically the slowest one."
Though fast on metagenomic samples, wtdbg2 was developed for assembling human genomes, according to Heng Li, a bioinformatician at the Dana-Farber Cancer Institute and co-author of a bioRxiv preprint published in January that introduced the algorithm. Jue Ruan, of the Chinese Academy of Agricultural Sciences' Agricultural Genomics Institute in Shenzhen, led development of the algorithm. "It can assemble a human genome in a couple of days, and we can do the assembly for several dollars, maybe $10," Li said.
Pevzner had more specific claims about Flye's speed, which translates into reduced cost. "The speed is an order of magnitude faster than Canu, the run time is much lower, across all genomes," he said. By corollary, the cost of computing is "reduced by a factor of 10," he said.
"With Canu, we reached a point where the cost of computing exceeded the cost of generating data," he said. "With Flye, we're once again where the cost of computing is below the cost of generating data."
Pevzner's team published a paper describing the algorithm and their benchmarking studies in Nature Biotechnology last month.
"It's nice work," Adam Phillippy, a bioinformatician at the National Institutes of Health and the lead developer of Canu, said in an email. "Flye is a very capable assembler both for individuals and metagenomes."
PacBio has looked at both wtdbg2 and Flye internally, according to CSO Jonas Korlach. He confirmed that both are fast but noted that speed is just one of several metrics by which to judge an assembler.
Several researchers told GenomeWeb that the De Bruijn graph assemblers showed some drawbacks, including a higher rate of misassembly for wtdbg2 and exceedingly large memory requirements for Flye, especially for larger genomes.
Korlach added that his company's internal data haven't suggested that researchers are abandoning long-read assembly projects due to a lack of budget for computational resources.
While short-read sequencing has produced highly accurate reads, assembling those reads into genomes, especially ones that account for structural variants, has been an unmet challenge. Long-read data from PacBio and Oxford Nanopore can span long repeat sections but have been more error prone.
The errors in long-read data have made assembly challenging, Pevzner said, and have inflated computing costs. "It's more difficult to see similarities" between reads, he said, and therefore more difficult to piece those reads together.
Long-read assembly algorithms have historically stopped once they encountered repeat stretches or multiple ways of assembling the genome, Pevzner said. "If they don't have information on how to go further, they stop," he said. "In our case, we don't stop. We just continue extending the genome in whatever crazy way we can."
As Pevzner and his coauthors wrote in their paper, Flye, when considering what to do next, switches "to any other overlapping read rather than a carefully chosen overlapping read," avoiding a "time-consuming test" to check if the read selection was correct.
Wtdbg2 handles the assembly by using a concept called "fuzzy De Bruijn graphs," Li said. "If you use a typical De Bruijn graph, you wouldn't allow mismatches between the two sequences. But the fuzzy graph math allows some mismatches and merges sequences together, which allows the algorithm to build the graph."
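The effect Li describes can be illustrated with a small sketch: a single substitution error removes every exact k-mer that spans it, while a comparison that tolerates a mismatch still pairs the two sequences up along their full length. This is only a toy, assuming made-up sequences and a simple Hamming-distance tolerance; it is not how wtdbg2 builds its fuzzy graph, and the function names are hypothetical.

```python
def kmers(seq, k):
    """All k-length substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def shared_exact(a, b, k):
    """Count distinct k-mers of a that also occur, unchanged, in b."""
    return len(set(kmers(a, k)) & set(kmers(b, k)))


def shared_fuzzy(a, b, k, max_mismatches):
    """Count distinct k-mers of a that match some k-mer of b within the tolerance."""
    b_kmers = set(kmers(b, k))
    return sum(
        1
        for x in set(kmers(a, k))
        if any(hamming(x, y) <= max_mismatches for y in b_kmers)
    )


# Made-up sequences: `noisy` differs from `truth` by one substitution (C -> T).
truth = "ACGTTGCAAGCT"
noisy = "ACGTTGTAAGCT"

exact = shared_exact(truth, noisy, k=5)
fuzzy = shared_fuzzy(truth, noisy, k=5, max_mismatches=1)
print(exact, fuzzy)
```

With k set to 5, the one error leaves only three of the eight k-mers matching exactly, while all eight match once a single mismatch is allowed.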
Pevzner's team conceded that "it may appear counterintuitive that inaccurate contigs constructed by [Flye] result in an accurate assembly graph." But Pevzner said the results of their benchmarking studies showed that it worked. "We do fewer errors and more contiguous assemblies," he said.
In the Nature Biotechnology paper, Pevzner's group reported that when assembling the yeast genome from PacBio data, Flye had an NG50 of 670 kb and five misassemblies, compared to 708 kb and 13, respectively, for Canu and 562 kb and 27 for Falcon, PacBio's proprietary assembler. Pevzner said Flye was also 40 percent faster than Canu. For the human genome, Flye's NG50 was 7,886 kb, compared to 3,209 kb for Canu, and the algorithm registered 879 misassemblies, compared to 1,200 for Canu.
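For readers unfamiliar with the metric, NG50 is the length of the contig at which the contigs, taken from longest to shortest, first add up to at least half of the genome size. N50 is the same calculation against the total assembly length instead. The sketch below shows the arithmetic; the contig lengths and genome size are invented for illustration and are not taken from the paper.

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the running total, longest first,
    first reaches half of genome_size. Returns 0 if it never does."""
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total * 2 >= genome_size:
            return length
    return 0


def n50(contig_lengths):
    """Same calculation, measured against the assembly's own total length."""
    return ng50(contig_lengths, sum(contig_lengths))


# Invented contig lengths (in kb) for a hypothetical 1,600 kb genome.
contigs = [400, 300, 200, 100]
print(ng50(contigs, genome_size=1600))  # 200
print(n50(contigs))                     # 300
```

Because NG50 is measured against the genome size rather than the assembly size, an assembly that leaves out part of the genome scores lower on NG50 than on N50, as in this example.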
Loman said the graph-based assemblers could do jobs that Canu couldn't even complete. "When we tried Canu on some of these data sets, we were not able to get it to complete in any reasonable time frame. Jobs were running for weeks, or even months, before they got killed," he said. "Flye we were able to get running on a single server with 48 or 96 cores, and got the results done within a day or a couple of days."
Compared to each other, wtdbg2 is "probably four to five times faster than Flye, for the human genome," Pevzner said.
But Loman, who compared the assemblers for bacterial genomes, said Flye's results "seem to be a bit better" with regard to misassemblies. He added that Flye doesn't require the user to specify lots of parameters. "Expected genome size is the main parameter that’s user configurable," he said.
Li said that Flye uses two or three times as much memory as wtdbg2. "For a plant genome that is five times larger than the human genome, wtdbg2 is probably the only assembler that can run independently on these large datasets," he said, suggesting that Flye would need too much memory to complete the task.
PacBio's Korlach predicted that over the next several years, "one or two winners" will emerge as the go-to software for long-read genome assembly.
"Where we are in the evolution of long-read assemblers is an explosion of different tools and approaches, and [De Bruijn graph assemblers are] certainly one of the branches on that tree," he said.