NEW YORK – Researchers at the University of California, Riverside, and the Agricultural Genomics Institute at Shenzhen, China, have assessed the most popular assembly tools for high-fidelity data from Pacific Biosciences long-read sequencers, testing them on data for eukaryotic genomes and metagenomic datasets, both real and synthetic.
They concluded that Hifiasm, a tool developed by Dana-Farber Cancer Institute informatician Heng Li and colleagues that has been an important asset for the Human Pangenome Reference Consortium, is the best of the bunch for the plant genomes they tested. However, the US National Institutes of Health's HiCanu and the University of California, San Diego-developed HiFlye also performed well with synthetic data.
While others showed strengths, they often required more computing power than the highest-performing assemblers, according to a preprint recently posted to bioRxiv. And despite HiFlye's strong showing with synthetic datasets in terms of contiguity, completeness, and accuracy, it "failed" some tests with real sequences, according to the researchers.
Lead author Weihua Pan and colleagues benchmarked 11 different HiFi assembly tools. HiCanu, Hifiasm, HiFlye, MECAT2, Miniasm, NextDenovo, Shasta, Peregrine, and Verkko were all tested on eukaryotic genomes, and the researchers benchmarked Hifiasm-meta and MetaFlye with metagenomic data. "On metagenomic datasets, Hifiasm-meta demonstrated a clear advantage over other assemblers," they wrote.
PacBio CSO Jonas Korlach said in an email that there is definitely room for improvement among HiFi assemblers for metagenomic applications. "Metagenome assembly is one of the most challenging assembly problems, due to the presence of multiple species with uneven (and unknown) abundances, and conserved genomic regions that are shared across species and strains," he said.
In all cases, the UCR-Shenzhen team evaluated how the assembly tools performed across variables including sequencing coverage, heterozygosity, and ploidy.
The paper is meant as a reference for researchers trying to pick the right assembler for their HiFi sequencing project and to inform future improvements to long-read genome and metagenome assembly. "I can know that HiCanu and Hifiasm are really, really good compared to the other seven, so you don't need to consider seven," said UC-Riverside computational biologist Stefano Lonardi, a co-author on the preprint.
One caveat: Hifiasm was "least sensitive to the sequencing coverage," according to the paper, but only with coverage of at least 20X. "It clearly ranked first in the overall performance," the authors wrote.
The paper is not meant to be a "comprehensive" assessment of HiFi assembly tools but included the most popular assemblers that are currently maintained, according to Lonardi. He said that benchmarking assemblers became necessary because of the growth of PacBio HiFi sequencing. HiFi has become his sequencing technology of choice because of the high quality and length of its reads, which he claimed make assembly easier than with other types of reads.
"Essentially, the errors are uniformly distributed and there is not very many of them," he explained. "It's easier to assemble them because you don't need to do error correction."
Lonardi tends to pair HiFi reads with optical mapping, but he still needs to choose an assembler, which he said becomes more difficult when working with nonhuman genomes.
The researchers tested the assemblers on three plant genomes: homozygous diploid rice, heterozygous diploid potato, and autotetraploid wax apple. They augmented the actual sequence data with synthetic data that Pan created for control purposes. Lonardi said that some of Pan's collaborators in China are working on a separate paper on the complex genome of the wax apple.
Assembler choice could depend on genome size, repetitive content, and ploidy, according to Lonardi, and existing assessments have been spotty. "We felt there was a need to do something much more comprehensive on real data and on synthetic data," he said.
He noted that metagenomic assembly is a newer domain. "It's actually a little harder to know exactly what you're expecting because you have a sample which contains thousands of genomes, and the tools are definitely less developed," he said. "We designed some metrics to measure how good the assembly is, but honestly, we don't have the ground truth there."
According to PacBio's Korlach, the ideal HiFi metagenome assembler "would recover complete genomes for all species with sufficient coverage, including full strain-level resolution, and including their associated epigenome (methylation), and with information of the complete sequences of the specific plasmids and/or bacteriophages that the different bacteria harbor." Current offerings fall short when it comes to linking plasmids and bacteriophages to host bacteria, he said.
Korlach noted that the assembler review does not include MetaMDBG, a new metagenomic assembler from researchers at NIH, the Earlham Institute in the UK, and the Pasteur Institute in France that was described in a preprint released last week.
Through their preprint, the UCR-Shenzhen researchers have made their data publicly available. "We hope that these datasets could be something of a benchmark to develop new methods," Lonardi said.
One of the assemblers that performed well, Verkko, builds on the T2T Consortium project, which integrated ultralong Oxford Nanopore sequencing reads with a high-resolution assembly graph built from PacBio HiFi reads. Researchers at the National Human Genome Research Institute (NHGRI) and their colleagues described that assembler in a paper that appeared in February.
Also, Human Pangenome Reference Consortium (HPRC) researchers last year published the results of their evaluation of about two dozen diploid human genome assembly methods, and Hifiasm was the "clear winner," Erich Jarvis, a neurogenetics researcher at Rockefeller University who led that work, told GenomeWeb at the time.
This week, Jarvis said in an email that Pan and colleagues reached similar conclusions to his own group, finding that Hifiasm did the best overall job with HiFi assembly. However, the UCR-Shenzhen team did not cite the work by Jarvis.
Jarvis noted that HiFi data often contains gaps in GA-rich regions, which can be filled in with Oxford Nanopore sequencing data, even though the latter tends to be less accurate overall.
Hifiasm and Verkko combine PacBio HiFi, Oxford Nanopore ultralong, and Hi-C sequencing data. "The reason why these assemblers work better than all others is because they include all data types in the assembly graph at the same time, instead of sequentially, and they phase the haplotypes in the assembly graph, reducing errors," Jarvis said.
Lonardi would like future research to examine the use of genome maps in the assembly process. "It would be nice to have a comprehensive assessment of … the pros and cons of different maps in the context of genome assembly. That's something that I don't think is out there yet," he said.