NEW YORK — A new algorithmic approach can quickly assemble accurate long reads into entire genomes using only the memory included on a laptop computer, according to a new study.
Long-read sequencing technologies like those from Pacific Biosciences and Oxford Nanopore Technologies can generate terabytes of sequence data. De novo assembly of these reads can be resource-intensive, demanding both substantial computing time and large amounts of memory.
Researchers from the Massachusetts Institute of Technology and the Institut Pasteur have developed a new approach that uses minimizer-space de Bruijn graphs, or mdBGs, to assemble genomes from long reads. With this method, they assembled a human genome in about 10 minutes using eight cores and 10 gigabytes of random access memory, as they reported in Cell Systems on Tuesday. They could also quickly construct an index of a large collection of bacterial genomes, which they then searched for antimicrobial resistance genes, illustrating how being able to process sequencing data quickly could enable personalized medicine.
"Until this work, a single human genome assembly took days and hundreds of gigabytes of memory, which is a significant obstacle towards personalized medicine," co-corresponding author Bonnie Berger from MIT said in an email. "Our method mdBG reduces the computational resources to minutes on a personal computer — two orders of magnitude faster than existing methods."
MdBG relies on minimizers, which represent short stretches of nucleotide sequence, rather than on single nucleotides. As a result, mdBGs store only a small fraction of the total number of nucleotides while leaving the assembled genome sequence unaffected.
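To make the idea of working in minimizer space more concrete, here is a minimal Python sketch. It is not the authors' rust-mdBG code. It assumes one simple selection scheme (keep every l-mer whose hash falls below a density threshold) and treats each run of k consecutive minimizers as a graph node. The function names, the BLAKE2 hash, and the parameters `l`, `k`, and `density` are illustrative assumptions, and reverse-complement handling is left out.

```python
import hashlib


def hash_lmer(lmer: str) -> int:
    """Deterministic 64-bit hash of an l-mer (the hash choice is an assumption)."""
    digest = hashlib.blake2b(lmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def select_minimizers(seq: str, l: int, density: float) -> list:
    """Keep, in read order, every l-mer whose hash is below density * 2**64."""
    threshold = int(density * 2**64)
    return [
        seq[i:i + l]
        for i in range(len(seq) - l + 1)
        if hash_lmer(seq[i:i + l]) < threshold
    ]


def build_mdbg(reads: list, l: int, k: int, density: float) -> dict:
    """Build a de Bruijn graph whose 'letters' are minimizers, not nucleotides.

    Each node is a tuple of k consecutive minimizers from a read. An edge joins
    two nodes that follow each other in a read, so they overlap by k-1 minimizers.
    Hypothetical sketch: reverse complements and sequencing errors are ignored.
    """
    edges = {}
    for read in reads:
        mins = select_minimizers(read, l, density)
        nodes = [tuple(mins[i:i + k]) for i in range(len(mins) - k + 1)]
        for node in nodes:
            edges.setdefault(node, set())
        for a, b in zip(nodes, nodes[1:]):
            edges[a].add(b)
    return edges


if __name__ == "__main__":
    reads = ["ACGTACGGTCAGTTAGC", "TACGGTCAGTTAGCCAT"]
    graph = build_mdbg(reads, l=4, k=2, density=0.5)
    print(len(graph), "nodes in minimizer space")
```

With a low density, only a small share of l-mers is kept, so the graph has far fewer nodes than a nucleotide-level de Bruijn graph built from the same reads, which is the source of the memory savings described above.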
The researchers applied their approach to assemble PacBio long reads from Drosophila and humans and compared the performance of mdBG to that of other assemblers, including HiCanu, Hifiasm, and Peregrine.
For Drosophila, rust-mdBG — the approach is written in the Rust language — assembled the genome in one minute and nine seconds and used 1.5 GB of memory. Peregrine, by contrast, took 40 minutes and 11 seconds and used 12 GB of memory.
Meanwhile, for a human assembly, rust-mdBG took 10 minutes and 23 seconds and needed 10 GB of memory, compared to 14 hours and eight minutes and 188 GB for Peregrine.
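The reported runtimes and memory figures can be turned into ratios with simple arithmetic. The Python sketch below does this using only the numbers quoted above for rust-mdBG and Peregrine; the helper names are illustrative.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert a wall-clock duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


# (runtime in seconds, peak memory in GB), as reported in the article.
REPORTED = {
    "Drosophila": {
        "rust-mdBG": (to_seconds(minutes=1, seconds=9), 1.5),
        "Peregrine": (to_seconds(minutes=40, seconds=11), 12.0),
    },
    "human": {
        "rust-mdBG": (to_seconds(minutes=10, seconds=23), 10.0),
        "Peregrine": (to_seconds(hours=14, minutes=8), 188.0),
    },
}


def ratios(genome):
    """Return (speedup, memory reduction) of rust-mdBG relative to Peregrine."""
    fast_time, fast_mem = REPORTED[genome]["rust-mdBG"]
    slow_time, slow_mem = REPORTED[genome]["Peregrine"]
    return slow_time / fast_time, slow_mem / fast_mem


if __name__ == "__main__":
    for genome in REPORTED:
        speedup, mem_saving = ratios(genome)
        print(f"{genome}: {speedup:.1f}x faster, {mem_saving:.1f}x less memory")
```

On these figures, rust-mdBG comes out roughly 35 times faster than Peregrine for Drosophila and roughly 82 times faster for the human assembly, with memory use about 8 and 19 times lower, respectively.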
"Beyond genome assembly, our mdBGs can also be used to search for antimicrobial resistance genes very efficiently across huge collections of bacterial genomes, which is key for personalized antibiotic therapy," Institut Pasteur's Rayan Chikhi, the other corresponding author, added.
For instance, the researchers applied mdBG to construct an index for a collection of 661,405 bacterial genomes, a process that took three hours and 50 minutes and needed 58 GB. They further queried the pangenome graph for the presence of antimicrobial resistance genes, which took about 12 minutes, rather than seven hours with other approaches, and used less than 1 GB of memory.
Currently, the approach works best with PacBio reads, which have very low error rates, the authors noted. They expect it will soon be able to handle Oxford Nanopore reads as well.
Berger and Chikhi added that they plan to further develop their approach, for example, to resolve entire chromosomes without gaps. "Thinking more broadly, we envision reaching out to field scientists and to help them develop fast genomic testing sites, going beyond PCR and marker arrays which might miss important differences between genomes," they said.