NEW YORK – Newly released algorithms can assemble de novo human genomes from long-read sequencing data in just a few hours.
Shasta, an in-memory computing-driven algorithm developed by researchers at the Chan Zuckerberg Initiative (CZI) and tested by researchers from the University of California, Santa Cruz, can complete a de novo human genome assembly in under six hours, the authors wrote, for an average cost of $70 per sample.
Using reads generated by the Oxford Nanopore Technologies PromethION sequencing instrument, the researchers were able to create "near chromosome-level" scaffolds for eleven genomes. While Shasta produced less-contiguous assemblies (contig N50s between 19.3 and 37.8 megabases) than some other long-read assemblers, it had fewer misassemblies, the authors wrote. They posted their study to bioRxiv on July 26.
And earlier in July, two Pacific Biosciences veterans, now working independently, described Peregrine, an assembler that uses an indexing scheme to assemble reads that meet certain accuracy and length requirements. Using previously generated datasets of PacBio long reads, the authors reported that they were able to assemble a genome with 30x coverage in 100 minutes of wall-clock time. The N50 was greater than 20 megabases. They also posted a preprint to bioRxiv.
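The N50 figures quoted for both assemblers follow the usual convention: N50 is the contig length at which contigs of that length or longer account for at least half of the assembled bases. A minimal sketch of that calculation in Python, where the function name and the sample lengths are illustrative assumptions rather than anything taken from either preprint:

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bases)."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Made-up contig lengths, in megabases, purely for illustration.
example_lengths = [10, 8, 5, 4, 3]
example_n50 = n50(example_lengths)
```

A higher N50 means the assembly is broken into fewer, longer pieces, which is why it serves as a shorthand for contiguity.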
Developers for both algorithms said they hoped their assemblers could increase the pace of genomic research and help researchers find new structural variants.
"Shasta and other tools are cheap and quick, designed with the intent to be on the cloud," said Benedict Paten, a computational geneticist at UC-Santa Cruz and an author of the Shasta preprint. "They really give us the power to scale out nanopore sequencing. We're easily talking about assembling hundreds of de novo genomes in the next couple years."
The developers said that hundreds of users had downloaded the software from GitHub, where both algorithms are available. At least one researcher has already gotten good results using Shasta.
"For me this was crazy fast," said Robin Buell, a plant genomics researcher at Michigan State University, who used Shasta to assemble an Arabidopsis genome in just over 40 minutes. Using Canu, the previous best-in-class assembler, it would have taken more than four days, she said. Moreover, getting it installed wasn't challenging. "Not that it was trivial to do, but it was trouble-free," she said.
Shasta and Peregrine are the latest entrants to the field of genome assemblers that use long-read sequencing data. Canu, developed by bioinformaticist Adam Phillippy, of the National Human Genome Research Institute, was the first assembler for long reads. Based on the Celera Assembler for Sanger sequencing, it was published in 2017. Earlier this year, two groups revealed long-read assemblers based on de Bruijn graphs: Flye and wtdbg2, also known as Redbean.
But there is always room for improvement.
In 2018, Paten led researchers in assembling a de novo human genome using nanopore data, work that was published in Nature Biotechnology. Canu was the only choice, so the group was forced to use a large computing cluster to supply the thousands of hours of compute time required, a process that took weeks.
"When I saw that last year, I thought there must be a better way," said Bruce Martin, director of engineering at CZI. Developing a new assembler fit into the organization's goals of collaborating with scientists and building open-source tools to make processes better, faster, and cheaper, he said.
Martin enlisted Paolo Carnevali, a software engineer at CZI, who previously worked at Complete Genomics. They started in mid-2018 and "combined a set of novel algorithmic approaches with practical engineering enabled by commodity computer hardware," Martin said.
One of the keys to Shasta's performance is using large-memory machines, with more than 1 terabyte of RAM. "It works entirely in memory," Carnevali said. "If you do this, you never go to the disk and you never wait for data, so everything is faster." The algorithm also uses multithreading wherever possible, he added, with work divided among threads dynamically. "It allows you to keep your CPU utilization very high," he said.
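The idea of dividing work dynamically can be illustrated with a short sketch. This is not Shasta's code, and the preprint's actual implementation is not described here; the function name, batch size, and toy workload are assumptions chosen for illustration. All data stays in memory, and each thread claims the next unprocessed batch as soon as it finishes its current one, so no thread sits idle while others still have work queued:

```python
import threading


def run_dynamic(items, process, num_threads=4, batch_size=2):
    """Apply `process` to every item, letting idle threads claim the next batch."""
    results = [None] * len(items)
    next_index = 0
    lock = threading.Lock()

    def worker():
        nonlocal next_index
        while True:
            # Claim the next batch under a lock; the work itself runs unlocked.
            with lock:
                start = next_index
                next_index += batch_size
            if start >= len(items):
                return
            for i in range(start, min(start + batch_size, len(items))):
                results[i] = process(items[i])

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Toy workload: compute the length of each in-memory read.
read_lengths = run_dynamic(["ACGT", "GG", "A"], len)
```

In Python the global interpreter lock limits true CPU parallelism for pure-Python work, so the sketch shows only the scheduling pattern, not the speedup a compiled implementation would see.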
Peregrine's speed likewise comes from a new computational approach.
"Genome assembly has always been analogous to people solving a jigsaw puzzle," said Jason Chin, one of the developers. "The old way to do it was to literally look at every piece and compare it to each other piece. Our approach is more like how humans solve it. We match the color or pattern and put them in piles first, so we can reduce the search space by using similar features of the read."
The algorithm accomplishes that by indexing reads using a minimizer, or a k-mer that groups similar reads together.
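To make the piles-first idea concrete, here is a minimal sketch of minimizer-based grouping. It uses a generic textbook scheme (the lexicographically smallest k-mer in each window of w consecutive k-mers) and is not Peregrine's actual indexing method; the function names, the tiny k and w values, and the toy reads are all assumptions for illustration:

```python
from collections import defaultdict


def minimizers(seq, k=3, w=4):
    """Return the set of smallest k-mers, one per window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) < w:
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def build_index(reads, k=3, w=4):
    """Map each minimizer to the IDs of the reads that contain it."""
    index = defaultdict(set)
    for read_id, seq in reads.items():
        for m in minimizers(seq, k, w):
            index[m].add(read_id)
    return index


def candidate_pairs(index):
    """Only reads that share a minimizer are compared, shrinking the search space."""
    pairs = set()
    for ids in index.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs


# Toy reads: r1 and r2 overlap; r3 is unrelated.
toy_reads = {"r1": "ACGTACGTAC", "r2": "GTACGTACGG", "r3": "TTTTTTTTTT"}
toy_pairs = candidate_pairs(build_index(toy_reads))
```

With the toy reads above, only r1 and r2 land in the same pile, so r3 is never compared against either of them. Real assemblers use much larger k and w and typically hash the k-mers rather than compare them as strings.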
So far, Peregrine has only been run on Amazon Web Services cloud instances. "We haven't tried this, but it may be possible to just buy a high-end computer and do it on your own," Asif Khalak, Chin's codeveloper, said.
Chin noted that Peregrine was designed specifically for reads meeting certain specifications: 99 percent read accuracy and a length of 10 kilobases or longer. "If you're too short, you won't get enough index," he said, and it can't handle reads with more than 3 or 4 percent error.
Another thing that might work, but that they haven't tested: polishing reads before assembly. "There's no reason why it wouldn't work," Khalak said. He suggested that Peregrine itself could be used for polishing reads prior to assembly.
The assemblers could enable numerous applications. Neither assembler was specifically designed for human genomes, so de novo sequencing of other species without a reference genome is the most straightforward application. Buell's lab has begun working on a lavender (Lavandula angustifolia) genome assembly, though that species has already been sequenced.
Variant detection in cancer genetics is another potential application, especially for larger variants. Martin suggested de novo nanopore sequencing "will give us the microscope we need to look at structural variants and hopefully help understand that process and the clinical implications of them."
And both sets of developers are cognizant that there could be new applications not yet thought of. CZI has made Shasta available under a permissive, MIT-style open source license. "Effectively that means anyone can take the source code and do whatever they want," Martin said. "They can fork and modify it and even commercialize it."
Khalak said that Peregrine is just the start for the fledgling Foundation for Biological Data Science, which he recently launched with Chin. "When you have dramatic changes in speed and cost, it's going to open up new possibilities," he said. "Part of our mission is to think about what that enables. We're trying to explore that."