This story has been updated to remove inaccurate information about the assembly method used in the construction of the CHM13 reference genome.
CHICAGO – Nearly two years after the unveiling of the first complete, telomere-to-telomere human genome, researchers at the National Human Genome Research Institute (NHGRI) and their colleagues have produced the most automated genome assembler to date to assist in the building of future complete sequences.
The team, led by NHGRI researchers Adam Phillippy and Sergey Koren, described the algorithm, called Verkko, in a paper published last week in Nature Biotechnology, following a preprint posted last summer.
Verkko builds on the work of the Telomere-to-Telomere (T2T) Consortium, which relied on manual integration of ultralong Oxford Nanopore Technologies (ONT) sequencing reads with a high-resolution assembly graph built from long, accurate Pacific Biosciences (PacBio) high-fidelity (HiFi) reads.
The algorithm followed the T2T blueprint of assembling HiFi data to make a "high-quality" graph, but then innovated by automating the addition of ultralong ONT data to address some of the ambiguities from repeats and to allow for haplotype phasing, according to Phillippy, head of genome informatics at NHGRI, a unit of the US National Institutes of Health, and co-chair of the T2T Consortium.
According to Koren, associate investigator in genome informatics at NHGRI, Verkko is suitable for anyone interested in studying complex regions of the genome across large numbers of samples.
Phillippy said that the moderate-length PacBio HiFi reads are "very accurate," able to call effectively all heterozygous variation in the genome. According to the paper, Verkko should also be able to use ONT duplex sequencing reads instead, which Phillippy called "kind of a HiFi analog."
Ultralong read data does not have to be as accurate. "It just needs that length to start connecting distant SNPs together to help with the phasing to help resolve how many repeat copies are in this repeat array," Phillippy said. "It's the high accuracy plus the long length … that's been the key, in our experience."
As described in the paper, the Verkko algorithm starts from a multiplex de Bruijn graph built from PacBio HiFi reads and progressively simplifies this graph by integrating ultralong ONT reads and haplotype-specific markers. "The result is a phased, diploid assembly of both haplotypes, with many chromosomes automatically assembled from telomere to telomere," the NHGRI team wrote.
"With its ability to resolve complete haplotypes, Verkko ushers in a new era of comprehensive genomic analysis … [with] direct application for the construction of new reference genomes, and, ultimately, to better understanding of the relationships between large, complex structural variation, phenotype, and disease."
Verkko indexes reads with what are known as minimizers, representative k-mers that group similar reads together.
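As a rough illustration of the idea, the sketch below implements the generic window-minimizer scheme: in every window of w consecutive k-mers, the lexicographically smallest k-mer is kept as a representative. This is not Verkko's code; the function names and the values of k and w are assumptions chosen for readability.

```python
def minimizers(seq, k=5, w=4):
    """Return the set of (position, k-mer) minimizers of seq.

    Illustrative only: plain lexicographic ordering, hypothetical k and w.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    # Slide a window over w consecutive k-mers and keep the smallest one.
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() returns the leftmost smallest k-mer when there are ties.
        offset = min(range(w), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return picked


def shared_minimizers(a, b, k=5, w=4):
    """K-mers selected as minimizers in both sequences."""
    in_a = {kmer for _, kmer in minimizers(a, k, w)}
    in_b = {kmer for _, kmer in minimizers(b, k, w)}
    return in_a & in_b


if __name__ == "__main__":
    overlap = "ACGTTGCATGCA"
    read_1 = "TTTTT" + overlap
    read_2 = overlap + "GGGGG"
    # Reads that overlap by at least w + k - 1 bases share a minimizer.
    print(shared_minimizers(read_1, read_2))
```

Because two reads that share a stretch of at least w + k - 1 bases are guaranteed to select at least one identical minimizer, comparing the small set of minimizers is enough to find candidate overlaps without comparing every k-mer.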
The developers then added to the de Bruijn graph an algorithm called GraphAligner, which eliminates some of the noise in ultralong Oxford Nanopore reads to find the best path for assembly. Mikko Rautiainen, first author of the Nature Biotechnology paper, developed GraphAligner at the Institute for Molecular Medicine Finland before starting a postdoctoral fellowship at NHGRI.
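To make the graph terminology concrete, here is a minimal sketch of a plain, fixed-k de Bruijn graph. It is deliberately simpler than the multiplex graph the paper describes and is not taken from Verkko or GraphAligner; the function names are hypothetical. It shows how a repeated sequence creates a branch, the kind of ambiguity that longer reads can help resolve.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a simple de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-base prefix to its (k-1)-base suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one outgoing edge, i.e. ambiguous continuations."""
    return sorted(node for node, nexts in graph.items() if len(nexts) > 1)


if __name__ == "__main__":
    # "AC" occurs twice in this toy read, so the graph forks at that node.
    graph = de_bruijn(["ACGACT"], k=3)
    print(dict(graph))
    print(branching_nodes(graph))
```

In this toy case the node "AC" has two possible continuations. A read long enough to span the whole repeat indicates which order the branches should be taken in, which is the role the article ascribes to the ultralong ONT data.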
With Verkko, the NHGRI team was able to produce complete assemblies of 20 of the 46 human chromosomes. While they eventually want to cover all 46, Phillippy said that completeness depends on sequencing coverage, and that this is about as well as they can do with current long-read technology.
With typical HiFi coverage of about 40X and ONT coverage of 10X to 20X, NHGRI was able to produce about 20 contigs with Verkko alone.
For its CHM13 human reference genome and in the subsequent application of Verkko, the T2T Consortium had to resort to manual assembly for the five acrocentric chromosomes in humans. "There are some fundamental limitations in that there are things that we can't resolve even with ONT data, like the [ribosomal] DNA arrays on the short acrocentric chromosomes," Phillippy said.
HiFi sequencing also still has some biases and gaps, he said, and the automated assemblies were incomplete even though NHGRI tried to compensate with ONT data.
"It's a matter of the sequencing tech continuing to improve and improving some of their biases," Phillippy said, though he admitted that development needs to continue on algorithms like Verkko to make sense of what he called the "last, most complex" genomic region, ribosomal DNA.
"I think that the algorithms currently are squeezing nearly all of the information out of the data that they can," Phillippy said. "I think the path to the fully automated T2T genomes will really come with future upgrades to the sequencing tech."
Oxford Nanopore's ligation sequencing kit version 14, also called kit 14, "looks really, really promising" for improving accuracy and reducing some sequencing biases, he noted.
Human Pangenome Reference Consortium (HPRC) researchers last year published the results of their evaluation of about two dozen human genome assembly methods. Hifiasm, an assembly method for PacBio HiFi data, was the "clear winner," Erich Jarvis, a researcher at Rockefeller University, told GenomeWeb at the time, but he also hinted that Phillippy and colleagues at NHGRI were close to a breakthrough.
Verkko goes beyond Hifiasm by adding the Oxford Nanopore data.
The original T2T Consortium project used methods that "weren't really fit for consumption for a general audience," Phillippy said, largely because much of the assembly had to be done manually at NHGRI. "What Verkko did was take all of those lessons learned from that project and put it into an automated assembly workflow that roughly follows that same idea," he added.
While the NHGRI researchers said Hifiasm has recently been updated to also incorporate ultralong-read data from ONT, Verkko is the first assembler with this capability to be published in an academic journal, according to Phillippy.
Mark Chaisson, a quantitative and computational biologist at the University of Southern California, noted that the initial T2T Consortium assembly was not scalable or repeatable because of the manual intervention required. "Here we have a set of well-executed steps that essentially build and extend existing methods to do near T2T assembly in kind of a push-button fashion," he said of Verkko and the Nature Biotechnology paper.
Chaisson said that Verkko still needs to address "a few spurious gene duplications" before it is ideal for his own work in profiling genetic variation in population studies. However, he believes it is still a useful tool when paired with Purge_dups, an algorithm developed by Chinese and British bioinformaticians to remove extra, erroneous duplications.
"[Purge_dups], I think, will have a larger effect in terms of our ability to use these genomes than, say, the missing 26 fully assembled chromosomes," he said.
Chaisson, who is not a member of the T2T Consortium, expects to use Verkko in combination with Purge_dups for exploring genes with copy number variants, though he said the cost of using two types of high-coverage long-read sequencing is still too high for population-scale research projects.
Phillippy said that early adopters of Verkko are researchers who are interested in various kinds of repeats, including satellite repeats, segmental duplications, and rDNA arrays — particularly recent duplications on an evolutionary scale. He explained that more recent duplications have not had time to develop enough random mutations to make them "unique relative to the rest of the genome," making them more difficult to assemble.
According to him, it will take years for T2T genomics to make its way into clinical practice. "It basically requires a rebooting of all of the genomics we've done over the past two decades, doing it now again with long-read assays that can tap into these difficult-to-sequence regions of the genome," he said.
He expects the Verkko algorithm to support future telomere-to-telomere work on animal and plant genomes, based on feedback he has received from other research communities, including those for model organisms such as mice, zebrafish, and fruit flies. He also reported hearing from livestock and agricultural researchers.
"In the [Human Pangenome Reference Consortium], obviously, we'd like to get lots of human genomes done, but … other organism communities now all want their own T2T reference," he said. "Now we can give them the recipe."
Koren said that Verkko development and refinement are continuing, and NHGRI is now working closely with both PacBio and ONT, as well as with the HPRC. "If there's something that Verkko doesn't finish automatically, we essentially dig into every one of those," he explained.
"Sometimes it's data deficiency, sometimes there is a bug in the algorithm, or something we didn't think of. Those all get fed back, and we continuously make updates and releases so that the more and more genomes we do, the better we get," Koren said.
Phillippy said that it would take about a year or two for the long-read sequencing companies to "fix all of those last few weaknesses in the technologies," which will then make Verkko and similar algorithms more powerful.