NEW YORK – Researchers from the Human Pangenome Reference Consortium (HPRC) are closing in on which sequencing and bioinformatics methods they'll use to obtain the hundreds of genomes required for their project.
In doing so, they have assembled "one of the most complete diploid genomes to date, with roughly four gaps per chromosome on average," said Erich Jarvis, a researcher at Rockefeller University whose lab is co-leading HPRC's method evaluation efforts.
"These gaps are mostly in the centromeres and telomeres, repetitive regions that are hard to assemble. Although this will not be the final assembly for us and the community, it is near telomere-to-telomere and helped determine what algorithms and data type developments were needed for us to go on," he said.
Over a period of nine months, HPRC researchers evaluated about two dozen assembly methods, with the goal of capturing the full-length sequence of each of the 46 individual chromosomes of human somatic cells. They published the results of their bake-off on Wednesday in Nature.
"The clear winner for the graph-based approach in this paper was the hifiasm algorithm on Pacific Biosciences HiFi data," Jarvis said. This assembly was generated by a group led by Heng Li, a computational biologist at Harvard University.
A final diploid reference assembly was created using PacBio HiFi reads, Oxford Nanopore Technologies ultra-long reads, Bionano Genomics optical maps, and Hi-C data.
In addition to detailing the assembly method, the researchers presented data on new biology revealed by that assembly, including a finding that half of the genetic diversity within a given cell can be found in the centromeres.
The study provides an "excellent and very thorough" illustration of the capabilities and benefits of the telomere-to-telomere approach, according to Pavel Pevzner, a professor at the University of California, San Diego and an expert on graph-based genome assembly algorithms. Pevzner was not involved in the study but is on the HPRC's advisory board.
"However, even with multiple technologies, it required some manual work," he noted. "If the goal is to bring 'complete genomics' to every lab, the next step is to turn this process into a push-of-a-button procedure, exclude the manual analysis, and possibly minimize the number of additional technologies with poorly understood sources of errors."
For the study, the HPRC team submitted assemblies of HG002, a sample collected as part of a trio under the Personal Genome Project, an effort to create publicly shared genomic datasets founded by Harvard University professor George Church.
The work is one of the last steps before the group digs into its goal of sequencing every chromosome from tip to tail in 350 human genomes in order to build a graph-based reference genome. The HG002 assemblies also build on the recent success of the Telomere-to-Telomere consortium's complete assembly of the haploid genome of a complete hydatidiform mole cell line.
"I think it's underappreciated that the genomes we've been generating are still incomplete," said Giulio Formenti, a postdoc in Jarvis' lab and one of the lead authors of the new study.
"Everyone is walking around with two genomes in their cells, so when you say 'a genome of an individual,' you're really talking about two genomes," Jarvis added. "Those two genomes can be quite different."
Existing genome assemblies have "smushed" those haplotypes together, mostly by necessity, because separating, or phasing, them is a huge challenge.
"It's like trying to put together two nearly identical jigsaw puzzles," Jarvis said. As with puzzle pieces, one could try to assign reads to one set or the other first. Alternatively, one could begin assembling pieces and check their phasing later.
Jarvis attributes the success of Li's approach to its phasing of the haplotypes "simultaneously within the assembly graph as opposed to before or after." This made good assemblies even better.
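The first strategy in the jigsaw analogy, sorting reads by haplotype before assembling them, can be illustrated with a toy sketch. It assumes parental sequences are available, as in a trio, and assigns each read to whichever parent contributes more parent-specific k-mers. The sequences, the value of k, and the function names are all hypothetical, and this is not how hifiasm phases, since it works within the assembly graph.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_read(read, maternal_only, paternal_only, k):
    """Assign a read to a parent by counting parent-specific k-mers."""
    read_kmers = kmers(read, k)
    maternal_hits = len(read_kmers & maternal_only)
    paternal_hits = len(read_kmers & paternal_only)
    if maternal_hits > paternal_hits:
        return "maternal"
    if paternal_hits > maternal_hits:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


# Toy parental sequences that differ at a single position (hypothetical data).
mother = "ACGTACGGTCA"
father = "ACGTACTGTCA"
k = 4

mother_kmers = kmers(mother, k)
father_kmers = kmers(father, k)
maternal_only = mother_kmers - father_kmers  # k-mers seen only in the mother
paternal_only = father_kmers - mother_kmers  # k-mers seen only in the father

reads = ["TACGGTC", "TACTGTC", "ACGTAC"]
labels = [bin_read(r, maternal_only, paternal_only, k) for r in reads]
print(labels)
```

The third read covers only sequence shared by both parents, so it cannot be assigned. This illustrates why nearly identical haplotypes are hard to separate with short or uninformative reads.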
Overall, the paper "sets the mark for what people should be aiming for with future assemblies," said Aaron Wenger, a bioinformatician at PacBio and a coauthor of the paper. The top-quality assemblies in the paper all use HiFi reads, he noted, suggesting it was "a key data type to get to this level of quality."
"The first set of genomes for the pangenome generated from 2020 up until today have been PacBio HiFi based," Jarvis noted.
He said it is hard to estimate the cost of the final diploid genome, as it took many iterations. "But, if we had to start over again, and did the same process with what we know now, my estimate would be under $20,000," he said.
The newly assembled genome also yielded new biological findings. Earlier phased genomes had indicated that about 2.1 percent of bases differed between the two haplotypes. In this sample, that figure grew to 3.3 percent, including approximately 2.6 million single-nucleotide variants; 631,000 small structural variants (SVs); and 11,600 SVs greater than 50 base pairs. The extra variation was found mostly in repetitive areas, such as centromeres.
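To put those percentages in absolute terms, here is a rough back-of-the-envelope calculation. It assumes a haploid genome length of about 3.1 billion base pairs, a round figure not taken from the article, and treats the percentages as fractions of that length.

```python
# Assumed haploid genome length in base pairs (illustrative round number).
GENOME_SIZE_BP = 3_100_000_000


def differing_bases(per_thousand, genome_size=GENOME_SIZE_BP):
    """Bases that differ between haplotypes, given a rate in parts per thousand."""
    return per_thousand * genome_size // 1000


earlier = differing_bases(21)  # 2.1 percent, the earlier estimate
updated = differing_bases(33)  # 3.3 percent, reported for this sample
extra = updated - earlier

print(f"earlier estimate: {earlier:,} bp")
print(f"this sample:      {updated:,} bp")
print(f"additional:       {extra:,} bp")
```

Under that assumption, the jump from 2.1 to 3.3 percent corresponds to roughly 37 million additional differing bases.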
"That's an incredible amount of untapped diversity in the human genome or any animal genome," Jarvis said.
The team also found large differences in the number of gene duplications between haplotypes. Some of those duplications exist only in primates and are of genes that are highly expressed in the brain.
"It makes you think: If this is a primate-specific duplication and expressed in the brain, is it affecting people’s brains differently? Is it somehow affecting how the brain functions differently in different people?" said Jarvis, who also studies the molecular and genetic basis of spoken language.
The authors described their method as "semi-automated," because the final product required some manual error correction. "We could have avoided doing that, but the result wouldn't be as good," Formenti said. For now, there's a trade-off between generating the best possible assembly and having a fully automated process.
The automated part of the assembly took a few days, but the manual error correction took several people multiple weeks. The current method is also reliant on trio data to phase the haplotypes, which triples the amount of sequencing required. However, Jarvis and Formenti suggested that the field is close to being able to assemble genomes of this quality, or better, without manual labor or trio sequencing.
Jarvis noted that Li's group has used Hi-C chromatin conformation capture data for phasing and that Adam Phillippy's group at the National Human Genome Research Institute (NHGRI) has similarly used ultra-long reads from Oxford Nanopore Technologies' platform.
"I think we’re close to being able to use that info to separate [haplotypes] without parental data," Formenti said.
Of course, better sequencing data would also help. "If you get perfect reads, you don’t have to correct them," Formenti said. "And with very long reads, there's less to assemble."
Both Jarvis and Formenti were hopeful that the HPRC would soon be generating telomere-to-telomere assemblies. "With each iteration, we’re getting better and better," Jarvis said.