Draft Human Pangenome Reference Shows the Way to Capturing More Human Diversity


NEW YORK – A recently completed draft human pangenome reference aims to be the first step toward a reference genome that is not only more complete but also better reflects human diversity.

The current human reference genome, GRCh38, is a mishmash of sequences from different individuals, though about 70 percent of the sequence comes from just one person. "Obviously one human can't represent all the variation in humans," said Benedict Paten, an associate professor at the University of California, Santa Cruz, and a member of the Human Pangenome Reference Consortium.

Instead, the consortium plans to generate 700 reference-quality haplotypes from 350 individuals, maximizing genomic and geographical diversity.

The idea of a pangenome reference that encompasses a wider range of human diversity is appealing, according to Fritz Sedlazeck, an associate professor at Baylor College of Medicine, who was part of the Telomere-to-Telomere Consortium that recently generated a continuous haploid human genome sequence, T2T-CHM13. He adds that a diverse pangenome reference could uncover genomic regions or even genes that are not represented in GRCh38.

So far, the Human Pangenome Reference Consortium has generated a draft reference of 94 de novo haplotype assemblies from 47 individuals. As the researchers reported in a preprint posted to bioRxiv in July, they generated these assemblies using a combination of Pacific Biosciences high-fidelity and Oxford Nanopore long-read sequencing, Bionano Genomics optical maps, and high-coverage Hi-C Illumina short-read sequencing. These assemblies, they reported, cover more than 99 percent of the expected sequence and are more than 99 percent accurate at the structural and base-pair levels.

But the 47 individuals represented in this draft human pangenome reference all hail from the 1,000 Genomes Project.

By first focusing on individuals from that project, which represents 26 global populations, the consortium aimed to both improve its sequencing and assembly approaches and enrich the genetic diversity represented by the reference, said Eimear Kenny, a professor at the Icahn School of Medicine at Mount Sinai and a consortium member.

"A lot of work was happening on not only assessing [and] comparing technology and figuring out how different technologies could be knit together for a better representation of a genome, but also how that [technology] moves through pipelines in a production way that meets standards of quality," she added.

This work was enabled by the 1,000 Genomes Project individuals, who had provided consent allowing open access to their genomes. The researchers also had access to their parental genomes, which helped in phasing the assemblies.

Going forward, the consortium plans on bringing in sequencing data from other biobanks, as well as from additional populations and participants. For instance, the researchers have been contacting individuals from the Icahn School of Medicine at Mount Sinai's BioMe BioBank program about participating in the pangenome effort. BioMe participants, Kenny noted, are unselected from the Mount Sinai health system and reflect the diversity of New York City.

At the same time, the consortium is also partnering with other researchers and reaching out to populations around the world to participate. "We really, really, really want this to be a reference for all humanity. We want this to be representative, as far as possible, of as much of the population as we can," Paten said.

But the field is not always trusted by underrepresented groups. "We also recognize that genetics doesn't have a great history [with respect] to marginalized populations," he noted.

To address those issues, the consortium has an ethical, legal, and social implications working group embedded. During the first phase of the pangenome project, that group has been reaching out to other consortia and partners for ideas on the best frameworks for recruiting participants and the best models for consent. As the project is asking participants to openly share their data, Paten said, the consent needs to be ethical and respect participants.

"What we're trying to do in phase one is assess the types of models that are out there," Kenny said. "In phase two, we really want to have a principled way to generate evidence for what works [and] what doesn't work."

She noted, though, that some groups may opt not to participate, or to participate on their own terms.

Meanwhile, with the 1,000 Genomes Project participant data, the researchers tested different graph assembly approaches to present their draft pangenome reference. Paten pointed out that despite the millions of variations they contain, human genomes are actually quite similar, and a graph approach is a way of describing the relationships between them.
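To make the graph idea concrete, here is a minimal sketch in Python. It is a toy illustration only, not the consortium's method or any of its tools, and every name and sequence in it is hypothetical. Segments of sequence are stored once as nodes, and each haplotype is recorded as a path through those nodes, so shared sequence is shared and variation shows up as alternative routes.

```python
# Toy sequence graph. All node IDs, sequences, and haplotype names are
# hypothetical and chosen only for illustration.
nodes = {
    1: "ACGT",    # sequence shared by every haplotype
    2: "G",       # one allele of a single-nucleotide variant
    3: "A",       # the alternative allele of the same variant
    4: "TTC",     # shared sequence
    5: "GGGAGG",  # insertion carried by only some haplotypes
    6: "CA",      # shared sequence
}

# Each haplotype is an ordered walk through the nodes.
paths = {
    "hap_A": [1, 2, 4, 6],
    "hap_B": [1, 3, 4, 5, 6],
    "hap_C": [1, 2, 4, 5, 6],
}


def spell(path):
    """Reconstruct the linear sequence of one haplotype from its path."""
    return "".join(nodes[node_id] for node_id in path)


def edges(all_paths):
    """Collect every node-to-node step used by at least one haplotype."""
    found = set()
    for path in all_paths.values():
        found.update(zip(path, path[1:]))
    return found


def shared_nodes(all_paths):
    """Return the nodes that every haplotype passes through."""
    return set.intersection(*(set(path) for path in all_paths.values()))


def node_frequency(node_id, all_paths):
    """Fraction of haplotypes whose path includes the given node."""
    carriers = sum(node_id in path for path in all_paths.values())
    return carriers / len(all_paths)


if __name__ == "__main__":
    for name, path in paths.items():
        print(name, spell(path))
    print("shared nodes:", sorted(shared_nodes(paths)))
    print("insertion frequency:", node_frequency(5, paths))
```

In this toy, a single linear reference built from hap_A alone would contain no trace of node 5, even though two of the three haplotypes carry it. That mirrors, in miniature, how a common allele can be absent from a reference built largely from one person.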

The group used three different graph assembly approaches: Minigraph, Minigraph-Cactus, and the Pangenome Graph Builder. Each has its own nuances, Sedlazeck noted, with one being comprehensive, another focusing on SNVs, and the third on structural variants in pangenomes.

"Structural variants are rather complex and often hard to identify," Sedlazeck said. "Encompassing this kind of ethnicity-unique or just unique regions is really helping us to understand the diversity of these regions that are not represented in GRCh38."

For instance, the pangenome researchers were able to annotate and visualize the structure of five multiallelic CNV loci including the variable HLA-A region. Further, around HLA-A, they noted two previously reported deletion alleles but also homed in on a previously unreported insertion allele carrying an HLA-Y pseudogene. This insertion, they noted, occurred at high frequency, 28 percent, but was not seen in GRCh38.

"If it's not represented in GRCh38, obviously this hinders the research on this," Sedlazeck noted.

The consortium also mapped the annotated gene list from GRCh38 to the pangenome assemblies to find that all the genes contained in GRCh38 are also well represented there. Sedlazeck said, though, that it would have also been interesting to know what new genes or isoforms are in the assemblies, for example in large insertions.

Paten added that there is ongoing work to fill in the gaps in the assemblies with an eye toward generating telomere-to-telomere assemblies, as well as to improve the alignments and tools. Still, he noted there is already plenty of interesting genome biology to explore in the current assemblies.

"This is such a foundational resource for the entire field," Kenny added, noting that there are likely many unanticipated benefits from having a modern and diverse pangenome.