PALM SPRINGS, Calif. — Researchers from the Telomere-to-Telomere (T2T) Consortium have generated an assembly of a complete human reference genome that could lead to better variant calling in the clinic and inform new studies of cell biology.
The results of the project were presented by Karen Miga, an investigator at the University of California, Santa Cruz, at the Association of Biomolecular Resource Facilities annual meeting on Wednesday.
Though the completion of the human genome was first announced in 2003, gaps have remained. Between 8 percent and 10 percent of the human genome has stayed opaque, especially the centromeres and the short arms of chromosomes.
With a combination of Oxford Nanopore and Pacific Biosciences sequencing and other approaches, the T2T Consortium generated a gapless and highly accurate human reference genome dubbed T2T-CHM13. According to Miga, this new reference improves variant calling, including for medically important variants; provides novel insights into duplicated gene families; and points to previously unknown properties of the kinetochore.
"This will launch a new era where it will be no longer acceptable to only survey a small portion of our genome," Miga said.
The new T2T-CHM13 assembly, which was presented last year as a preprint and is currently in press with a journal, includes 200 million bases that are not present in other references and nearly 2,000 new genes, 115 of which are predicted to be protein coding. It additionally provides a complete map of centromeres.
To build the assembly, the consortium sequenced the genome of a hydatidiform mole, in which the maternal genome is lost and only the paternal one remains, using a combination of long Oxford Nanopore reads and high-fidelity PacBio consensus reads. It then used a string graph approach and rounds of error correction to further improve accuracy. The assembly has a quality (Q) score of 73, or about one error every 10 million bases.
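For readers who want to check the arithmetic behind a Q score, the following is a minimal sketch that assumes the standard Phred definition, Q = -10 log10(p), where p is the per-base error probability. The function names are illustrative and are not taken from the consortium's software. Under that definition, Q70 works out to one error per 10 million bases and Q73 to roughly one per 20 million, so the figure quoted above is best read as a round, order-of-magnitude number.

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Per-base error probability implied by a Phred-scaled quality score."""
    return 10 ** (-q / 10)


def bases_per_error(q: float) -> float:
    """Expected number of bases between errors at quality score q."""
    return 1 / phred_to_error_rate(q)


def error_rate_to_phred(p: float) -> float:
    """Phred-scaled quality score for a per-base error probability p."""
    return -10 * math.log10(p)


# Compare a few quality levels, including the Q73 reported for the assembly.
for q in (60, 70, 73):
    print(f"Q{q}: about one error per {bases_per_error(q):,.0f} bases")
```

Each 10-point increase in Q corresponds to a tenfold drop in the error rate, which is why a difference of a few Q points at this level still amounts to millions of bases between errors.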
Because of this higher accuracy, Miga noted, variant calling will be improved. She and her colleagues mapped 3,000 high-coverage Illumina samples from the 1000 Genomes Project to both the current human reference genome and their new assembly. Mapping to the T2T-CHM13 assembly yielded hundreds of thousands of new variants per sample while also exposing tens of thousands of spurious variant calls. They further noted a 12-fold reduction in false positive calls in medically relevant genes.
This improvement is in part due to the current reference genome, hg38, being a composite of different individuals and a mix of sequences that represent both European and African ancestry. This has introduced linkage disequilibrium discordance, or segments from different ancestries that come together in ways that are not generally seen in the human population. Additionally, Miga said that some genes were missing, while others were in incorrect configurations in hg38.
Miga added that this extends to medically relevant genes such as those involved in profound hearing loss and muscle paralysis. The consortium, she noted, is working with the Genome Reference Consortium to make rapid updates to these medically relevant genes.
At the same time, T2T-CHM13 provides further insights into the biology of the centromere and kinetochore and how the genome is organized. The centromere is important for chromosomal segregation, including during early development, as well as in aging and cancer. By combining their genome assembly with protein and epigenetic data, Miga and her colleagues also found that the kinetochore tends to form where the youngest sequences are. "This is kind of like magma coming up to the Earth's surface and pushing things out," she said. The centromere and kinetochore also sometimes harbor duplications.
"The future work, of course, will be aimed at understanding how this genome and the variation that one can study in mapping to this genome … can tell us about new biology and function," Miga said.
According to Miga, this is also just a first step. The T2T Consortium is working with the Human Pangenome Reference Consortium to make these types of complete genome assemblies routine, as well as to generate additional human reference genomes that reflect the diversity of people across the world.