NEW YORK – In a collection of studies published on Thursday, researchers outline the genome sequence, genetic variant, epigenetic, expression, and other insights being gleaned from a complete "telomere-to-telomere" (T2T) version of the human reference genome.
The gapless human genome assembly, dubbed T2T-CHM13, spans more than 3 billion base pairs and contains almost 200 million bases of sequence that were missing from every iteration of the human reference genome released since the original appeared more than two decades ago, members of the T2T Consortium explained.
"The T2T-CHM13 assembly adds five full chromosome arms and more additional sequence than any genome reference release in the past 20 years," study authors wrote in Science on Thursday. "This 8 percent of the genome has not been overlooked because of a lack of importance, but rather because of technological limitations."
Corresponding authors included Adam Phillippy, a genome informatics researcher at the National Human Genome Research Institute; Karen Miga, from the University of California at Santa Cruz (UCSC); and Evan Eichler of the University of Washington.
The team relied on a variety of technologies, including long-read approaches such as circular consensus sequencing, along with short-read sequencing, BioNano Genomics optical mapping, Strand-seq, and other methods, to generate new data on a complete hydatidiform mole sample. Those data were subsequently brought together using up-to-date assembly strategies.
"As reported in this exciting paper, by leveraging PacBio HiFi sequencing technology, the T2T Consortium has achieved a goal that scientists around the globe have been pursuing for more than 30 years — a truly complete human genome sequence, one that can potentially help create the basis for a complete medical genome to better human health," PacBio President and CEO Christian Henry said in a statement.
"The complete, telomere-to-telomere assembly of a human genome marks the next era of genomics and opens up huge research potential in human health and disease," Oxford Nanopore Technologies CEO Gordon Sanghera said in a statement, noting that the company's long-read sequencing technology contributed significantly to the work.
Among other additions, the T2T-CHM13 assembly houses new satellite repeat sequences, segmental duplications, and some 1,956 predicted gene sequences not found in previous genome assemblies, including 99 predicted protein-coding genes, the researchers noted. It also contains known sequences that were not accurately mapped to a reference assembly in the past.
"T2T-CHM13 includes gapless telomere-to-telomere assemblies for all 22 human autosomes and chromosome X, comprising 3,054,815,472 bp of nuclear DNA, plus a 16,569-bp mitochondrial genome," the authors wrote.
Even so, they cautioned that while the assembly "represents a complete human haplotype, it does not capture the full diversity of human genetic variation. To address this bias, the Human Pangenome Reference Consortium has joined the T2T Consortium to build a collection of high-quality reference haplotypes from a diverse set of samples."
For one of the accompanying papers in Science, investigators from Johns Hopkins University, Cold Spring Harbor Laboratory, and other centers highlighted the improved genetic variant analyses that are possible with the completed T2T-CHM13 assembly. By mapping to T2T-CHM13 more than 3,200 genomes generated for the 1000 Genomes Project from individuals in 17 human populations, they unearthed a slew of new variants and ruled out thousands of others.
"We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary and biomedical discovery," authors of that study wrote. "Simultaneously, this reference eliminates tens of thousands of spurious variants per sample, including reduction of false positives in 269 medically relevant genes by up to a factor of 12."
A UCSC and Johns Hopkins-led team focused on DNA methylation profiles, DNA accessibility, chromatin immunoprecipitation sequence mapping, and other epigenetic features found across the T2T-CHM13 assembly, including maps covering some 32.3 million CpG methylation sites.
"The improvements in epigenetic profiling using T2T-CHM13 set the foundation for complete assemblies and long-read epigenetics for major biological advancements," the authors wrote, noting that the work "marks the start of exploration into duplicated and repetitive portions of the epigenome, pioneering the exploration of epigenetics in a complete human genome."
Still other Science studies outlined some of the epigenetic, transcriptional, regulatory, and gene expression features found in specific parts of the genome, including centromeric sites and repeat elements, while providing insights into the segmental duplications revealed through comparisons to the newly complete human reference assembly.
In Nature Methods, meanwhile, members of the team outlined the technological and computational approaches used to complete, polish, and validate the genome assembly. The T2T Consortium and UCSC's Miga reported on the dramatic human reference genome upgrade at the Association of Biomolecular Resource Facilities conference this week.
"The work to date by the T2T Consortium is an important part of this vision, because it provides the blueprint by which routine, de novo assembly of individual genomes will be possible," Deanna Church, VP of Inscripta's mammalian business area and software strategy, wrote in her own Science piece.
"The continued work of the T2T Consortium, along with the Human Pangenome Reference Consortium, which aims to produce high-quality assemblies from diverse human populations, will provide additional protocols for routine, diploid assemblies," Church explained, "as well as the data structures and tools needed to produce a reference assembly that can represent all possible sequences in a population."