NEW YORK – Researchers from the Telomere-To-Telomere Consortium shared the first insights from analyzing the gapless human complete hydatidiform mole genome assembly this week at the American Society of Human Genetics annual meeting, held virtually.
The assembly, which was first released as a preprint earlier this year, includes more information than initially thought, with about 200 Mb more content than any other assembly, including more than 2,000 new genes, of which 115 are predicted to be protein coding. The researchers found previously unobtainable data on features such as centromeres, satellite tandem repeats, acrocentric short arms, and segmental duplications. Five new chromosome arms are visible for the first time, containing 66.1 Mb of new sequence, said Adam Phillippy, T2T co-chair and head of genome informatics at the National Human Genome Research Institute, during a conference session on Tuesday. These account for most of the new genes, including 879 ribosomal RNA genes.
"There's more questions than answers revealed by this genome," said Nicolas Altemose, a postdoc at the University of California, Berkeley, who presented a genome-wide analysis of satellite tandem repeats in chromosome centromeres during the same session. But using the assembly, his team was able to provide evidence for the so-called "layered expansion" model of centromere evolution, where regions further from the centromere core represent older transposable elements. "These molecular fossils can tell us how old different layers of the centromere are," he said.
About 81 Mb of the new data were associated with segmental duplications, long sections of DNA on different chromosomes that share more than 90 percent sequence identity and house about half of all copy number variants. About 35 Mb of those were on acrocentric short arms, and 182 of the new genes are associated with segmental duplications, said Mitchell Vollger, a doctoral student in Evan Eichler's lab at the University of Washington, who also presented during the session.
Additionally, the assembly contains approximately 3 million more CpG sites than Genome Reference Consortium Human Build 38 (GRCh38), and those additional sites are more likely to be methylated, according to Ariel Gershman, a doctoral student in Winston Timp's lab at Johns Hopkins University, another presenter.
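For readers unfamiliar with the term, a CpG site is simply a cytosine immediately followed by a guanine on the same strand. The following Python sketch shows how such sites could be tallied and compared between two sequences. It is an illustration only, not the consortium's method: the function names and the toy sequences are invented here, and the count says nothing about methylation, which has to be measured separately.

```python
def count_cpg_sites(sequence: str) -> int:
    """Count CpG dinucleotides (a C immediately followed by a G) on one strand."""
    # "CG" cannot overlap with itself, so str.count finds every occurrence.
    # upper() makes the count ignore soft-masking (lowercase bases).
    return sequence.upper().count("CG")


def extra_cpg_sites(new_assembly: str, old_assembly: str) -> int:
    """Difference in CpG count between two sequences (positive if the new one has more)."""
    return count_cpg_sites(new_assembly) - count_cpg_sites(old_assembly)


# Toy sequences, invented for illustration only; not real genomic data.
toy_old = "ACGTTTACGA"
toy_new = "ACGTCGCGTTACGA"
print(extra_cpg_sites(toy_new, toy_old))
```

On real assemblies the same count would be run chromosome by chromosome over FASTA records rather than on short strings.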
The findings should have implications for fluorescence in situ hybridization probe design and even clinical genomic testing, the researchers said.
The T2T Consortium announced the first gapless human genome assembly, minus the Y chromosome, in June, bringing the field tantalizingly close to a truly complete representation of life's cookbook. The consortium has been working on the project since 2018 and delivered a gapless assembly of the X chromosome in 2020 based on long reads from Oxford Nanopore Technologies' platform. It later switched to a strategy that employed mostly HiFi reads from Pacific Biosciences, with help from ultra-long nanopore reads.
The conference talks from Phillippy, Altemose, Vollger, and Gershman cover about half of the papers currently slated to come out of the work on the assembly, known as T2T-CHM13. Papers on genetic variation and the transcriptional and epigenetic state of repeat elements are also in the works, they noted.
The analyses led by Gershman and Vollger were posted on bioRxiv in May, while Altemose's analysis was posted in July.
One unique finding mentioned by both Altemose and Gershman was that all human centromeres appear to have regions with lower CpG methylation, which are also associated with centromere protein A, or CENP-A, binding.
"CENP-A [binding regions] tend to overlap younger, more recent, expanding sequences," Altemose said. "There are lots of interesting questions about why these regions coincide." There could be "neutral" explanations for why this might occur, he said, but noted it's possible the kinetochore, the complex of proteins associated with the centromere, is playing an active role in selecting sequences that preferentially bind the centromere.
The hypomethylated centromeric dip region, or CDR, occurs in all centromeres, Gershman said, a fact that was validated using different samples: HG002, as well as cell lines of diverse lineages from the 1000 Genomes Project. Centromeres are highly variable, and the CDR matched up with different parts of the arrays of satellite repeats in every individual the researchers looked at.
"This is the first look at population-level epigenetic variability in human centromeres," Gershman said.
The results have several clinical implications. Vollger provided an analysis of a tandem repeat domain in the lipoprotein(a) gene, or LPA, one of the genes most associated with coronary heart disease risk. "It's its repeat content that's really important for its risk factor," he said, with lower copy number being associated with higher risk. Looking at 20 different haplotypes, the researchers found both copy number variation and coding variation in LPA.
During the Q&A session, Altemose was asked if clinical labs should redesign their FISH probes based on the new reference genome. "The short answer is yes, absolutely," Altemose said. "We have the power to develop very specific probes for FISH or Cas9 [genome] editing."
"We're uncovering a whole new field of probe design," Phillippy added.
As for variant calling, "a lot of the variants we know the effects of exist in GRCh38," Phillippy said. "Those variants still work," and the fact that there are validated assays to find them means he sees value in using both reference genomes.
"A reference is only as good as the resources associated with it," he said. "We see it as a big need of the community to develop lift-over tools, that allow us to translate between these coordinate systems."
Phillippy also said that a gapless human Y chromosome assembly is in the works and should be available in the next several months.
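The lift-over step Phillippy described as a community need amounts to mapping a position in one reference's coordinate system onto the matching position in another, using blocks of sequence known to align between the two. The Python sketch below illustrates the idea under simplifying assumptions: same-strand, ungapped blocks only, with hypothetical block coordinates and names (`AlignedBlock`, `lift_over`) invented for this example. Established tools such as UCSC liftOver work from chain files and handle strand, gaps, and intervals, none of which is modeled here.

```python
from typing import List, NamedTuple, Optional, Tuple


class AlignedBlock(NamedTuple):
    """An ungapped, same-strand block shared by two references (hypothetical)."""
    src_chrom: str
    src_start: int  # 0-based, inclusive, in the source reference
    dst_chrom: str
    dst_start: int  # 0-based, inclusive, in the destination reference
    length: int


def lift_over(chrom: str, pos: int,
              blocks: List[AlignedBlock]) -> Optional[Tuple[str, int]]:
    """Translate one source position; return None if no block covers it."""
    for block in blocks:
        in_block = block.src_start <= pos < block.src_start + block.length
        if block.src_chrom == chrom and in_block:
            # Keep the same offset from the start of the block.
            return block.dst_chrom, block.dst_start + (pos - block.src_start)
    return None


# Invented coordinates for illustration; not taken from any real alignment.
toy_blocks = [
    AlignedBlock("chr1", 100, "chr1", 150, 50),
    AlignedBlock("chr1", 200, "chr1", 400, 100),
]
print(lift_over("chr1", 120, toy_blocks))
print(lift_over("chr1", 160, toy_blocks))
```

A position that falls between blocks returns `None`, which mirrors the practical problem: sequence present in only one reference has no counterpart in the other.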