NEW YORK – An international team led by investigators at the University of California, Santa Cruz and the National Human Genome Research Institute has produced and begun analyzing a complete, gap-free assembly of the human X chromosome, work that has uncovered large stretches of the sex chromosome sequence missed in the past. The group shared details on this effort in a paper published in Nature on Tuesday.
"We have never actually seen these sequences before in our genome, and do not have many tools to test if the predictions we are making are correct," co-first author Karen Miga, a genomics researcher at the University of California, Santa Cruz and co-lead of the "Telomere-to-Telomere" (T2T) consortium, said in a statement. "This is why it is important to have specialists in the genomics community weigh in and ensure the final product is high-quality."
For the study, members of the T2T consortium generated deep ultra-long read Oxford Nanopore sequencing and several other datasets from a hydatidiform mole cell line. From there, they relied on de novo assembly methods, iterative polishing approaches, and manual finishing steps to put together a whole-genome assembly spanning almost 3 gigabases of sequence and, from it, a highly accurate X chromosome assembly that includes large stretches of repeat-heavy, difficult-to-sequence regions missed in the past.
Despite decades-long efforts to put together a complete human genome, gaps remain in the GRCh38 reference genome, Miga and her co-authors noted, explaining that "no one chromosome has been finished end to end."
With that in mind, the team set out to assemble a version of the X chromosome that stretched from one telomere, or chromosome end, to the other with no gaps, using a combination of high-coverage nanopore sequencing, Pacific Biosciences long-read sequencing, Bionano Genomics optical mapping, 10x Genomics and Illumina linked reads, and Hi-C interaction data.
After manual finishing, the researchers estimated that their de novo X chromosome assembly was more than 99.99 percent accurate and covered 29 regions missing from the previous X chromosome reference assembly.
They reported that the X chromosome assembly included some 3.1 megabases of sequence from the centromeric satellite array region, along with sequences from pseudoautosomal parts of the chromosome and from so-called cancer-testis ampliconic gene families.
"Our results demonstrate that finishing the entire human genome is now within reach and the data presented here will enable ongoing efforts to complete the remaining human chromosomes," Miga and her co-authors wrote, noting that new X chromosome sequences generated for the study will be stitched into upcoming versions of the human reference genome.
With the newly available sequence data, the team was also able to map methylation patterns in hard-to-resolve portions of the chromosome, from satellite array sequences to complex tandem repeats, while getting a glimpse at portions of the chromosome that vary from one individual to the next.
"We're starting to find that some of these regions where there were gaps in the reference sequence are actually among the richest for variation in human populations, so we've been missing a lot of information that could be important to understanding human biology and disease," Miga said, noting that "it is important to start figuring out how these differences contribute to human biology and disease."