NEW YORK – Using a plethora of sequencing technologies and computational tools, researchers from the Human Genome Structural Variation Consortium (HGSVC) have assembled dozens of near-complete human genomes to elucidate complex structural variants in the human genome that were previously deemed intractable.
The researchers hope that the dataset, described in a preprint posted to bioRxiv in September, can serve as a resource for the scientific community to further explore the biomedical relevance of complex variants in the genome.
The findings also add to data previously released by the international consortium, which is funded by the US National Institutes of Health and aims to systematically catalog structural variants, primarily using samples from the 1000 Genomes Project.
"This is the first study [for the consortium] where we have a sizable number of telomere-to-telomere chromosomes," said Jan Korbel, head of data science at the European Molecular Biology Laboratory (EMBL) and one of the corresponding authors of the preprint. "Our aim is to understand structural variation throughout the genome, in particular in regions where the structural variation is more complex."
For their study, the HGSVC researchers sequenced 65 human genome samples that represented five continental groups and 28 populations, generating 130 haplotype-resolved genome assemblies. Sixty-three of these samples were from the 1000 Genomes Project, with the remaining two from the International HapMap Project and the Genome in a Bottle Consortium.
The broad consent granted for the use of these samples enables the consortium to distribute the results openly, including primary sequencing data as well as structural variant calls, Korbel noted.
To construct the genome assemblies, the researchers used both HiFi sequencing from Pacific Biosciences and nanopore sequencing from Oxford Nanopore Technologies, leveraging the former's high accuracy and the latter's ultra-long-read capabilities, Korbel said.
On average, the study achieved 47X coverage per sample with PacBio HiFi sequencing on the Sequel II or Revio platforms and 56X coverage with nanopore sequencing on the Oxford Nanopore PromethION device with R9.4.1 flow cells. For nanopore sequencing, the average coverage depth for ultra-long reads (those longer than 100 kb) was 36X per sample, according to the study.
Additionally, the authors performed single-cell template strand sequencing (Strand-seq), optical genome mapping, Hi-C sequencing, isoform sequencing (Iso-seq), and RNA sequencing.
The HGSVC team constructed haplotype-resolved assemblies using Verkko, an automated hybrid genome assembly algorithm developed by researchers at the National Human Genome Research Institute (NHGRI). The phasing signal for the assembly process was generated using Graphasing, which leverages Strand-seq data to globally phase assembly graphs, allowing researchers to produce chromosome-scale de novo haplotypes for diploid genomes without parental sequencing data.
In certain challenging genomic regions, such as centromeres or the Yq12 region, the researchers also supplemented Verkko with Hifiasm, a de novo assembly tool developed by Dana-Farber Cancer Institute researcher Heng Li and his team.
By using two long-read sequencing technologies, the study authors said they were able to close 92 percent of previously reported gaps in genome assemblies that used only PacBio HiFi reads. Moreover, they achieved telomere-to-telomere status for 39 percent of the chromosomes analyzed in the study.
In these near-complete genomes, the HGSVC researchers identified 188,500 structural variants (SVs), 6.3 million indels, and 23.9 million single-nucleotide variants (SNVs) by comparing them against the T2T-CHM13v2.0 reference. When using GRCh38-NoALT as a reference, the researchers cataloged 176,531 SVs, 6.2 million indels, and 23.5 million SNVs.
As part of the study, the researchers also delved into many disease-associated genomic regions, where structural variants had not been comprehensively studied due to their challenging sequences.
One such analysis focused on the 5 Mb major histocompatibility complex (MHC) region. After analyzing 130 complete or near-complete MHC haplotypes, the researchers identified 170 SVs that had not been previously reported. They also uncovered a previously unknown copy number variant, a deletion of HLA-DPA2 on one haplotype, as well as low-frequency gene-level SVs, such as a deletion of MICA on one haplotype.
Another disease-relevant and structurally complex part of the genome is the region containing the SMN1 and SMN2 genes, which are implicated in spinal muscular atrophy (SMA). During the study, the researchers were able to assemble, validate, and profile two-thirds of haplotypes in that region, fully resolving the structure and copy number of SMN1/2, SERF1A/B, NAIP, and GTF2H2/C.
Lastly, the HGSVC team sought to tackle centromeres, often considered the most structurally challenging regions of the human genome due to α-satellite tandem repeat DNA. They completely assembled and validated 1,246 human centromeres, uncovering 4,153 new α-satellite higher-order repeat (HOR) variants and novel array organization among the active α-satellite HOR arrays.
"For me, this [study] is really exciting," said Danny Miller, a physician-scientist and nanopore sequencing expert at the University of Washington. "I think it shows that we can now consistently and reproducibly resolve complex variations using long-read sequencing."
Miller, whose team is reanalyzing 1000 Genomes Project samples with nanopore long-read sequencing in a separate study to build a comprehensive structural variant catalog, added that the paper will help researchers better understand the structural variants in some of the most challenging regions of the genome.
For instance, the HGSVC researchers demonstrated the diversity of haplotypes spanning the SMN region. With such information, clinicians can now start to ask whether there are individuals who are more susceptible to having an SMN1 deletion or other mutational events, he noted.
Miller also applauded the authors' efforts to investigate structural variants in challenging genomic regions such as centromeres. Their findings will help other researchers generate hypotheses and study the clinical relevance of these SVs moving forward, he said.
According to Korbel, the data for the current study are available on the International Genome Sample Resource (IGSR) server hosted by EMBL. The consortium also plans to share the data, including the raw sequencing reads, on the Amazon cloud to facilitate computing, he noted.
Despite the progress made by the HGSVC to fill the gaps in the human genome, there are still some thorny regions remaining where the team was "underpowered to see everything," Korbel said. Most of these unresolved segments are in the acrocentric short arms of chromosomes 13, 14, 15, 21, and 22, he said, which are known to undergo extensive ectopic recombination and have the highest degree of sequence homology.
Korbel noted that his team will collaborate with the Human Pangenome Reference Consortium (HPRC) to tackle these remaining dark spots of the genome.
"As the sequencing quality goes up further, we will immediately look into those regions that we are currently still not able to fully resolve to see what they reveal to us in terms of structural variants," Korbel said.