NEW YORK – A team led by Washington University in St. Louis researchers has profiled rare and ultra-rare structural variants in nearly 18,000 high-coverage whole-genome sequences in an effort to fill in remaining gaps in the understanding of the larger variants that impact protein-coding and non-coding portions of the genome.
Reasoning that "tools and resources for the study of [structural variants] have lagged behind those for smaller variants," the researchers relied on an open-source, scalable analytical pipeline centered on the existing svtools software to search for insertions, deletions, duplications, inversions, and other structural variants in genomes from 17,795 individuals with European, African, or Latino ancestry. They also assessed the variants' predicted impacts on gene or non-coding element dosage.
The participants included cases or controls enrolled through the National Human Genome Research Institute Centers for Common Disease Genomics program, the Population Architecture Using Genomics and Epidemiology (PAGE) consortium, the Simons Genome Diversity Panel, or other projects. Each genome was sequenced to at least 20-fold coverage, the team noted in a paper published in Nature on Wednesday.
"The sample size and use of deep [whole-genome sequencing] allowed us to map rare [structural variants] at high genomic resolution and estimate the relative burden of deleterious [structural variants]," senior and corresponding author Ira Hall, a genetics and medicine researcher affiliated with the Washington University School of Medicine and the center's McDonnell Genome Institute, and his co-authors wrote, noting that the work represents the largest genome sequence-based analysis of human structural variants done so far.
"We publicly release site-frequency data to create the largest [whole-genome sequencing-based structural variant] resource to date," the authors added, though they cautioned that the algorithms available so far likely under-represent some repetitive structural variants such as mobile element insertions, short tandem repeats, and multi-allelic copy number variants.
Among the 4,442 structural variants found in each genome, on average, deletions turned up most frequently, followed by mobile element insertions and tandem duplications. Each participant had an average of 2.9 rare structural variants in protein-coding portions of the genome, explaining anywhere from 4 percent to more than 11 percent of the rare, high-impact alleles that have been described in genes in the past.
Expanding on that, the team estimated that some 17 percent of rare loss-of-function variants in the protein-coding genome may be traced back to structural variants. On the non-coding sequence side, each genome harbored more than 19 deletions, on average. Those variants appeared to have an outsized impact on disease risk, based on the proportion of rare, non-coding deletions that were classified as deleterious.
"Noteworthy is the burden of rare, strongly deleterious non-coding deletions apparent in our dataset," the authors wrote, noting the such findings "indicate that comprehensive assessment of [structural variants] will improve power in rare variant association studies."
Beyond those analyses, the investigators characterized almost 159,000 ultra-rare structural variants, and took a closer look at the broader gene dosage consequences of the structural variants mapped to coding and non-coding genome sequences.
"At genes, our results complement existing estimates from exome sequencing and microarray data," they reported. "At non-coding elements, we observe strong correlations with measures of nucleotide conservation, purifying selection, regulatory element activity, and cell-type specificity."