NEW YORK (GenomeWeb) – An international team led by investigators in the Netherlands has established a detailed map of structural variants, small insertions and deletions, and previously undetected sequences in the Dutch population.
As part of the Genome of the Netherlands (GoNL) project, the researchers used genome sequence data for almost 800 individuals from 250 Dutch families to map nine types of structural variation in a haplotype-resolved manner. The study, published yesterday in Nature Communications, uncovered millions of bases of novel sequence and roughly 1.9 million variants, including structural variants in linkage disequilibrium with SNPs that have been implicated in disease.
The authors of the study noted that "current reference panels contain single nucleotide polymorphisms, insertions and deletions of up to 20 [base pairs] in length but only a very limited number of structural variants larger than 50 [base pairs] in size."
To begin filling in gaps in the genetic variant catalogs, the researchers set out to assemble a high-quality, haplotype-resolved variant reference panel of structural and complex variants, using sequence data from families profiled for the GoNL study.
They analyzed Illumina whole-genome sequence data for 769 GoNL participants from 231 parent-child trios and 19 parent-twin families, sequenced to an average of almost 15-fold base coverage. The analysis focused on nine types of structural variants or insertions and deletions, which the team identified with help from a dozen variant detection tools. These variants included everything from deletions and duplications to inversions, mobile element insertions, and novel stretches of sequence.
More than one third of the variants detected in the analysis had not been described before, the team noted, and most could be confirmed by Sanger or targeted Illumina sequencing.
Within the structural variant set, for example, the team detected almost 20,000 deletions larger than 100 base pairs, nearly 1.1 million deletions smaller than 20 base pairs, and more than 24,000 deletions affecting 21 bases to 100 bases of sequence — mid-size deletions that appear to have been underrepresented in past variant analyses.
When the researchers homed in on reads that mapped discordantly, or did not map, to the human reference genome, they detected some 4.3 million bases of sequence not found in the GRCh38 reference genome assembly. At least 11 of those sequences seem to coincide with expressed transcripts, suggesting they may represent protein-coding genes.
The team went on to phase almost 1.8 million insertions and deletions in the Dutch genomes, along with 54,650 structural variants. A closer look at relationships between the structural variants and previously described SNP associations hinted that at least some of the structural variants could affect the regulation of genes implicated in human traits or disease.
"Difficulties remain in capturing large and complex structural variants, especially those in repetitive regions," the authors wrote. "Evolving third-generation single molecule and long read sequencing, and further methodological advances such as global genome map technology, may further improve the discovery, genotyping, and phasing of structural variants."