NEW YORK – UK Biobank investigators at Amgen subsidiary Decode Genetics, Reykjavik University, and other centers have shown that the vast collection of genetic variants revealed with whole-genome sequencing in more than 150,000 of the study's participants can improve efforts to find informative trait or disease associations.
"The large-scale sequencing described here, as well as the continued effort in sequencing the entire [UK Biobank cohort], promises to vastly increase our understanding of the function and impact of the noncoding genome," first and co-corresponding author Bjarni Halldorsson, a researcher at Decode Genetics and Reykjavik University, and his colleagues wrote in Nature on Wednesday.
"When combined with the extensive characterization of phenotypic diversity in the [UK Biobank]," they explained, "these data should greatly improve our understanding of the relationship between human genome variation and phenotype diversity."
As they reported at the American Society of Human Genetics annual meeting last year, members of the UK Biobank team at Decode and the Wellcome Sanger Institute performed whole-genome sequencing — to an average depth of more than 30-fold coverage — on 150,119 of the study's 500,000 participants. The effort was supported by firms such as Amgen, AstraZeneca, GlaxoSmithKline, and Johnson & Johnson, as well as the UK government.
In the newly published paper, UK Biobank researchers from centers in Iceland and Denmark described single nucleotide variants, small insertions and deletions, and larger structural variants found in the data, while highlighting three main ancestry clusters and related haplotype features.
"We define three cohorts within the UK Biobank: a large British Irish cohort, a smaller African cohort, and a South Asian cohort," the authors reported, noting that "[a] haplotype reference panel is provided that allows reliable imputation of most variants carried by three or more sequenced individuals."
The team identified more than 585 million single nucleotide variants, more than 58.7 million indels, almost 895,100 structural variants, and more than 2.5 million microsatellites. The investigators used that collection to search, within and across the genetic ancestry-based cohorts, for rare variants influencing conditions ranging from type 1 hemiplegic migraines or myotonic dystrophy to epilepsy, episodic ataxia type 2, or spinocerebellar ataxia type 6.
"Using this formidable new resource, we provide several examples of trait associations for rare variants with large effects not found previously through studies based on whole-exome sequencing and/or imputation," the authors reported.
The team noted that association analyses appeared to benefit from distinguishing between parts of the genome that do or do not vary among individuals with similar ancestral backgrounds. In particular, an analysis of a so-called depleted region (DR) score, which flags regions that are relatively devoid of genetic diversity, suggested that strong conservation often turns up outside of the protein-coding portions of the genome covered by exome sequencing.
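The general idea behind a depletion-style score can be illustrated with a minimal Python sketch. This is not the authors' method: the window size, step, function names, and ranking rule are all assumptions made for illustration, and the sketch ignores factors such as local mutation rate and sequence context that a real constraint measure would need to consider. It simply counts variants in sliding windows and ranks the windows, so that the windows with the fewest variants receive the lowest percentile.

```python
from bisect import bisect_left


def window_variant_counts(positions, region_start, region_end, window=500, step=50):
    """Count variants in sliding windows over [region_start, region_end).

    `positions` is a list of variant coordinates (hypothetical input).
    The default window and step sizes are illustrative assumptions.
    """
    positions = sorted(positions)
    counts = []
    for start in range(region_start, region_end - window + 1, step):
        end = start + window
        # Number of variants with start <= position < end.
        n = bisect_left(positions, end) - bisect_left(positions, start)
        counts.append((start, end, n))
    return counts


def depletion_percentiles(counts):
    """Rank windows by variant count.

    Returns (start, end, percentile) tuples, where the percentile is the
    fraction of windows with strictly fewer variants. Windows with the
    fewest variants (the most depleted) therefore score closest to 0.
    """
    sorted_counts = sorted(n for _, _, n in counts)
    total = len(sorted_counts)
    return [
        (start, end, bisect_left(sorted_counts, n) / total)
        for start, end, n in counts
    ]


if __name__ == "__main__":
    # Toy data: variant positions in a 100-base region, 20-base windows.
    toy_positions = [5, 12, 18, 25, 31, 90, 95]
    toy_counts = window_variant_counts(toy_positions, 0, 100, window=20, step=20)
    for start, end, pct in depletion_percentiles(toy_counts):
        print(start, end, pct)
```

In this toy example, the two windows with no variants rank as the most depleted, mirroring the notion of regions that are relatively devoid of genetic diversity.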
"We expect the DR score presented here to be an important resource for identifying genomic regions of functional importance, although further evaluations should be taken to understand its properties and implications and how it compares to other measures of conservation and sequence constraint," the authors explained.
The investigators reportedly plan to sequence the genomes of all 500,000 UK Biobank participants in the coming years. Individuals enrolled in the study have already been assessed using exome sequencing, phenotypic profiling, and other approaches.
"Data of this type and quantity are going to revolutionize our ability to identify and characterize intergenic sequences of importance to human diversity, be it to risk of disease and response to treatment or some other attributes," senior and corresponding author Kari Stefansson, Decode founder and CEO, said in a statement.