NEW YORK – An industry-led team has generated whole-genome sequences for more than 150,000 UK Biobank participants, producing a dataset that is expected to have applications in disease biology, treatment development, and drug safety, attendees heard at the American Society of Human Genetics annual meeting, held virtually last week.
There, investigators from the UK Biobank and several firms outlined the rationale for the genome sequencing project, along with some of the analytical approaches being used to interrogate the data and the types of analyses anticipated as the data become more widely available.
"It's been an ambitious sequence-generating project, and one that's been run during very trying circumstances over these past couple of years," explained UK Biobank Deputy CEO Mark Effingham. "As we come toward the end of the project with the first public release of data coming up in early November, this provides the chance to get some early insight about what these data are going to provide."
"It's very exciting times as these whole-genome sequencing data start to become available, and to be used by researchers all around the world for improving our understanding of human health," he added.
The genome sequencing effort was led by investigators at Amgen, AstraZeneca, GlaxoSmithKline, and Johnson & Johnson, with charitable and UK government funding. The genome set is expected to complement existing genetic, biochemical, biomarker, metabolic, and health record and health questionnaire data being collected for the broader UK Biobank cohort, which comprises some 500,000 participants.
"These data will be a very exciting addition to the UK Biobank dataset, adding further genetic characterization on the half million UK Biobank participants, beyond the genotyping and whole-exome sequence data that is already available," Effingham said.
As part of the same ASHG session, Bjarni Halldorsson, a researcher with Amgen-Decode Genetics, presented information on variants being called in 150,119 whole-genome sequences from the UK Biobank, sequenced to around 30-fold average coverage apiece. Of those, he explained, 90,667 genomes were sequenced at Decode and the remaining 59,452 genomes were sequenced at the Wellcome Sanger Institute using two more distinct analytical pipelines.
The team looked at the resulting 2.7 petabytes of data using four variant calling methods that collectively pointed to hundreds of millions of SNPs and indels, most of them rare, along with some 12.5 million microsatellites and about 895,000 structural variants in the genome sequences, he explained.
Together with exome and genotyping data from the UK Biobank project, the genome sequences (which included more than three dozen parent-child trio genomes) also provided a look at the proportion of different mutation types, imputation patterns, false discovery rates, true positive calls, batch effects, variant calling biases, and so on.
The data also revealed participant subpopulations found in the UK Biobank, including large clusters of individuals with British/Irish, African/Caribbean, or South Asian ancestry, as well as some of the genetic variation and geographic distinctions between them.
For their part, AstraZeneca's Katherine Smith, GlaxoSmithKline VP of Target Sciences John Whittaker, and Mary Helen Black from J&J's Janssen Pharmaceutical discussed short tandem repeat, drug discovery, and drug safety analyses of the UK Biobank genomes, respectively.
Smith touched on the role that STR expansions in coding or noncoding parts of the genome can play in conditions ranging from Huntington's disease to Friedreich ataxia and fragile X syndrome, for example, and presented findings from predicted pathogenic STR profiling and phenome-wide STR association analyses done using more than 136,800 UK Biobank genomes.
She and her team have already identified three dozen phenotypes with apparent STR associations, including 10 conditions that seem to involve STRs falling outside of a chromosome 6 major histocompatibility complex region.
Smith cautioned that the initial PheWAS "did not identify any new associations of large effect between short tandem repeats greater than the [150 base pair] read length and binary phenotype." Likewise, she noted that some known phenotype-STR associations remain undetected, likely due to power, technical, and recruitment considerations.
Still, she argued that the newly sequenced UK Biobank genomes "are a valuable resource for studying short tandem repeat variation" and suggested that "this picture may change as the rest of the 500,000 participants undergo genome sequencing and as the participant phenotype records are updated over time."
GSK's Whittaker walked conference attendees through the impact that whole-genome sequencing-based genetic associations can have on efforts to unearth new therapeutic targets in drug discovery programs.
In particular, he and his colleagues have focused on rare variant associations in not only predicted protein-coding parts of the genome but also those found in noncoding regions, in relation to roughly 100 quantitative or binary traits or conditions phenotyped for the UK Biobank. In genome sequences from nearly 125,400 participants of European ancestry, that search confirmed several known rare variant associations and some potential new signals.
"Probably what it's going to be most useful for is telling us more about the biology and the causal genes that are driving signals we've already spotted with [genome-wide association studies]," Whittaker speculated. "But that in itself is tremendously useful."
Future genomic analyses and association studies will rely on an accurate understanding of the context of a given region, he noted, which in turn requires clear annotation of regulatory contributors in the genome and an understanding of variant interactions with one another. And the power to find previously undetected binary trait associations will continue to grow as still more genomes are sequenced from the wider 500,000 UK Biobank participant group.
"Clearly what we need to do is integrate all variation, including the structural or SNV variation," Whittaker said, explaining that "that's the key, really, to making good use of these data."
Black, who is a population analytics, computational sciences discovery, product development and supply researcher in R&D at Janssen, suggested that UK Biobank genomes and similar large genetic datasets may help in future drug safety assessment studies.
Along with experimental data, target expression analyses, published research, and other data, human genetics profiles are used to establish so-called "actionable target liability assessment plans" in pre-clinical stages of drug development and toxicology analyses, she explained. "Target liability assessments have increasingly leveraged human genetics evidence to inform overall safety risk and facilitate execution of de-risking strategies."
In the case of UK Biobank data, bringing in whole-genome sequences should add another layer to the genetic, phenotypic, lifestyle, and other profiles previously described for study participants, Black suggested, supporting efforts to come up with new therapeutic compounds and to assess drug safety.
Indeed, her team has already developed a target liability assessment workflow that incorporates whole-genome sequence, exome sequence, sequence annotations and allele frequencies, clinical records, biomarkers, and other clues to find informative phenome-wide associations and loss-of-function mutations.
Such data can then be assessed alongside several more data types — from expression quantitative trait locus (QTL), methylation QTL, and protein QTL profiles to differential gene expression, colocalization, gene knockout, and animal model data — to search for causal contributors to adverse reactions, Black explained, using the case of a targeted inhibitor proposed for Parkinson's disease treatment as an example.
"We've shown that analytical workflows leveraging large-scale, population-based genomics data linked to clinical information, such as that available in the UK Biobank, can be used to inform drug safety studies," she said, adding that "the UK Biobank whole-genome sequencing consortium is generating whole-genome data on all 500,000 individuals [enrolled in the UK Biobank project], which combined with the vast constellation of phenotype data, will be available in 2022."