NEW YORK – The UK Biobank said on Thursday that it has made whole-genome sequencing data for its half a million participants available to approved researchers worldwide.
The newest release represents the world’s largest single set of sequencing data so far, the organization said, opening the door to new research into areas such as the molecular underpinnings of disease and drug development.
"What is special about this [dataset] is the combination of sequencing data with all of the other information in UK Biobank and the scale of UK Biobank," said Rory Collins, the organization’s principal investigator.
According to Collins, the WGS release follows the genetic datasets that the UK Biobank previously made available to researchers worldwide. In 2017, the UK Biobank finished releasing genotyping data for all of its 500,000 participants, he said, followed by exome-sequencing data for the whole cohort in the years after. About a year ago, the organization released whole-genome data for the first 200,000 participants as a pilot, which the current release now supersedes, he added.
Overall, the project cost approximately £200 million ($254 million), Collins said. UK Research and Innovation (UKRI), a government agency, and the Wellcome Trust covered about half of that amount, and four industry partners each provided £25 million to cover the other half.
Whole-genome sequencing was contracted to Decode Genetics, an Amgen subsidiary, and the Wellcome Sanger Institute. It was performed on the Illumina NovaSeq platform at an average sequencing depth of 30X, Collins said. In addition, the industry partners provided quality control and informatics support to make the data "more research-ready," he noted.
Additionally, as part of the charitable funding provided by the Wellcome Trust, the UK Biobank received a £20 million grant to develop a cloud-based analysis platform for the WGS data, Collins said. US data analysis firm DNAnexus won that contract and helped develop the so-called UK Biobank Research Analysis Platform, which is hosted by Amazon Web Services (AWS) in the London region, he said.
"The reason for linking [the cloud platform] with the whole-genome sequencing program was the scale of the data," Collins said. "Clearly, we're sequencing on a large scale, and moving it around to researchers is not feasible."
The UK Biobank’s WGS data is only available to researchers through the cloud environment, he noted, where researchers can also access other data types that have previously been made available from the biobank.
Collins said he also believes the use of the cloud analysis platform can help "democratize" the UK Biobank’s data by alleviating the local computing burden for researchers. In that regard, he noted that AWS offered $500,000 worth of free computing power every year to early-career or low- and middle-income country (LMIC) researchers who wish to use the platform.
Once the sequencing was completed, the four industry partners jointly carried out variant calling of the genomic data using the Illumina DRAGEN pipeline. The companies plan to publicly share their summary statistical analysis, including genome-wide association results, the UK Biobank said. Additionally, the firms were given nine months of exclusive access to the new data in return for their investment.
In an email, a GSK spokesperson said the company is using the WGS data to identify new therapeutic targets.
"A major challenge in drug development is that a large proportion of apparently promising medicine hypotheses are ultimately not successful in benefiting patients," the spokesperson said. "Having these insights from genetic, biological, and clinical data allows us to focus efforts on developing therapies with the highest chance of being successful, to get more effective medicines to patients faster."
According to the spokesperson, a "real strength" of the UK Biobank data is the combination of WGS data with both detailed health records and "a wealth of deep biological data," including blood proteomics. "This has helped us to predict the consequences of taking a new drug, so that we can identify early whether individuals are responding to that drug in the way we expect, and to identify individuals that may be most likely to benefit from a specific therapy," she added.
With the WGS data now available to the broader research community, Collins said the goal is to help propel the understanding of the biological underpinnings of disease, hopefully leading to more targeted drug discovery as well as the development of precision medicines.
To gain access to the UK Biobank database, Collins said, researchers will need to complete an application in which they describe their study interests, and the UK Biobank will vet their scientific credibility. About 30,000 researchers worldwide are already registered to use the database, he said.
"We're not trying to restrict the ways in which they work or the ways in which they approach the data," he said. "But we do want to be assured that they will be doing health-related research in the public interest."
As part of the application process, a researcher's institution is also required to sign a contract with the UK Biobank, agreeing to the terms and conditions of data usage and safety. Once granted access, researchers are also expected to make any findings derived from the biobank datasets publicly available, Collins said.
Additionally, researchers are obligated to provide an annual report to the UK Biobank indicating their current efforts with the data as well as any publications produced. At the end of the project, users are required to delete the data, he added.
The UK Biobank currently charges researchers a tiered access fee, Collins said, adding that the cost to obtain all the datasets, including the WGS data, is around £9,000 for three years. However, to promote equal access, the organization has secured grants to reduce or waive the charges for early-career researchers and those from low- and middle-income countries.
Besides genetic data, Collins said the UK Biobank has also been collecting participants’ lifestyle and health information, whole-body imaging scans, and blood protein markers.
Specifically, the organization is currently completing magnetic resonance imaging of 100,000 participants to build a longitudinal imaging dataset, an effort funded by the UK Medical Research Council (MRC), Calico Life Sciences, and the Chan Zuckerberg Initiative (CZI).
In October, researchers from Biogen and their collaborators also published results characterizing the plasma proteomic profiles of 54,219 UK Biobank participants as part of the Pharma Proteomics Project.
The goal of the UK Biobank moving forward is to continue expanding proteomic data analysis to cover all 500,000 participants. Additionally, "epigenetics is another area of course to be thinking about," Collins said.
Moreover, the UK Biobank has "ongoing conversations" with large population studies across the world, such as the All of Us research program in the US, he said, to find ways to enable cross-cohort analyses and data access. However, such efforts will face logistical constraints, he added, especially when they involve moving data across borders.
"UK Biobank is not the answer, clearly, to all questions," he said. "I think there is a real need to address the technological issues of how we do combined analyses across these large cohorts in different populations."