CHICAGO – UK Biobank is developing its own data analytics platform in hopes of making its vast dataset more accessible to researchers worldwide.
The national-scale sequencing and longitudinal health research program said this week that it has chosen DNAnexus and Amazon Web Services (AWS), following an open bidding process, to build the platform to manage its increasingly massive database, which is expected to grow to 15 petabytes within five years. The analytics platform will allow for wider access to the store of data on 500,000 British volunteers, as well as faster processing times and lower overall cost, officials from the London-based program said.
UK Biobank expects to have the technology online by October for testing with its four main pharmaceutical partners — Amgen, AstraZeneca, GlaxoSmithKline, and Johnson & Johnson — with the goal of opening the platform up to the general research community in mid-2021.
"The plan is that as soon as the platform is in place in October, those four companies will start using the platform for analyses of the first tranche of sequence data, along with other UK Biobank data," said UK Biobank Principal Investigator and CEO Rory Collins. There will be further development and testing before the technology is opened up to the research public next year.
Funding is coming primarily from the Wellcome Trust. AWS also has pledged $1.5 million in research credits for its cloud platform to support researchers in low- and middle-income countries and early-career scientists worldwide in an effort to "democratize" access to the dataset, according to Collins.
Amazon has not offered more specifics on who would be eligible for free access, though Collins said that UK Biobank, AWS, and DNAnexus agreed that they want the platform to be used widely. "Any barriers we can remove to investigation by different people on the data is a good thing," he said. "That, I think, is in many ways the most exciting aspect of this."
Until now, UK Biobank has been sending data to credentialed researchers on request. Within the last year, however, the program began discussions with its main benefactors, namely the Wellcome Trust, the UK's Medical Research Council, and the four pharma partners funding UK Biobank sequencing.
UK Research and Innovation and the Wellcome Trust are each contributing £50 million ($65.5 million) to UK Biobank as a whole. The four pharma firms are providing a total of £100 million.
The Wellcome Sanger Institute and Decode Genetics, a unit of Amgen, last year began whole-genome sequencing of biobank samples. Collins said this week that about 150,000 samples have been sequenced, and all should be completed by the summer of 2021.
According to Collins, those partners have been discussing for the better part of a year what it would take from a bioinformatics perspective to manage a half-million whole-genome sequences.
"The rationale for the platform and for the funding was the scale of the data," Collins said. "It was necessary for it to include funding for a data analysis platform because the data would be of a scale that it would be impractical to send the data to researchers."
Collins said that there are about 15,000 researchers globally working on about 1,500 projects with UK Biobank data, but the vast majority are in "well-funded areas" like Europe, North America, and Australasia. Most work in academia or the pharma industry.
"One hope is that this may well bring in a lot more of the bioinformatics community," Collins said. "It is a very large, very rich, systematic database that, despite having thousands of users, is still, in my view, underused by a lot of really smart scientists."
He would like to see the analysis platform attract researchers who do not currently have well-developed computing infrastructure at their fingertips. Collins said that DNAnexus has promised to develop the technology so it is usable by researchers without much of an informatics background.
"Over the past 11 years, DNAnexus has supported the diverse scientific aims of researchers worldwide, accelerating digital transformation by simplifying complex data analysis, clinical data management, and insights at scale," DNAnexus CEO Richard Daly said in a statement. "We enthusiastically support the foundational UK Biobank project as it breaks new ground in the advancement of disease research through the integration of deep healthcare data with genomics and advanced tools."
A spokesperson for Mountain View, California-based DNAnexus declined to comment further.
According to Collins, DNAnexus will mostly be offering the kind of analytics tools it already provides to its customers and partners, including some visualization applications. He noted that the US-based vendor has been working with Regeneron for years, including for analysis of UK Biobank exome data.
"Although it's being funded with the focus on sequence data, we and they are very keen to make sure it has packages on it that nongenetic researchers use a lot," including generic or open-source epidemiological components, Collins said. DNAnexus will add capabilities over time, including the option for researchers to bring their own analytics tools to the platform to work with UK Biobank data in a secure environment.
Until now, "they've been used to downloading the data, and there's likely to be some inertia," Collins said of existing users. That option will still be available for those with sufficient computing power in their own institutions, but it will become less practical as more whole-genome sequencing data comes online and as the database is updated.
"If we encourage new users through this arrangement with Amazon, then hopefully they will not even start with the download. They'll go straight to the platform," Collins said.
Some might offer their own open-source tools to the platform as well. "We hope that it will become something where not just UK Biobank or DNAnexus helps in developing it, but the researchers themselves who use it help to develop it," he added.
DNAnexus is less than a year into a five-year, $20 million contract from the US Food and Drug Administration to enhance the PrecisionFDA platform's capabilities for sponsor-reviewer interaction, add support for multi-omics, and provide a library of analytical, statistical, and machine-learning applications. The company has been developing the PrecisionFDA platform since its launch in 2015.
Collins expects UK Biobank to "benefit enormously" from the work DNAnexus has performed for the FDA and others. "It may well encourage the people who have been involved in these other platforms to use the UK Biobank data, as well," he said.
Collins admitted that he had not considered the possibility that some existing DNAnexus users might discover UK Biobank data for the first time, but said that he is optimistic that the exposure will make the database more popular.
Regardless, he expressed excitement about expanding the user base because it will encourage creative applications of the data.
UK Biobank simply requires that users undertake health-related research that is in the public interest and that they follow procedures to safeguard the identities of sample donors. Beyond that, they are free to be as innovative as possible.
"What we've seen is people being extraordinarily imaginative. I think that's why I'm very keen that we increase the range of researchers who use the data because different researchers approach the same data with very different mindsets," Collins said.
"I think that you get a lot more out of it by having people looking at it from different angles."