CHICAGO – The rise of large-scale population genomics programs has forced data centers to rethink how they store, manage, and disseminate datasets of unprecedented size, and how they provide sufficient computing power to end users. Approaches include a blend of data storage types as well as both centrally managed and cloud-based data processing, as seen in two large-scale projects in the US and the UK.
In partnership with the Broad Institute and Google sister company Verily Life Sciences, Vanderbilt University Medical Center, for example, is hosting the data and research center for All of Us, the US National Institutes of Health's research program to collect, store, and disseminate health, genetic, lifestyle, and environmental data on at least 1 million US residents.
Storage needs for All of Us have ballooned beyond the original plans, as the program is now gathering not only genome sequences but also phenotypic data, including from consumer-grade wearable devices. Genomic data, however, still accounts for the vast majority of the volume.
In the early stages of All of Us, Vanderbilt, the Broad, and Verily estimated that the data center would need about 50 petabytes of storage to hold participants' data and to begin building systems needed to ingest, store, and ready the data for research. Eventually, the center will require at least 88 PB of storage capacity, according to Paul Harris, director of the Office of Research Informatics at Vanderbilt.
Harris, in consultation with Melissa Basford, director of big data support services at Vanderbilt, and Anthony Philippakis, chief data officer of the Broad, provided information to GenomeWeb via email.
The All of Us data team chose Google Cloud to host the data center because the effort involved multiple sites and entities. "With the size of the data and the number of diverse users, we felt a commercial cloud environment was the most suitable," Harris and collaborators said.
The Broad and Verily, which like Google operates under the Alphabet corporate umbrella, had already been working together to build Terra, a cloud-based bioinformatics analysis platform that also runs on Google Cloud. They provide NIH-affiliated researchers with access to Terra under both the All of Us program and the Accelerating Medicines Partnership for Parkinson's Disease program.
This previous partnership allowed those working on the All of Us data center to take advantage of existing infrastructure and expertise, but they still had to make plenty of modifications.
"Given the storage requirements by a program that wanted to recruit a million or more Americans and collect a variety of data, including genomics, we knew the compute and storage needs would be substantial," Harris said. "Working with teams from multiple organizations meant a naturally distributed environment that was conducive to cloud-based work."
"We envisioned a different kind of data sharing model" for All of Us because of the size and diversity of the datasets, he added.
Rather than transferring data to researchers for each project, All of Us decided to build a platform called the Researcher Workbench, where data stays within the All of Us research environment.
Users cannot remove individual-level data from the platform, but they can analyze this information within it. Researchers can download aggregate data and bring their own datasets into the workbench, though Harris said that these transfers so far have been small.
"This makes sharing broadly more streamlined while also adding layers of protection, as the data is not repeatedly shared across a number of different environments," Harris said.
Early results for this approach have been mixed. On the positive side, it allows for collaboration and scalability. However, Vanderbilt, the Broad, and Verily have had to increase training in cybersecurity and IT systems administration.
"The All of Us program is complex and has required a great deal of investment in modeling data collection, curation, storage, and accessibility to recruitment site personnel, NIH program officials, research teams, and the general public," Harris said. "Compute costs will also rise as needs increase for managing and transforming the data as well as for tools for using the data."
All of Us has not yet made public its usage figures, such as the number of researchers requesting data and the volume of data stored and transmitted, though Harris said the program is working on an initial release of that information. While some data from electronic health records and patient surveys has been made available, genomic data will not be ready until early 2021.
Harris said that it is probably too early to measure cost savings and research efficiency gains from the All of Us data center, though he said there is anecdotal evidence that provides confidence in the strategy.
"Connecting researchers with pre-collected data and generalizable tools supporting rapid discovery and analysis will lead to new research findings at a fraction of the cost and time required to initiate and perform traditional one-off studies," he said.
An even larger program in the UK has already released data on cost savings from migrating its data storage to a new file system.
Genomics England plans on amassing a database of 5 million genome sequences from patients in the UK's National Health Service by 2023. About 2 million of those will be from new patients, with the remainder coming from existing datasets, according to David Ardley, director of platform engineering for the program.
Since it began moving to WekaFS, a parallel distributed file system from Campbell, California-based WekaIO, a little more than a year ago, Genomics England has seen storage costs fall by 75 percent, from £52 ($64) to £13 per whole genome. It expects that figure to drop to just £2 by 2023, or 96 percent below the baseline, according to a non-peer-reviewed case study the technology vendor published earlier this year.
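Those percentages follow directly from the per-genome figures; a quick check in Python, using only the numbers quoted above:

```python
# Quick check of the quoted reductions, using only the per-genome figures above.
baseline, current, projected_2023 = 52.0, 13.0, 2.0        # GBP per whole genome

reduction_so_far = (baseline - current) / baseline          # 0.75  -> 75 percent
reduction_by_2023 = (baseline - projected_2023) / baseline  # ~0.96 -> 96 percent

print(f"{reduction_so_far:.0%}, {reduction_by_2023:.0%}")   # 75%, 96%
```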
Genomics England must scale its storage from 21 PB as of late 2018 to 140 PB over the next three to five years to support all the sequencing data and related research needs. The dataset now takes up about 40 PB of space.
The WekaFS system, managed on the Google Cloud, was installed in March 2019. Since then, Genomics England has been migrating large data files from a legacy Dell EMC Isilon system hosted by British cloud provider UKCloud.
WekaIO itself provides only about 1.3 PB of that storage, in the form of nonvolatile memory express (NVMe) devices, essentially solid-state flash memory. Underneath the WekaIO layer sits ActiveScale, a high-speed object storage platform from Western Digital, a product line the storage giant sold to storage vendor Quantum last month.
In addition to WekaIO and Western Digital, Genomics England brought in networking firm Mellanox Technologies to support the infrastructure.
A company called Nephos Technologies, a UK distributor for WekaIO, helped with vendor selection and now provides some managed services.
"We work with Nephos primarily, so we don't have a need outside of just having occasional roadmap update discussions to talk to Weka or Western Digital on a regular basis," Ardley said. That gives Genomics England a single point of contact for most vendor support.
"The architecture that we designed does allow for other object stores to be plugged in underneath, so we have that flexibility. We just haven't had a need to do that yet," he said.
Ardley said the flexibility is one reason why Genomics England decided to go with WekaIO.
"It looks like it's one system even though you've got different vendors, different object stores from different vendors. That's one of a few requirements that we had," he said.
The people on the front lines using the system don't really care who handles the storage as long as they are able to find and retrieve what they need, according to Ardley. Genomics England follows the Network File System standard, so all the files in the database are visible to an authenticated user just like they would be on a local hard drive.
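In practice, that means a researcher's tooling can treat the store like any mounted directory. A minimal sketch under that assumption; the mount point below is illustrative, not Genomics England's actual layout:

```python
# Illustrative only: storage exposed over NFS looks like an ordinary mounted
# directory to an authenticated user. The mount point below is hypothetical,
# not Genomics England's actual path.
from pathlib import Path

mount = Path("/mnt/genomes")                    # hypothetical NFS mount point
for vcf in sorted(mount.glob("*.vcf.gz"))[:5]:  # listed as if on a local disk
    size_gb = vcf.stat().st_size / 1e9
    print(f"{vcf.name}\t{size_gb:.2f} GB")
```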
This hybrid architecture is meant to balance speed and cost. "If you have to retrieve something from the [Western Digital] object store directly, relatively, it's quite slow, so it's a tradeoff between having everything flash, in which case it would be lightning quick but incredibly expensive," Ardley said.
Users do not always need to retrieve whole genomes, so those can sit in object storage until a researcher wants the entire sequence rather than a test report or just the variant calls in a relatively small VCF file.
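For the common case described here, a researcher working only with variant calls can read them straight from a compact, indexed VCF without touching the full genome. A minimal sketch using the open-source pysam library; the file name and genomic region are placeholders:

```python
# Minimal sketch: reading a handful of variant calls from a compressed,
# indexed VCF instead of retrieving the whole genome. The file name and
# genomic region are placeholders.
import pysam

with pysam.VariantFile("participant.vcf.gz") as vcf:           # hypothetical file
    for record in vcf.fetch("chr17", 43_000_000, 43_100_000):  # example region
        print(record.chrom, record.pos, record.ref, record.alts)
```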
"If you've got someone running a big report or a big analysis, then it will potentially retrieve data out of cache. If another researcher is doing a very similar thing, then they'll read from the cache and they're fine," Ardley said.
The WekaIO platform can "pre-fetch" data, so researchers can theoretically call specific information into the NVMe cache before they drill down into their studies, Ardley said, though Genomics England has not enabled that feature yet.
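The behavior Ardley describes, serving repeat reads from flash, falling back to the slower object store on a miss, and optionally staging data ahead of an analysis, is essentially a read-through cache with prefetch. A toy sketch of that pattern follows; it is not WekaIO's actual implementation.

```python
# Toy read-through cache with optional prefetch, illustrating the
# flash-over-object-store pattern described above. Not WekaIO's implementation.
import time

class TieredStore:
    def __init__(self, object_store):
        self.object_store = object_store   # slow, cheap backing tier
        self.cache = {}                    # fast, expensive NVMe-like tier

    def read(self, key):
        if key in self.cache:              # hot: served from the flash-like cache
            return self.cache[key]
        time.sleep(0.1)                    # simulate slow object-store retrieval
        value = self.object_store[key]
        self.cache[key] = value            # keep it hot for the next researcher
        return value

    def prefetch(self, keys):
        """Stage data into the cache before an analysis starts."""
        for key in keys:
            self.read(key)

store = TieredStore({"genome_A": b"...", "genome_B": b"..."})
store.prefetch(["genome_A"])               # optional staging step
store.read("genome_A")                     # fast: already in cache
store.read("genome_B")                     # slow first read, fast thereafter
```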