NEW YORK (GenomeWeb) – The National Institutes of Health is working to revise the policy that governs access to and use of information contained in the Database of Genotypes and Phenotypes (dbGAP) to give researchers in the biomedical community the option to analyze data from the repository on public cloud infrastructure.
Vivien Bonazzi, currently program director for computational biology and bioinformatics at the National Human Genome Research Institute, told BioInform that the ongoing revision effort was spurred by repeated calls from the community for the NIH to change its existing policy. Bonazzi, who will soon move to a new position in the NIH's Office of Data Science, is working with individuals from the National Center for Biotechnology Information, which is responsible for maintaining dbGAP, and the policy group at the NIH's Office of the Director to draft possible changes to the policy and to consider what infrastructure needs to be in place to enable cloud use.
In the 18 months she has worked on this task, she said she has heard from a range of people in the community including "SBIR groups, folks who run businesses in the cloud, researchers with R01 grants … and extremely large sequencing centers … and everyone wants this changed." Increasingly, researchers are turning to the often more cost-effective cloud infrastructure for their analysis and storage needs, and that trend is likely to continue if sequencing datasets keep growing at current rates. Already there have been cases where researchers have used the cloud to analyze internally generated data but have been unable to incorporate the rich genotype and phenotype information contained in dbGAP into their analyses because the existing policy precludes them from doing so.
That's because dbGAP's policy does not permit data from the repository to be analyzed on infrastructure that is connected to the internet. The rationale is that keeping the internet at bay would help ensure the security of the data and protect the privacy of research participants, a problematic proposition since the cloud relies on internet access to work. What that means is that in order to enable cloud access "we've got to make some wording changes to reflect the use of the certain technology we have," Bonazzi said.
But beyond updating policy, the effort also requires infrastructure for physically moving data from dbGAP to the cloud in an encrypted fashion and, once the data is there, a way to make sure that it remains secure, Bonazzi said. The team is working on a standardized certification process that commercial and open-source cloud providers will have to pass through before researchers can use their infrastructure for dbGAP data analysis.
Currently, most researchers use infrastructure from Amazon, Google, and Microsoft, but any cloud vendor would be able to offer its services to the community provided it passes the established certification process. So far, the cloud vendors Bonazzi has spoken to are on board with having dbGAP data on their platforms and are willing to comply with whatever requirements the NIH puts in place, she told BioInform.
Bonazzi and her colleagues are also mulling guidelines that will govern how researchers can share data securely in the cloud. "We know people want to share [but] we have to be mindful of the fact that we are using human data; it's very sensitive and we have to be able to put some restrictions in the use of that" to ensure that only researchers with valid scientific reasons for accessing and using the data do so.
One option being discussed is to have individual principal investigators in a study who want to share data among themselves submit separate access applications. This works well if just a few researchers want to share data, but it would not be as efficient for a larger group of 50 researchers, for example, who might want access to multiple datasets from multiple studies. So Bonazzi and her colleagues are trying to come up with a more efficient approach for handling bulk requests. Also being discussed is whether to allow dbGAP data to be stored in the cloud.
Technology sometimes evolves much faster than policy and "I think that's what happened in this situation," Bonazzi said. "We just have to figure out how to fix it."
Furthermore, with the NIH planning to build a common framework for biomedical data, software, training materials, and other resources that will include a combination of public and private cloud resources, this is the time to amend existing policy. "People are going to want to use dbGAP in those environments … [and] we need to have permissions [in place] for the controlled access data in those environments" ahead of time, Bonazzi said. "What are the authentication processes that [we] need and how do [we] automate those in such a way that many people can get access to it but are policed correctly so that we can determine when something goes wrong?"
But the change will not happen overnight, and it's still early days. "The key thing I want to get out to the community from the NIH is we understand there is a problem," Bonazzi said. "Now we have traction … and we want to help facilitate the appropriate changes so that we can facilitate [findings] in this new technology environment." However, "we need a bit of patience from the community because it isn't just about changing words on paper. We've got to figure out some of the technology containers that we need for this. It's going to take a little bit of time."