Following the National Institutes of Health’s decision last week to modify its policy for accessing data from genome-wide association studies in NIH databases, the agency is currently evaluating the best way to enable research groups to post, find, use, and share the data while taking into account a new discovery that individual DNA can be identified from pooled SNP datasets.
Researchers from the Translational Genomics Research Institute (TGen) last week published a study in PLoS Genetics showing that bioinformatics methods can detect a single individual’s SNP profile in publicly available aggregate datasets, something that was thought to be impossible prior to the TGen study.
In response, NIH modified its data-access policy for GWAS data. The agency previously followed a two-tiered approach that offered open access to summary-level information and aggregate genotype data, but controlled access to individual-level genotypes and phenotypes. In line with the new policy, NIH has moved the aggregate statistics files to controlled-access status. This data is available to researchers who submit data-access request packages to NIH-designated Data Access Committees.
The policy change affects the NIH’s Database of Genotype and Phenotype, or dbGaP, and the Cancer Genetic Markers of Susceptibility database run by the National Cancer Institute. According to the NIH, other groups, including the Broad Institute of MIT and Harvard, and the Wellcome Trust Case Control Consortium, have removed aggregate data from public access as well. The University of California, Santa Cruz, Genome Bioinformatics Group has also removed the NIMH Bipolar and Wellcome Trust Case Control Consortium data sets from its Genome Browser site.
Laura Rodriguez, acting director of the National Human Genome Research Institute’s Office of Policy, Communications and Education, and senior advisor to the director for research policy, told BioInform this week that no definitive long-term decisions have been made with regard to the revised policy.
“At this point the intent is that the data as they were would not be in the open access pages,” she said, adding that NIH is considering both technology and policy issues “with regard to what the long-term solution might be.”
NIH has not determined yet whether the policy revision will be permanent, she said. “We took action as soon as the basic methodology presented in the [David] Craig paper was in fact [found to be] correct and valid,” she said, which led to the removal of files from the open-access sections of the public databases. She noted that most of this data is still available through the controlled-access sections of these databases.
In a fact sheet about the revised GWAS policy that NIH posted on its website last week, the agency notes that “although the technique has been demonstrated to work, the NIH is unaware that it has been used to compromise any information within NIH GWAS datasets.”
Nevertheless, NIH added, the TGen discovery “has important policy implications for the way the scientific community shares such pooled sets of genetic data,” particularly since some scientific journals now require researchers to make available aggregate data from GWAS studies when the results are published.
Rodriguez said that TGen brought the study to the NIH’s attention prior to publication, in order to give the agency’s researchers time to test and verify the findings on other datasets. “It was very helpful that [Craig] was in direct communication with us and he had done so before the publication came out.”
In a listserv notice this week, the UCSC Genome Bioinformatics Group said it will be “participating in discussions about the implications of these new statistical approaches and how data repositories can best supply data access for scientific research while still protecting the confidentiality of the individual participants.”
Rodriguez noted that NIH is not demanding that research institutes adhere to the policy change, which so far only applies to the databases it manages. “We just let people know about that, but we don’t control the other resources that might exist,” she said.
The statement that NIH issued last week is not to be seen as “a recommendation,” but rather as “a notification of what the NIH is doing,” she said, adding that NIH is currently discussing the data it manages as well as the institute’s role within the broader scientific community. “Exactly what course that broader discussion will take, it is too early to know at this point.”
Controlled Gate
In dbGaP there has always been a division between controlled and open access [BioInform 10-22-06]. All data in the database is still available to researchers, with the exception of a few studies that only had summary-level statistics files, said Rodriguez. These studies will be available through controlled access “as soon as possible,” but for now the files are being integrated into the established data-access policy, she said.
Because GWAS is “fast-moving,” both in terms of technology and ethics, NIH established a governance structure last year to implement and manage its policy. Elizabeth Nabel, director of the National Heart, Lung, and Blood Institute, chairs the oversight committee, which is composed of senior-level institute and center directors at NIH.
This committee made the decision to revise the data-access policy “when this new development happened,” Rodriguez said.
Rodriguez added that a working group of the advisory committee to NIH director Elias Zerhouni is also focused on GWAS policies and includes investigators from outside NIH, among them statisticians, informaticists, and geneticists, as well as “participant protection advocates.”
If NIH decides to implement controlled access for summary-level data permanently, it will be important to develop a “mechanism for providing it with as little burden as possible to researchers for research access,” she said.
Among the technological approaches being considered are various types of “data redaction,” looking at the mathematics of the new method developed by TGen, Rodriguez explained. There are different ideas about “whether you can not provide 100 percent of the allele frequency, but rather 75 percent: Does that invalidate the statistical algorithm that makes it possible to place an individual within a dataset or not?”
Jim Ostell, chief of the information engineering branch at NIH responsible for designing, developing, deploying and maintaining all the public resources at NCBI, stressed that the “big flap” is about aggregate summary data, which is increasingly available online.
As scientific studies have aggregated larger numbers of SNPs, including SNP chip data, some scientists have published them as supplementary data alongside their papers or placed summary data on their websites, he said.
“This is not something which is somehow unique to the dbGaP database or the NIH GWAS sharing policy. This affects all genetics journals and university websites.”
He stressed that NIH has not changed its policy for sharing individual-level data, which was always under controlled access. “This affects data that everyone was already sharing before NIH did anything.”
Originally, he said, TGen’s Craig was thinking of forensics applications to identify individuals using SNP data, for example in mass fatalities. His finding that it is possible to identify individuals “very accurately” because there are so many SNPs was a surprising result, Ostell noted.
“Prior expectations were that individual profiles would have to be compared one to one to confirm a match; however, this new statistical analysis can now be used to detect a profile even in pooled data,” the NIH fact sheet notes.
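To make the idea concrete, the following toy simulation sketches how a profile might be detected in pooled data. It is an illustration under stated assumptions, not a reproduction of the TGen analysis: the genotypes are randomly generated, the function names and parameters (`simulate`, `membership_t`, `n_snps`, `pool_size`) are hypothetical, and the test statistic is a simple distance comparison that asks, SNP by SNP, whether a person's genotype sits closer to the study pool's allele frequency than to a reference group's.

```python
import math
import random
import statistics


def draw_genotype(p, rng):
    """Allele dosage of one person at one SNP: 0.0, 0.5, or 1.0."""
    return (int(rng.random() < p) + int(rng.random() < p)) / 2


def draw_person(freqs, rng):
    """Simulated SNP profile: one dosage per SNP, given population frequencies."""
    return [draw_genotype(p, rng) for p in freqs]


def pool_frequencies(people):
    """Per-SNP mean allele frequency of a group (the aggregate summary data)."""
    n = len(people)
    return [sum(column) / n for column in zip(*people)]


def membership_t(person, reference, pool):
    """One-sample t statistic over D_j = |Y_j - Ref_j| - |Y_j - Pool_j|.

    D_j is positive when the person is closer to the pool than to the
    reference at SNP j. Each SNP carries almost no signal, but the
    average over many SNPs can become clearly positive for a pool member.
    """
    d = [abs(y - r) - abs(y - m) for y, r, m in zip(person, reference, pool)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


def simulate(n_snps=20000, pool_size=40, seed=1):
    """Return (t for a pool member, t for an outsider) on simulated data."""
    rng = random.Random(seed)
    # Assumed population allele frequencies, one per SNP.
    freqs = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
    study = [draw_person(freqs, rng) for _ in range(pool_size)]
    reference = [draw_person(freqs, rng) for _ in range(pool_size)]
    outsider = draw_person(freqs, rng)
    pool = pool_frequencies(study)
    ref = pool_frequencies(reference)
    return membership_t(study[0], ref, pool), membership_t(outsider, ref, pool)


if __name__ == "__main__":
    t_member, t_outsider = simulate()
    print(f"pool member: t = {t_member:.1f}")
    print(f"outsider:    t = {t_outsider:.1f}")
```

In this sketch the pool member's statistic should come out well above zero while the outsider's stays close to zero, even though only per-SNP averages of the study group are used. The signal grows with the number of SNPs and shrinks as the pool gets larger, which is consistent with Ostell's remark that identification works “because there are so many SNPs.”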
NIH responded so quickly, Ostell said, as part of being “ahead of the curve. As soon as there is any hint that you can make a connection like this, [you must] take it seriously and think about the consequences.”
“It is a signal to everybody. NIH is applying it to itself,” he said. “Nobody in the field considered that aggregate data needed to be protected.”
The controlled-access mechanism in place at dbGaP and the agency’s existing GWAS policy ensured a quick response, he said. “The [datasets] are still available, just only through controlled access.”
He said that “a number of people and journals” previously approached NCBI about hosting aggregate data “in cases where there were studies being published where the patient consent did not allow redistribution of individual-level data.”
That way, researchers would not have to place such data on their own websites, he said. “Some of the journals thought it would be a good idea because instead of it being supplementary data in their journal they could cite an accession number in a public database.”
That idea was generally not one that he and his colleagues were pushing, however. “We were busy with the individual data, which had to be handled carefully.”
Now the situation has changed, so NIH might be able to step in, Ostell said. “What’s under discussion now at NIH is, since we can do this [controlled access] and we are set up to do this, it would be a way to continue to allow people to share this data but not have it be totally public and not have it disappear either,” he said. “We actually have a solution.”
Smaller labs, for example, do not have the resources to respond to queries from around the world requesting access to data. “NIH has this infrastructure in place because they do it for grant applications,” he said. It is the method dbGaP uses to determine access privileges, to determine the legitimacy of queries and to authenticate author affiliation.
“It uses the NIH grant application infrastructure,” he said, through which a host of checks and balances are used to vouch for the scientists. “It’s a serious infrastructure to do [this task].”
Not Everyone
Not all genotype/phenotype databases are equally affected by TGen’s findings. Stanford University’s Russ Altman told BioInform this week via e-mail that Stanford’s PharmGKB database includes “anonymously-accessible aggregated data … from different populations, so that there is no way to apply the algorithms in that paper because some people contributed some SNPs, but a very very small percentage contributed to all SNPs, so statistically it's much more obfuscated.”
The whole-genome data in PharmGKB “does not generally contribute to our aggregated statistics, and so that is available only to registered users on the individual level,” he said.
“We are in OK shape,” he said. “Aggregated data does not meet the criteria that the new method requires, and whole-genome data is available only to credentialed scientists.”
While he doesn’t exclude the possibility that “something couldn't go wrong,” he added, “I don't think this new paper puts anything in any more jeopardy than it was before.”