NEW YORK (GenomeWeb) – A pair of researchers at Stanford University School of Medicine have published a paper in the American Journal of Human Genetics that demonstrates a technique for identifying individuals and their associated phenotype information using a querying mechanism set up to enable genomic data sharing. They also offered suggestions on how to minimize the risks associated with such identifications.
The Stanford study, conducted by Suyash Shringarpure, a postdoctoral scholar in genetics, and Carlos Bustamante, a professor of genetics, focused on the Global Alliance for Genomics and Health's use of beacons, a mechanism for sharing information about datasets housed in disparate repositories and databases, and demonstrated a very specific scenario in which the security of information contained in these resources could potentially be compromised.
Selected as one of three GA4GH driver initiatives, beacons provide a simple system for participating institutions and organizations to share very basic information with each other. Beacons are essentially servers that institutions in a network install locally and to which external users can send simple queries about the de-identified genomic data available at the site. These queries take the form, 'do you have a genome that has a T at a specific position on a specific chromosome,' and the server responds with a simple 'yes' or 'no' answer. If the variant of interest is present, the requesting researcher can then follow whatever protocols are in place to ask for additional information or more complete access to the data.
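To make the query format concrete, here is a minimal Python sketch of a toy beacon. It is not the GA4GH beacon API: the class name ToyBeacon, the method has_allele, and the example coordinates are all hypothetical and serve only to illustrate the yes-or-no exchange described above.

```python
class ToyBeacon:
    """Hypothetical in-memory beacon; not the GA4GH implementation."""

    def __init__(self, genomes):
        # Each genome is a set of (chromosome, position, allele) tuples.
        # The beacon only remembers which alleles exist, not who carries them.
        self._variants = set()
        for genome in genomes:
            self._variants.update(genome)

    def has_allele(self, chromosome, position, allele):
        # Answers "do you have a genome with this allele at this position?"
        return (chromosome, position, allele) in self._variants


# Made-up example data for two de-identified genomes.
beacon = ToyBeacon([
    {("1", 12345, "T"), ("2", 500, "G")},
    {("1", 12345, "T"), ("7", 42, "A")},
])

answer_present = beacon.has_allele("1", 12345, "T")
answer_absent = beacon.has_allele("1", 12345, "C")
```

The caller learns only a single bit per query, which is what makes the system attractive for sharing and also what the Stanford test exploits when many such bits are combined.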
Bustamante and Shringarpure came up with a likelihood-ratio test that uses beacon responses to make predictions about presence or absence of an individual's genome in a beacon database. This study builds on a previous one that showed how to use allele frequency information from genome-wide association studies to calculate the probability that a given individual was a participant in a study.
The Stanford team wanted to do something similar, but this time using only information about the presence or absence of an allele. "The question is, can we get an estimate of the frequency from this binary response?" Shringarpure explained to GenomeWeb. "It turns out that if you look at the distribution of frequency of mutations in human populations, most of them are rare. So [even though you are] querying random SNPs, you will by chance hit on SNPs that are very rare and therefore are likely to come from the person that you are querying. That's the idea. Then you can [figure out] how many SNPs you need [to determine] how confident your prediction is and what your false positive rate will be."
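The intuition about rare variants can be illustrated with a short Python calculation. This is a simplified sketch, not the model used in the paper: it assumes diploid genomes whose allele copies are independent draws at the population frequency, and the function name and the example frequencies are hypothetical. It estimates how likely a beacon is to answer 'yes' purely by chance when the person of interest is not in it.

```python
def prob_beacon_says_yes(allele_frequency, n_individuals):
    # Assumption: each of the 2 * n allele copies in the beacon carries the
    # allele independently with probability allele_frequency.
    prob_nobody_has_it = (1.0 - allele_frequency) ** (2 * n_individuals)
    return 1.0 - prob_nobody_has_it


# Illustrative values only: a common allele versus a rare one,
# in a beacon of 1,000 individuals.
common = prob_beacon_says_yes(0.05, 1000)
rare = prob_beacon_says_yes(0.0001, 1000)
```

Under these assumptions a 'yes' for the common allele is almost guaranteed and so says little, while a 'yes' for the rare allele is unlikely by chance and is therefore much stronger evidence that the queried person is among the beacon's genomes.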
The approach requires, among other things, that the person running the search has access to data on SNP positions where the individual of interest has alternate alleles as well as the genotype calls at the corresponding positions. It could also work if the user has SNP data from a close relative. According to the paper, the requester in this scenario should also know the number of individuals in the database and the site frequency spectrum of the population in the beacon.
With these datasets in hand, a requester can then query beacons using the alternate alleles in the SNP list as input. Based on the responses, they can then calculate the likelihood that the individual of interest is in the beacon and the likelihood that they are not in the database, and then compute a ratio of the two likelihoods to obtain the relevant test statistic.
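The comparison of the two likelihoods can be sketched in Python as follows. This is a generic likelihood-ratio calculation for yes-or-no responses, not the paper's exact formulation: the function name, the two probability parameters, and the numeric values are assumptions made for illustration. In practice those probabilities would have to be derived from the beacon size and the site frequency spectrum mentioned above.

```python
import math


def log_likelihood_ratio(responses, p_yes_if_absent, p_yes_if_present):
    """Return log L(person in beacon) - log L(person not in beacon).

    responses: list of booleans, one per queried SNP (True means 'yes').
    p_yes_if_absent / p_yes_if_present: assumed probability of a 'yes'
    under each hypothesis, treated here as the same for every SNP.
    """
    llr = 0.0
    for answered_yes in responses:
        if answered_yes:
            llr += math.log(p_yes_if_present) - math.log(p_yes_if_absent)
        else:
            llr += math.log(1.0 - p_yes_if_present) - math.log(1.0 - p_yes_if_absent)
    return llr


# Illustrative, made-up probabilities: a 'yes' is nearly certain if the
# person is in the beacon and only moderately likely otherwise.
P_ABSENT, P_PRESENT = 0.6, 0.99

all_yes = log_likelihood_ratio([True] * 100, P_ABSENT, P_PRESENT)
half_no = log_likelihood_ratio([True] * 50 + [False] * 50, P_ABSENT, P_PRESENT)
```

A positive value favors the hypothesis that the person is in the beacon and a negative value favors the opposite. Comparing the statistic against a threshold chosen for a target false positive rate would give the kind of test the article describes.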
According to the paper, in a simulated beacon database comprising 1,000 individuals and 500,000 SNPs, the test had more than 95 percent power to detect whether an individual was in the beacon with just 5,000 SNP queries. With another test beacon that contained data from 65 individuals drawn from the 1000 Genomes Project, the researchers showed that just 250 SNPs were enough to detect individuals in the beacon with 95 percent power.
The main risk in this scenario is the potential for discovering private phenotype information about a person in the beacon. Datasets in these resources are de-identified and the queries themselves are designed to share very limited information upfront. Researchers still have to go through whatever approval mechanisms are in place to obtain access to the data.
The problem, however, is that many genomic datasets are associated with particular disorders or diseases and are included in beacons that are associated with specific phenotypes. For example, out of nine beacons that index non-publicly available genomic data, four are tied to specific phenotypes including cardiac disease, IBD, and autism spectrum disorders, the Stanford researchers wrote. Once an individual is pegged as a member of one of these beacons, a user could make some reasonable inferences about phenotypes that might be associated with the person.
The Stanford researchers are now collaborating with GA4GH on additional protective measures for beacons. Bustamante told GenomeWeb that the GA4GH had already begun to implement some of the ideas that he and Shringarpure described in the AJHG paper, such as combining smaller beacons to boost database population sizes and obscure phenotypes, but that their study adds value to these efforts. For instance, it offers insights into how big aggregate beacons need to be to have a protective effect.
There are also some suggestions based on the study findings that are now being looked into. For example, "One of the things that we emphasized [in the paper] was that we had to have non-anonymous [queries]," Bustamante said. Users should be required to register so that beacon providers know who is running the query and where the query comes from.
In a conversation with GenomeWeb, Peter Goodhand, executive director of the GA4GH, acknowledged that under the very specific circumstances described in the paper there is a potential re-identification risk. But he stressed that in all cases beacon providers have implemented appropriate protections, in keeping with their own security policies and the guidelines established by the GA4GH, to minimize the risk of privacy breaches to the individuals who contribute to their datasets and to ensure that they share data in a secure fashion.
"It is an extreme circumstance, [but] it's a good reminder for us to find that balance between the benefits of data sharing and the need to protect privacy," he said. He also said that some of the developers of the network were aware of the theoretical potential for identifying members of the beacons, but the Stanford paper formalized the technique. Goodhand also said that the alliance has been in touch with beacon providers about the Stanford study findings and will work with any who might want to adopt additional security measures for their beacons as a result.
In a statement from the GA4GH prepared in response to the Stanford study, the alliance noted that if someone had already obtained another individual's sequences, there would be little additional benefit to learning that the same person also appears in a beacon database. But, given the potential risks of unauthorized users accessing phenotypic data, it is taking precautionary steps.
Goodhand said that the group had already begun aggregating data from multiple beacons, monitoring use, and working on implementing multiple tiers of secured data access, including requiring users to pass through an authorization process for access to more sensitive datasets. It was also in the planning stages of an information budgeting system, and will now use the math provided in the Stanford paper to come up with a more robust system, he said. This system would track the rate at which information is revealed and restrict access when the information disclosed exceeds a certain threshold to help mitigate the risk of privacy breaches.
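One way to picture such an information budget is the Python sketch below. It is a hypothetical illustration, not the system the GA4GH is planning: the class name QueryBudget, the per-query cost, and the threshold are all assumptions, and a real system would need a principled way to measure how much each answer reveals.

```python
class QueryBudget:
    """Hypothetical per-user budget on information revealed by a beacon."""

    def __init__(self, threshold):
        self.threshold = threshold  # maximum total cost allowed per user
        self.spent = {}             # user_id -> cost used so far

    def allow(self, user_id, cost):
        # Refuse the query if answering it would push the user over budget.
        used = self.spent.get(user_id, 0.0)
        if used + cost > self.threshold:
            return False
        self.spent[user_id] = used + cost
        return True


# Made-up numbers: each query costs 0.6 units against a budget of 1.0.
budget = QueryBudget(threshold=1.0)
first = budget.allow("alice", 0.6)   # within budget
second = budget.allow("alice", 0.6)  # would exceed the threshold
other = budget.allow("bob", 0.6)     # budgets are tracked per user
```

Tracking usage per user only works if queries are not anonymous, which ties this idea to the registration requirement that Bustamante described.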
Meanwhile, Shringarpure and Bustamante are exploring more benign uses for their statistical test. They believe it could be helpful for exploring mixtures of DNA in the context of forensics applications as well as in microbiome and ecological studies.