NEW YORK (GenomeWeb) – Borrowing from the field of cryptography, researchers from the Massachusetts Institute of Technology's Computer Science and Artificial Intelligence Laboratory and their collaborators at Indiana University Bloomington have developed a method for querying large-scale databases from genome-wide association studies without compromising the privacy of the contributing individuals.
Described in a recent Cell Systems paper, the framework addresses both privacy and population stratification issues in GWAS, according to its developers. The so-called Preserving Privacy in Genomewide Association Studies (PrivGWAS) suite implements a technique from cryptography research called differential privacy, which adds noise, or random variation, to the results of database searches in order to confound algorithms that attempt to extract private information by running multiple searches. The suite consists of two methods, PrivSTRAT and PrivLMM, which help researchers identify SNPs highly associated with diseases of interest, as well as estimate association statistics and the number of significant SNPs in a dataset.
These methods were developed in the laboratory of Bonnie Berger, a professor of mathematics at MIT. Berger, who is the corresponding author on the Cell Systems paper, told GenomeWeb that her lab has for some time been working on "provable" methods of securely extracting useful information from genomic data held in controlled-access databases and repositories. "We thought by slightly perturbing the analysis results that maybe we can get a guarantee for research participants for the study," she said. "That's the whole idea behind differential privacy." Her lab has published a number of papers, including one published earlier this year in Bioinformatics, that discuss their efforts to securely search databases for genome-wide association studies.
Berger also told GenomeWeb that the existing differential privacy methods her lab examined could not handle the diverse ancestries present in many real-world genomic datasets, a capability that is crucial for accurate genomic analysis. That shortcoming prompted her team to come up with its own implementation of the approach.
Those efforts to enable secure data searching dovetailed with at least one other project that was going on at the time. According to Sean Simmons, a postdoctoral researcher in mathematics at MIT and first author on the new paper, other researchers in the lab were working on population stratification problems in genome-wide association studies — cases where a dataset contains information from many different populations, which can lead to false positive results. "No one had really done anything to address the overlap between the privacy problem and the issues with population stratification," he told GenomeWeb.
And that's the issue that the current implementation of the differential privacy approach seeks to address. The MIT version of the method is designed to let scientists extract some statistical information from the database, such as GWAS statistics or a list of SNPs that are correlated with a disease of interest. "In order to do that, we slightly perturb the results of the analysis," Simmons explained. "This could involve adding a bit of noise to the statistics being returned or adding a little bit of randomness in some other step of your analysis pipeline." The amount of noise added to the results depends on the strength of the privacy guarantee set by the researchers — meaning how low they set the threshold for the likelihood of leaking private information, according to the developers — as well as on the type and volume of the data.
The method provides "a mathematical guarantee of privacy, that’s the key thing," Berger stressed. Because of the added noise, there is a trade-off in the accuracy of the results, but "we can actually prove that [it] is close enough." For example, a search for statistical correlations between SNPs and a disease of interest would return a p-value perturbed by the random factor added to the results. However, it won't have changed so much as to mislead the researcher asking the question. "In practice, what we find is that if … you are interested in whether one particular SNP is related to a disease or not, usually you won't get an exact estimate of, say, a p-value or some other statistics but it will be close enough that you can usually tell if it is highly associated or not associated at all," Simmons explained. Tests described in the paper using real and synthetic datasets show that the methods perform well in terms of accuracy and runtime, and that their accuracy improves as sample sizes increase.
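The perturbation Simmons describes can be illustrated with the standard Laplace mechanism from the differential privacy literature — a generic sketch, not the PrivGWAS code itself. Noise drawn from a Laplace distribution, scaled by the query's sensitivity and a privacy parameter epsilon, is added to each released statistic; smaller epsilon means a stronger privacy guarantee and more noise, which is exactly the accuracy trade-off described above. The numeric values below are hypothetical.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw one sample from a Laplace(0, scale) distribution
    using inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def release_statistic(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Return a differentially private version of a statistic.

    The noise scale is sensitivity/epsilon: a smaller epsilon
    (stronger privacy guarantee) yields larger perturbations.
    """
    return true_value + laplace_noise(sensitivity / epsilon)

# Example: perturb a hypothetical chi-squared association statistic.
noisy = release_statistic(true_value=12.3, sensitivity=1.0, epsilon=0.5)
```

A researcher querying the database would see only `noisy`, never the exact statistic, yet for strongly associated SNPs the perturbed value typically remains close enough to support a qualitative call.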
The algorithm also corrects for issues caused by population stratification, according to the developers. By way of illustration, imagine a particular SNP is associated with being lactose intolerant, Simmons explained in a statement. "Let's say that people in East Asia are more likely to be lactose intolerant than someone in, say, Northern Europe. But also Northern Europeans tend to be taller than people from East Asia," he said. "A naive method would suggest that this particular SNP has an effect on height, but it's really a false correlation.” The MIT algorithm, the researchers explained, addresses this problem by assuming that the largest variations in a given population are the result of differences between subpopulations and filtering them out. It then homes in on the variants that remain.
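The filtering step described above resembles standard principal component correction for stratification. The following is a generic sketch with hypothetical random data, not the PrivSTRAT or PrivLMM implementation: the top principal components of the genotype matrix capture the largest, ancestry-driven variation, and projecting them out of both genotypes and phenotype removes that confounding before association testing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical genotype matrix: 100 individuals x 50 SNPs (0/1/2 allele counts)
genotypes = rng.integers(0, 3, size=(100, 50)).astype(float)
phenotype = rng.normal(size=100)

# Center columns and take the top principal components via SVD.
centered = genotypes - genotypes.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
k = 2                      # number of ancestry axes assumed to dominate
pcs = U[:, :k] * S[:k]     # top-k principal component scores per individual

# Project the ancestry axes out of both phenotype and genotypes.
proj = pcs @ np.linalg.pinv(pcs)        # orthogonal projector onto PC subspace
pheno_adj = phenotype - proj @ phenotype
geno_adj = centered - proj @ centered

# A simple per-SNP association score on the adjusted data.
scores = geno_adj.T @ pheno_adj / len(pheno_adj)
```

After the projection, the adjusted phenotype is orthogonal to the leading ancestry axes, so subpopulation structure no longer drives the per-SNP scores.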
The developers believe that this method could serve as a stopgap measure for researchers, allowing them to continue their work while they wait for the results of applications to use information in controlled access or private repositories. That application process is time-consuming, involving paperwork that can drag on for months as data access committees evaluate the validity of research plans and proposals.
Fears about data leaks and concerns about privacy are not without some basis. Previous studies have shown that aggregate genomic data such as GWAS statistics can expose private information about contributing individuals. In 2013, researchers from the Whitehead Institute for Biomedical Research and elsewhere published a study in Science that showed that they could deduce the identities of participants in public sequencing projects using publicly accessible genetic and demographic information including age and geographic location.
Their findings prompted the National Human Genome Research Institute and the National Institute of General Medical Sciences to relocate age information from the publicly accessible portion of the NIGMS Human Genetic Cell Repository to a controlled access location. At the time, the agencies also called for a dialogue between research participants, researchers, clinicians, advocacy groups, and other stakeholders focused on balancing research participants' privacy rights with the benefits of sharing data to improve research.
Other privacy-preserving methods, such as the beacons used by the Global Alliance for Genomics and Health, are also prone to leaks. In a study published last year, researchers from Stanford University School of Medicine demonstrated a statistical technique for identifying individuals and their phenotypes by querying beacons. The approach requires, among other things, that the person running the search has access to data on SNP positions where the individual of interest has alternate alleles, as well as the genotype calls at the corresponding positions. The requester also has to know the number of individuals in the database and the site frequency spectrum of the population in the beacon. With these datasets in hand, the requester can then query beacons using the alternate alleles in the SNP list as the input. Based on the responses, they can make predictions about the presence or absence of an individual's genome in a beacon database.
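The beacon attack outlined above can be sketched in miniature. This toy is an illustration of the general query strategy, not the Stanford team's actual likelihood-ratio test, which additionally weights each response by allele frequency; all data here are hypothetical.

```python
class Beacon:
    """Toy beacon: answers only whether any genome in the database
    carries a given alternate allele at a given position."""

    def __init__(self, genomes):
        # genomes: list of sets of (snp_position, alternate_allele) pairs
        self.genomes = genomes

    def query(self, position, allele):
        return any((position, allele) in g for g in self.genomes)

def membership_score(beacon, target_alleles):
    """Fraction of the target's alternate alleles the beacon reports present.

    A real attack uses allele frequencies to weight responses (rare
    variants are far more informative); this unweighted fraction is
    only a toy proxy for that statistic.
    """
    hits = sum(beacon.query(pos, alt) for pos, alt in target_alleles)
    return hits / len(target_alleles)

# Hypothetical data: the first genome in the database is the target.
db = [{(1, "A"), (7, "T"), (42, "G")}, {(3, "C"), (9, "A")}]
target = {(1, "A"), (7, "T"), (42, "G")}
outsider = {(5, "G"), (11, "C"), (13, "T")}

print(membership_score(Beacon(db), target))    # 1.0 for the member
print(membership_score(Beacon(db), outsider))  # 0.0 for the non-member
```

Even though each individual query reveals only a yes/no answer, the aggregate pattern of responses across many positions separates members from non-members, which is the core of the leak.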
Hoping to come up with more secure methods for analyzing genomic data, researchers from academia and industry launched a community challenge run by the integrating Data for Analysis, Anonymization, and Sharing (iDASH) center at the University of California, San Diego to evaluate different privacy-ensuring methods of computing on genomic data. The third challenge organized by the group focused specifically on differential privacy methods and featured entries from six teams, according to a paper published in 2014 in BMC Medical Informatics and Decision Making. Participating teams had to aggregate information such as allele frequency data in a way that preserved the privacy of the donors and did not undermine its utility for GWAS. They were also asked to publish GWAS results that hewed to specified criteria. Challenge datasets were drawn from 200 participants in the Personal Genome Project as well as from 174 participants in the HapMap project.
"Obviously adding noise does lead to some limitations so you don't want to use this [method] to make life or death decisions," Simmons said. "The idea is that this a tool that we can use to get some qualitative results from the database without going through the time-consuming process of applying for access to the raw data." So researchers could, for example, use it to check whether the dataset they are interested in actually contains information pertinent to their research before applying for access, Simmons said. "You could do some preliminary research with it and then after you go through this process of getting access you could validate that the results agree with what you get using our privacy preserving method." He and his colleagues have released a basic version of their program that others can customize to work with their databases of interest.
For their next steps, the developers are working on reducing the amount of noise that gets added to the system, Simmons told GenomeWeb. "It might be possible to use information that we can get from genetics, for example, to be able to come up with alternative definitions of privacy that will work in this case with even less or perhaps no noise," he said. They are also collaborating with researchers at Harvard Medical School to apply these methods to electronic medical records data, Berger said. She is also collaborating with researchers at the Broad Institute to find ways to protect aggregate genomic datasets that the institute plans to release.