NEW YORK (GenomeWeb News) – A new study confirms that it is possible to identify individual study participants from genome-wide association data.
In a paper appearing in the advance online edition of Nature Genetics yesterday, researchers from the National Institutes of Health and elsewhere described a likelihood-based statistical framework that uses genotype frequencies and individual genotype information to detect individuals in GWAS. In so doing, the team verified their suspicion that GWAS participants, or their close relatives, can be identified from aggregate GWAS data.
The research extends work by researchers at the Translational Genomics Research Institute and the University of California at Los Angeles, who reported in PLoS Genetics last summer that they could identify individuals from pooled genetic data, lead author Kevin Jacobs, a contract researcher with the National Cancer Institute and owner of the Maryland-based company BioInformed, told GenomeWeb Daily News.
Jacobs noted that TGen researcher David Craig, who was senior author on the PLoS Genetics paper, was involved in the current study.
The earlier research, which relied on a statistic based on SNP probe intensity or allele frequency, prompted the NIH and others to remove pooled DNA datasets from publicly accessible web sites. But, Jacobs explained, that study implied, but did not demonstrate, that a similar method could be used to identify individuals from GWAS data.
In an effort to understand and characterize the privacy risks associated with GWAS, the researchers developed a similar statistical method based on genotype frequency and used it to evaluate a hypothetical GWAS of between 5,000 and 200,000 SNPs in 1,000 cases and 1,000 controls, assuming no genotyping error.
They then tested the approach using actual GWAS data from several studies performed at NIH. The data included information on 6,733 cases and 6,871 controls genotyped at more than half a million SNPs.
Indeed, the team found that they could identify both case and control individuals when they had access to both the individual's genetic data and the aggregate GWAS data.
Nevertheless, the ability to do so depended on several factors, Jacobs explained, including the size of the group, number of individual genetic markers used, amount of genotyping error and, if evaluating a relative, the degree of relationship between that individual and the GWAS participant.
According to their calculations, though, it would take a substantial genotyping error rate to significantly decrease the power to detect an individual, Jacobs said.
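The general idea behind a genotype-frequency test of this kind can be illustrated with a small simulation. The sketch below is not the researchers' published method; it is a simplified log-likelihood ratio that asks whether a person's genotypes are better explained by a study group's aggregate genotype frequencies than by a separate reference group's. The group sizes, SNP count, allele frequency range, smoothing pseudocount, and all function names are assumptions chosen for illustration, and the simulation assumes independent SNPs with no genotyping error.

```python
import math
import random

def simulate_genotypes(n_people, allele_freqs, rng):
    """Draw genotypes (0, 1 or 2 minor-allele copies) under Hardy-Weinberg."""
    return [[(rng.random() < p) + (rng.random() < p) for p in allele_freqs]
            for _ in range(n_people)]

def genotype_frequencies(group, n_snps, pseudocount=0.5):
    """Per-SNP frequencies of genotypes 0/1/2, smoothed so none is zero."""
    freqs = []
    for j in range(n_snps):
        counts = [pseudocount] * 3
        for person in group:
            counts[person[j]] += 1
        total = sum(counts)
        freqs.append([c / total for c in counts])
    return freqs

def log_likelihood_ratio(person, study_freqs, reference_freqs):
    """Sum over SNPs of log(study frequency / reference frequency)
    for the genotype this person actually carries."""
    return sum(math.log(study_freqs[j][g] / reference_freqs[j][g])
               for j, g in enumerate(person))

rng = random.Random(0)      # fixed seed so the run is repeatable
n_snps = 3000               # hypothetical marker count
allele_freqs = [rng.uniform(0.1, 0.5) for _ in range(n_snps)]

study = simulate_genotypes(200, allele_freqs, rng)      # e.g. the case group
reference = simulate_genotypes(200, allele_freqs, rng)  # independent reference
outsiders = simulate_genotypes(50, allele_freqs, rng)   # in neither group

# Only these aggregate tables would be "published" in this toy setting.
study_freqs = genotype_frequencies(study, n_snps)
reference_freqs = genotype_frequencies(reference, n_snps)

member_scores = [log_likelihood_ratio(p, study_freqs, reference_freqs)
                 for p in study[:50]]
outsider_scores = [log_likelihood_ratio(p, study_freqs, reference_freqs)
                   for p in outsiders]

mean_member = sum(member_scores) / len(member_scores)
mean_outsider = sum(outsider_scores) / len(outsider_scores)
print(f"mean score, study members: {mean_member:.2f}")
print(f"mean score, outsiders:     {mean_outsider:.2f}")
```

In this toy setting, each study member nudges the study group's genotype frequencies slightly toward his or her own genotypes, so members tend to score higher than outsiders. Shrinking `n_snps` or enlarging the study group weakens that separation, in line with the dependence on marker number and group size that Jacobs described.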
Those involved say the current study represents the lower bound for detecting individuals in GWAS data. They predict more efficient methods may be developed in the future. For now, though, the team suggests the work will prove useful for understanding risks and ensuring participant privacy.
"In light of these developments, the policies and practices guiding genomic data sharing should continue to evolve in order to promote quality science, minimize duplicative research and merit the ongoing trust of the research subjects who consent to participate in scientific studies," the researchers concluded.
Even so, Jacobs emphasized that there are currently strict standards in place for protecting individual privacy in GWAS. For instance, he explained, aggregate GWAS data is now protected in much the same way as individual genetic data. To access it, researchers must seek approval from an ethics board, submit a research plan, and so on.
"Essentially, right now it's a very rigorous standard for accessing either of these data," Jacobs said.