NEW YORK — Identifying individuals in genomic datasets by matching their data to publicly available photos of their faces may not be as easy as some researchers have worried, a new analysis has found.
Previously, investigators had raised alarms that knowing just a few phenotypes, such as eye color, and their associated genotypes could allow attackers to match genomic data uploaded to the public domain, or even de-identified medical genomic data, to publicly available photographs, such as those on social media, thereby identifying the person from whom the genomic data originated.
But researchers from Washington University in St. Louis and Vanderbilt University have now reported in Science Advances that the identification process may be more difficult under real-world conditions, where pictures are lower-resolution than those used in studies and phenotypes may be masked, for instance by people who dye their hair. To test the risk of reidentification, the researchers built a dataset of SNPs from OpenSNP that they tied to user images available online, along with two synthetic datasets, and found the risk to be low.
"Our findings suggest that the concerns about privacy risks to shared genomic data stemming from the attacks matching genomes to publicly published face photographs are low and relatively easy to manage to allay even the diverse privacy concerns of individuals," senior author Yevgeniy Vorobeychik from WUSTL and colleagues wrote in their paper.
The researchers generated a dataset of 126 individuals who had genomic data in the public OpenSNP database and whom they could tie to publicly posted photographs, such as a user picture on OpenSNP or a photo on another site where the individual used the same username, a set they dubbed the Real dataset. At the same time, they generated two synthetic datasets using a subset of the CelebA face image dataset and OpenSNP.
They then used deep neural network models to infer visible phenotypes from the photographs and combined those predictions with models of the relationship between those phenotypes and SNPs to score how well each image matched each genotype.
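The paper's exact models are not reproduced here, but the matching step can be illustrated with a minimal sketch, assuming one model that predicts phenotype probabilities from a face image and another that predicts the same phenotypes from SNPs; the genotype whose predicted traits best agree with the image's is the top match. All function and variable names below are hypothetical.

```python
import numpy as np

# Hypothetical illustration of phenotype-based matching, not the authors' code.
# Assume each model outputs a probability distribution over discrete phenotype
# categories (e.g., eye color in {blue, green, brown}).

def match_image_to_genotypes(image_phenotype_probs, genotype_phenotype_probs):
    """Score each candidate genotype against one image; return ranked matches.

    image_phenotype_probs: dict mapping trait name -> np.array of category
        probabilities predicted from the face image by a neural network.
    genotype_phenotype_probs: list of dicts with the same structure, one per
        candidate genotype, predicted from SNPs.
    """
    scores = []
    for candidate in genotype_phenotype_probs:
        # Log-likelihood that both models describe the same person:
        # sum over traits of log P(agreement on that trait).
        log_score = 0.0
        for trait, img_probs in image_phenotype_probs.items():
            gen_probs = candidate[trait]
            # Probability that image and genotype imply the same category,
            # treating the two predictions as independent.
            agreement = float(np.dot(img_probs, gen_probs))
            log_score += np.log(agreement + 1e-12)  # avoid log(0)
        scores.append(log_score)
    # Highest score first = best-matching genotype.
    return np.argsort(scores)[::-1]

# Example with two traits and three candidate genotypes.
image_preds = {"eye_color": np.array([0.7, 0.2, 0.1]),   # blue, green, brown
               "hair_color": np.array([0.1, 0.8, 0.1])}  # blond, brown, black
candidates = [
    {"eye_color": np.array([0.6, 0.3, 0.1]), "hair_color": np.array([0.2, 0.7, 0.1])},
    {"eye_color": np.array([0.1, 0.1, 0.8]), "hair_color": np.array([0.1, 0.2, 0.7])},
    {"eye_color": np.array([0.3, 0.4, 0.3]), "hair_color": np.array([0.3, 0.3, 0.4])},
]
print(match_image_to_genotypes(image_preds, candidates))  # [0 2 1]
```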
Within the Real dataset, they found that the risk of reidentification was low, though it varied with factors like population size. In particular, the difficulty of accurately inferring eye color from images proved a stumbling block to reidentification accuracy.
Analysis of the synthetic datasets, likewise, found eye color to be the most difficult to get right, as well as the most crucial trait for matching genotype to image.
They added, though, that if malicious attackers have access to particularly high-quality data, the reidentification risk can reach 60 percent for small populations. Generally, however, they found that the risk of identification for populations of more than 100 individuals was negligible.
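For context, an attacker guessing at random among N candidates succeeds with probability 1/N, so reported reidentification rates can be read against that baseline; a quick, hypothetical back-of-the-envelope check (not from the paper):

```python
# Compare a reported reidentification rate against the random-guess
# baseline of 1/N for a candidate pool of size N.
def lift_over_random(reported_rate: float, pool_size: int) -> float:
    """How many times better than random guessing a reported rate is."""
    baseline = 1.0 / pool_size
    return reported_rate / baseline

# A 60 percent success rate in a pool of 10 is 6x better than chance,
# while 1 percent in a pool of 100 is no better than random guessing.
print(lift_over_random(0.60, 10))    # 6.0
print(lift_over_random(0.01, 100))   # 1.0
```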
Still, the researchers also suggested a means of thwarting attackers by introducing small, nearly imperceptible perturbations into images. This, they reported, could further limit the success of reidentification efforts.
"We show that, even using imperceptible noise, we can often successfully reduce privacy risk, even if we specifically train deep neural networks to be robust to such noise," the researchers wrote. "Furthermore, adding noise that is mildly perceptible further reduces the success rate of reidentification to be no better than random guessing."
The researchers cautioned, though, that their analysis is based on currently available technology and that the wider availability of high-definition 3D photography or improved artificial intelligence could increase reidentification risks.