NEW YORK – Researchers have developed a way to protect genetic privacy while also enabling data to be shared for functional genomic analyses.
To work out how gene expression affects various phenotypes, scientists rely on broad sharing of genomic datasets. But sharing also raises concerns about the privacy of individuals whose information is included in such datasets.
Further, because functional genomics analyses typically focus on one disease at a time, anyone whose data is included in such a study risks having their disease status or other private health information inferred. Researchers from Yale University have noted, however, that functional genomics analyses do not require variant data, meaning that variants could be "sanitized" to prevent personal and genomic data from being linked.
After illustrating how data leakage may occur from functional genomics studies, the Yale team presented a file-format manipulation approach that allows raw reads to be shared while minimizing privacy leakages, as they reported on Thursday in the journal Cell.
"We can protect individual privacy while still encouraging people to participate in genetic studies that are undeniably good for society," senior author Mark Gerstein said in a statement.
A particular security and privacy issue for these databases is linkage attacks, in which publicly available information is used to re-identify otherwise anonymous data in a separate database. In this study, Gerstein and his colleagues focused on combating a scenario in which a nefarious actor surreptitiously collects DNA from an individual, such as from a coffee cup, and cross-references it with information from an anonymized database. Matching such a DNA sample against a dataset from a study of individuals with bipolar disorder, for instance, could reveal whether a particular person has that condition.
When the researchers mimicked this scenario, using DNA samples collected from the coffee cups of consenting individuals, they found they could link those samples to the correct individuals in the database and infer their private health information. Similarly, they could use RNA-seq data from 421 individuals from the gEUVADIS project to identify those who were also in the 1000 Genomes Project and, again, uncover sensitive personal information.
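The basic logic of such a linkage attack can be illustrated with a minimal Python sketch. This is not the Yale team's method: the genotype encoding (0, 1, or 2 copies of the alternate allele), the site names, and the functions `concordance` and `link_sample` are all hypothetical, chosen only to show how a partial genotype from a stray sample might be matched against an anonymized collection.

```python
def concordance(query, candidate):
    """Fraction of sites shared by both genotype maps where the calls agree."""
    shared = [site for site in query if site in candidate]
    if not shared:
        return 0.0
    matches = sum(1 for site in shared if query[site] == candidate[site])
    return matches / len(shared)


def link_sample(query, database):
    """Return the (record_id, score) of the database entry closest to the query."""
    scores = {rid: concordance(query, geno) for rid, geno in database.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical anonymized study data: record ID -> {site: alt-allele count}.
database = {
    "anon_A": {"chr1:1000": 0, "chr1:2000": 1, "chr2:500": 2, "chr3:750": 0},
    "anon_B": {"chr1:1000": 1, "chr1:2000": 1, "chr2:500": 0, "chr3:750": 2},
    "anon_C": {"chr1:1000": 0, "chr1:2000": 0, "chr2:500": 1, "chr3:750": 1},
}

# Hypothetical partial genotype recovered from a discarded sample.
coffee_cup = {"chr1:1000": 1, "chr2:500": 0, "chr3:750": 2}

record_id, score = link_sample(coffee_cup, database)
print(record_id, score)
```

In this toy setup, a handful of genotyped sites is enough to single out one record. A real attack would have to contend with genotyping noise and far more candidates.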
To prevent such data leaks, the researchers devised a data-sanitization procedure that masks particular variants within the reads of a dataset — such as private, identifying mutations or those in linkage disequilibrium with them — by replacing them with generalized data from the human reference genome.
In particular, the researchers developed a file-format manipulation approach that converts a raw alignment file (BAM) to a privacy-preserving BAM (pBAM) file. Users can tune the approach to choose the level of data privacy they wish to impose. The removed data is then stored in a compressed file format that is kept under controlled access. This way, the researchers said, they can maintain both data privacy and utility. The approach is compatible with a range of software and pipelines, they noted.
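The masking idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the pBAM implementation: reads are treated as ungapped strings with a 0-based start position, the reference is a plain string, and `sanitize_read`, `restore_read`, and the `mask_positions` set are hypothetical names. The set of masked positions stands in for the tunable privacy level.

```python
def sanitize_read(start, seq, reference, mask_positions):
    """Replace non-reference bases at masked positions with the reference base.

    Returns the sanitized sequence and a list of (position, original_base)
    pairs, which would be kept separately under controlled access.
    """
    sanitized = []
    removed = []
    for offset, base in enumerate(seq):
        pos = start + offset
        ref_base = reference[pos]
        if pos in mask_positions and base != ref_base:
            sanitized.append(ref_base)
            removed.append((pos, base))
        else:
            sanitized.append(base)
    return "".join(sanitized), removed


def restore_read(start, sanitized_seq, removed):
    """Rebuild the original read from the sanitized read plus the removed bases."""
    bases = list(sanitized_seq)
    for pos, base in removed:
        bases[pos - start] = base
    return "".join(bases)


# Toy reference and read (hypothetical data).
reference = "ACGTACGTAC"
read_start, read_seq = 2, "GTTCGG"   # differs from the reference at positions 4 and 7
mask_positions = {4}                 # only position 4 is treated as identifying

public_seq, private_diff = sanitize_read(read_start, read_seq, reference, mask_positions)
print(public_seq, private_diff)
print(restore_read(read_start, public_seq, private_diff))
```

The read keeps its length and alignment position, so downstream tools that only need read placement still work on the public copy. Anyone holding the access-controlled difference list can reconstruct the original read.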
"As more data are released for these kinds of functional genomics studies, concerns about security and privacy shouldn't be lost," Gerstein said. "At the dawn of the Internet, people didn't realize how important their online activities would become. Now that type of digital privacy has become so important to us. If we move into an era where getting your genome sequenced becomes routine, we don't want these worries about health privacy to become dominating."