NEW YORK (GenomeWeb) – A team of researchers from the computer science, bioinformatics, mathematics, and human genetics departments at the University of California, Los Angeles has published the details of a protocol that they developed to identify genetically related individuals from whole-genome sequence data without compromising the privacy of the individuals in question.
"Sequencing technologies have made personal genomics possible and many companies are providing information about ancestry and health of individuals by utilizing genetic data," the researchers wrote in Bioinformatics. Currently, the standard approach that these companies use for relatedness testing is to gather and store genomic data from multiple individuals in a single database, compare pairs of sequences, and inform individuals when they find a genetic match. But this process requires individuals to share all of their genomic information which raises privacy concerns, they said. Their protocol offers an alternative mechanism for sharing data that does not require input from commercial third parties nor requires individuals to make all of their information public.
Basically, what's publicly shared is an encrypted version of each individual's genomic data, which is matched against encrypted datasets shared publicly by other individuals, Eleazar Eskin, an associate professor in UCLA's computer science and human genetics departments and an author on the paper, explained to BioInform. If the relationship between the persons in question is a close one, the system can make the connection by looking at the encrypted information alone.
The software builds on an earlier tool that was developed and published by many of the authors on the current paper along with other UCLA colleagues in Genome Research. Both methods are similar except that the software described in Genome Research works with microarray data only. It's also limited in the sense that it can only identify individuals who are first or second cousins, Eskin, who is also an author on that paper, explained.
The method described in Bioinformatics, on the other hand, makes use of whole-genome sequences, he said. It is able to use both common and rare variants, which members of the same family share, not only to identify related individuals but also to locate far more distant relatives than its predecessor can. The paper includes results using both real and simulated data from the 1000 Genomes Project, in which the researchers were able to detect up to fifth-degree cousins.
Both methods use a so-called fuzzy encryption technique, which operates much like traditional encryption and decryption protocols in which individual users have public and private keys. As the names imply, a person's public key is accessible to all other individuals, while the private key remains personal to each user.
"In the traditional protocol, we use the same private key to decrypt the message that was used to encrypt the message in the first place," but "in the 'fuzzy' encryption the two keys should be only close but not necessarily the same," the paper explains. "Thus, an individual can detect the genetic relatives by downloading the available public key for all other individuals and compare their public key with his private key." If both persons are related, the method detects the relationship without revealing any information deemed private. Furthermore, it prevents unrelated individuals from obtaining access to any private data.
That's where the similarities between the two methods end. While the version of the software described in the Genome Research paper only works with common variants, the newer model has an added encoding mechanism that makes it possible to use both common and rare variants, according to the Bioinformatics paper. That mechanism converts "each individual's haplotypes to a set of integer values such that the comparison between two sets approximate the genetic comparison between the two individuals where each individual has access only to its own variants list," they explained. This same encoding mechanism also makes it possible to compare data from individuals whose variants were called using different genome builds, according to the paper.
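One way to picture an encoding of this kind is sketched below in Python. It is a toy under stated assumptions rather than the paper's actual mechanism: the fixed-width coordinate windows, the SHA-256 hash truncated to 64 bits, the Jaccard similarity measure, and the function names are all choices made here for illustration. The sketch also does not attempt the cross-build comparison that the paper describes.

```python
from collections import defaultdict
import hashlib


def encode_haplotype(variants, window_size=1000):
    """Convert a list of (position, allele) calls into a set of integers.

    Variants are grouped into fixed-width coordinate windows (an assumption
    of this sketch), and each non-empty window is hashed to one integer.
    Two people who carry exactly the same variants in a window get the
    same integer for that window.
    """
    windows = defaultdict(list)
    for position, allele in variants:
        windows[position // window_size].append(f"{position}:{allele}")

    codes = set()
    for index, calls in windows.items():
        # Sort so the code does not depend on the order of the input list.
        text = f"{index}|" + ",".join(sorted(calls))
        raw = hashlib.sha256(text.encode()).digest()
        codes.add(int.from_bytes(raw[:8], "big"))  # keep 64 bits
    return codes


def jaccard(codes_a, codes_b):
    """Share of window codes that two individuals have in common."""
    if not codes_a and not codes_b:
        return 0.0
    return len(codes_a & codes_b) / len(codes_a | codes_b)
```

Each person needs only his or her own variant list to compute the set, and the overlap between two sets stands in for the genetic comparison: the more haplotype segments two people share, the higher the score.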
Eskin and his colleagues believe the method could interest individuals who have already had their genomes sequenced and want to explore their ancestry. They also believe that it could be a useful tool for research centers that want to combine data from their respective genetic studies but have not been able to do so because of institutional constraints. Along those lines, as part of their next steps, Eskin and his team are working on improvements to their software to better support institutional data sharing. He also said that the team will work on a mechanism that allows users to submit their data to disease-based genetic studies while maintaining their privacy.
A member of the UCLA team will present the method at this year's Intelligent Systems for Molecular Biology conference, to be held next month in Boston. Both tools will be distributed in the same software package. For now, the only tool available is the one for encrypting and comparing microarray data, but the developers will add the protocol for sequence data next week.