NEW YORK (GenomeWeb) – Researchers from academia and industry have launched the second iteration of a community challenge that aims to evaluate the performance of methods for computing securely on genomic data in remote environments such as the cloud.
The challenge, which focuses on methods of computing on encrypted data, is organized by researchers from Indiana University, the University of California, San Diego, Emory University, Vanderbilt University, and La Jolla, Calif.-based Human Longevity. It is run under the auspices of the Integrating Data for Analysis, Anonymization, and Sharing (IDASH) center at UC San Diego, one of the National Institutes of Health's National Centers for Biomedical Computing. The organizers planned and ran the first iteration of the challenge earlier this year and have submitted a paper to BMC Medical Informatics and Decision Making that describes that challenge and its results in detail.
For the second contest, dubbed the Secure Genome Analysis competition, the organizers have proposed two challenges. The first, called the secure genome-wide association study, has two sub-challenges: one dealing with homomorphic encryption, a method of encoding data as ciphertext that allows specific computations to be run on it, and one dealing with secure multiparty computation among multiple institutions.
In the first subtask, participating teams will receive two sets of genotypes, one for cases and the other for controls, covering a small number of SNPs, and they will be expected to develop a homomorphic encryption protocol to encrypt the input datasets. The protocol should be able to move the encrypted datasets to an untrusted remote server, compute the minor allele frequencies and chi-squared statistics for a given set of SNPs between the case and control groups, and decrypt the results using a privately held key. The algorithms will be tested on a single server, with performance measured in terms of computation time, space, and overhead.
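To make the target computation concrete, here is a minimal plaintext sketch in Python of the two statistics an encrypted protocol would have to reproduce for each SNP: the minor allele frequency and an allelic chi-squared test. The genotype coding (0/1/2 copies of the minor allele) and the example values are illustrative assumptions, not part of the challenge specification.

```python
# Plaintext reference for the statistics the encrypted protocol must compute.
# Genotypes are coded as 0/1/2 copies of the minor allele (an assumption).

def minor_allele_frequency(genotypes):
    """Fraction of minor alleles among all alleles in the group."""
    return sum(genotypes) / (2 * len(genotypes))

def allelic_chi_squared(cases, controls):
    """Chi-squared statistic on the 2x2 allele-count table (cases vs. controls)."""
    a = sum(cases)                 # minor alleles among cases
    b = 2 * len(cases) - a         # major alleles among cases
    c = sum(controls)              # minor alleles among controls
    d = 2 * len(controls) - c      # major alleles among controls
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

cases = [0, 1, 2, 1, 0, 2]       # made-up genotypes for one SNP
controls = [0, 0, 1, 0, 1, 0]
print(minor_allele_frequency(cases))          # 0.5
print(allelic_chi_squared(cases, controls))   # 3.0
```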
In the second track of the first challenge, participating teams will get genotype data from cases and controls that will be partitioned into multiple datasets and hosted separately at different institutions. Teams will be expected to develop a "distributed cryptographic protocol" that will aggregate the minor allele frequencies and calculate chi-squared statistics for given SNPs between the cases and controls across the institutions in a secure fashion. Evaluators will test the submitted protocols on multiple servers in different locations, all connected at a fixed bandwidth. They will use a large dataset that contains all the SNPs from the cases and controls, and will measure protocol performance in terms of overall run time and data exchange.
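The challenge does not dictate a particular construction, but additive secret sharing is one standard building block for this kind of cross-institution aggregation. The Python sketch below, with a made-up modulus and made-up local counts, shows the core idea: each site splits its allele count into random shares, and only the total can be reconstructed.

```python
import random

# Toy additive secret sharing over a public modulus (value chosen for the
# demo; a real protocol would also need secure channels between the sites).
Q = 2**61 - 1

def share(value, n_parties):
    """Split an integer into n_parties additive shares modulo Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % Q)
    return shares

# Each institution holds a local minor-allele count for the same SNP.
local_counts = [40, 25, 35]
n = len(local_counts)

# Site i sends its j-th share to party j; each party only ever sees
# uniformly random shares, never another site's actual count.
received = [[0] * n for _ in range(n)]
for i, count in enumerate(local_counts):
    for j, s in enumerate(share(count, n)):
        received[j][i] = s

partial_sums = [sum(row) % Q for row in received]
print(sum(partial_sums) % Q)  # 100: the aggregate count, and nothing more
```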
For the second main challenge of the contest, called secure two-party genome comparison without a trusted third party, teams will receive genomic datasets from two individuals. They'll be required to develop a distributed cryptographic protocol to encrypt both sets of data and to compute the Hamming distance between the encrypted datasets. They will be expected to submit a suite of programs that implement their protocol, and these will be evaluated using a large dataset containing information from additional pairs of individuals.
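The target computation itself is simple; what makes the task hard is performing it on ciphertexts. For reference, a plaintext version of the Hamming distance between two genomic strings might look like this (the sequences are illustrative):

```python
def hamming_distance(seq_a, seq_b):
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)

print(hamming_distance("ACGTACGT", "ACCTACGA"))  # 2
```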
The final details of the tasks for next year's challenge are still being hammered out, but the organizers should have things finalized by early January, Bradley Malin, an associate professor of biomedical informatics and computer science at Vanderbilt and one of the challenge organizers, told BioInform. Developers will then likely have about six weeks to get their entries in. The results will be announced at the next IDASH privacy and security workshop, which will be held next year on March 16 at UC San Diego.
Meanwhile, the results of the first challenge, called the Critical Assessment of Data Privacy and Protection (CADPP), should be published next month. The challenge opened in December last year and ran until the end of February 2014. The results were presented in March at the IDASH privacy workshop, which was hosted by UC San Diego.
The two tasks planned for that challenge asked participants to develop "privacy preserving" algorithms that could handle genomic data and a combination of genomic and demographic data, and to apply them to simulated data; the genomes were generated using data from the International HapMap Project and the Personal Genome Project. The challenge essentially involved devising methods of sharing statistics about SNPs in a way that would be robust to attempts to tie the information back to its source, Malin explained to BioInform.
According to the CADPP site, the first subchallenge focused only on the simulated genomic data and asked participants to devise algorithms that met a predefined "differential privacy criterion" and allowed users to perform computations with "the minimum level[s] of information exposure … while maintaining the maximum level of accuracy in computing results."
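The CADPP materials quoted above do not prescribe a mechanism, but the Laplace mechanism is the textbook way to meet a differential privacy criterion for count statistics. The Python sketch below is a rough illustration, not a reconstruction of any participant's method; the sensitivity of two (one person contributes at most two alleles to a count) and the epsilon value are assumptions.

```python
import random

def laplace_noise(scale):
    """Laplace(0, scale) noise, sampled as the difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_allele_count(true_count, epsilon):
    """Release an allele count under epsilon-differential privacy.

    One individual changes an allele count by at most 2, so the query's
    sensitivity is 2 and the noise scale is 2 / epsilon.
    """
    return true_count + laplace_noise(2.0 / epsilon)

print(private_allele_count(137, epsilon=1.0))  # e.g., 135.8
```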
The second CADPP challenge, which involved both genomic and demographic data, used the same criteria as the first task but also tested the algorithms on subpopulations of individuals stratified by categories such as age and gender. A total of six teams from several North American universities participated in this year's contest, submitting methods for either the first or the second task. An entry from a team at McGill University performed best in the first task, while a team from the University of Texas at Dallas took top marks in task two.
Computing on coded genomic data
As genomic data is published and shared, preventing fraudulent use and ensuring that donors' privacy is respected are major concerns for the biomedical community. These issues, in part, prompted the launch of the IDASH challenges, Malin told BioInform. A second motivator was the growing number of papers published in the last few years on using encryption to keep genomic data secure as it is used and explored for research purposes; Malin and colleagues published at least one such paper in 2008. It made sense for IDASH, which has been working to tackle issues of genomic data privacy and security, to create a forum where the community could work together on these issues, he said.
Last year, researchers from the Whitehead Institute for Biomedical Research, Baylor College of Medicine, and Tel Aviv University published a study in Science showing that it is possible to deduce the identities of participants in public sequencing projects from their de-identified genetic data. Their findings prompted the National Human Genome Research Institute and the National Institute of General Medical Sciences to relocate age information from the publicly accessible portion of the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research to a controlled-access location. But this is not the only case in which individuals have successfully tied de-identified data back to its source.
However, as omics datasets continue to grow, so do the costs of storing and computing on them locally. Moreover, siloed datasets constrain the kind of sharing and collaboration that is critical to the biomedical research enterprise. The cloud is an attractive solution, offering storage and compute power at lower cost while also supporting data sharing and fostering collaboration. The National Cancer Institute, for example, is exploring the cloud as a means of providing compute and storage for data from its funded projects. It recently awarded contracts to the Institute for Systems Biology, Seven Bridges Genomics, and the Broad Institute to develop infrastructure for its Cancer Genomics Cloud Pilots initiative.
Meanwhile, the National Institutes of Health is working on revisions to its policy for accessing and using information contained in the database of Genotypes and Phenotypes (dbGaP) to give biomedical researchers the option of analyzing data from the repository on public cloud infrastructure.
Two talks at the first Biological Data Science meeting, held last week at Cold Spring Harbor Laboratory, highlighted homomorphic encryption, one of the methods tapped for use in the IDASH challenge, as an approach that could potentially help the community compute on protected data in more public settings. Unlike other cryptographic methods, it encodes data in a way that lets third parties compute directly on the encrypted information, keeping the underlying data safely hidden.
One of the talks was given by Kristin Lauter, the head of the cryptography research group at Microsoft Research, whose group is working on homomorphic encryption approaches based on lattice cryptography. In her talk, Lauter explained how the technique could be used to securely outsource computation on protected datasets, enabling scientists to leverage remote infrastructure like the cloud for their analysis needs, and she included examples from her own research showing how the technology, at this stage of its development, can be applied to relatively small genomic datasets.
Basically, homomorphic encryption allows the owner of a dataset to encode private information as ciphertext using a public key. The user can then safely upload the data and the public key to the cloud, where it can be operated on in its encrypted form by, say, a disease prediction algorithm. The results of the computation in the cloud are returned to the owner of the data, who applies a privately held decryption key to unlock them. The defining property of the scheme is that computing on the encrypted data and then decrypting gives the same answer as running the same computation on the unencrypted data directly. As Lauter explained in her talk, the encryption process involves adding "noise" to a so-called "secret inner product," a vector in the underlying lattice, along with a randomly selected vector and the information that the user wants to protect. The decryption key lets the user remove the secret inner product, cancel out the added noise, and recover the encoded information.
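Lauter's schemes are lattice-based, but the workflow she described can be illustrated with any additively homomorphic scheme. The toy Paillier-style sketch below uses deliberately tiny, insecure parameters purely to show the flow: data is encrypted with a public key, an untrusted party adds the ciphertexts without seeing the values, and the key holder decrypts the result.

```python
import math
import random

# Toy Paillier-style additively homomorphic encryption. The primes are far
# too small to be secure; this only demonstrates the encrypt -> compute on
# ciphertext -> decrypt flow described above.
p, q = 47, 59
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)   # private key component
mu = pow(lam, -1, n)           # private key component

def encrypt(m):
    """Encrypt m < n under the public key n; ciphertexts are randomized."""
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return ((1 + m * n) * pow(r, n, n2)) % n2

def decrypt(c):
    """Recover the plaintext using the private key (lam, mu)."""
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# The "cloud" multiplies ciphertexts, which adds the hidden plaintexts.
c = encrypt(12) * encrypt(30) % n2
print(decrypt(c))  # 42
```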
Lauter's talk focused on what she described as "practical homomorphic encryption" (PHE), which allows only certain operations to be performed on the data, based on carefully selected parameters. She contrasted this with "fully homomorphic" systems, which support arbitrary computations on encrypted data but are far more expensive in computation and storage. She further noted that the PHE method may be practical for some kinds of analyses but may not be the most effective encryption approach in other cases.
This mode of encryption can be used with supervised machine learning approaches, such as a linear means classifier, as well as to run various statistical tests, among other tasks. It's an approach, Lauter said in her talk, that could potentially be employed in direct-to-patient services, for example allowing individuals to encrypt their personal genomes, make them available for computation in the cloud, and get encrypted feedback that only the owner can see once they apply the decryption key.
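A linear means classifier is a natural fit because both training and prediction reduce to sums and inner products, the operations such schemes evaluate efficiently. Here is a plaintext Python sketch of the classifier itself; the data is made up, and this is not Lauter's implementation.

```python
# Nearest-class-mean classification rewritten as one inner product plus a
# bias, the linear form that practical homomorphic encryption can evaluate.

def train(samples, labels):
    """Compute w and bias so that sign(w.x + bias) picks the nearer class mean."""
    dim = len(samples[0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for x, y in zip(samples, labels):
        counts[y] += 1
        for i, v in enumerate(x):
            sums[y][i] += v
    m0 = [v / counts[0] for v in sums[0]]
    m1 = [v / counts[1] for v in sums[1]]
    w = [a - b for a, b in zip(m1, m0)]
    bias = (sum(v * v for v in m0) - sum(v * v for v in m1)) / 2
    return w, bias

def predict(w, bias, x):
    """Class 1 if x is closer to the class-1 mean, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else 0

w, bias = train([[0, 1], [1, 0], [5, 5], [6, 4]], [0, 0, 1, 1])
print(predict(w, bias, [5, 4]))  # 1
```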
It could also potentially be used by hospitals and clinics to encrypt patient data prior to moving it to the cloud and allowing third parties to operate on it. The owner of the data in this scenario would have some policy in place for sharing the decryption key with authorized third parties and allowing them to decrypt the results.
Lauter's team has also worked on ways to encode data for more efficient computation, including methods of packing ciphertext, which essentially fit many bits into a single ciphertext.
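Real schemes pack values into polynomial coefficients or number-theoretic slots, but the intuition can be shown with plain integers: several small values occupy fixed-width bit slots of one large number, so a single addition acts on the whole vector at once. The slot width below is an arbitrary choice for the demo.

```python
SLOT_BITS = 16  # demo slot width; each packed value must stay below 2**16

def pack(values):
    """Lay values out in consecutive 16-bit slots of one integer."""
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * SLOT_BITS)
    return packed

def unpack(packed, count):
    """Read count values back out of their slots."""
    mask = (1 << SLOT_BITS) - 1
    return [(packed >> (i * SLOT_BITS)) & mask for i in range(count)]

a, b = [3, 5, 7], [10, 20, 30]
# One big-integer addition adds slot-wise, as long as no slot overflows.
print(unpack(pack(a) + pack(b), 3))  # [13, 25, 37]
```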
Lauter and her team have used their PHE scheme to encrypt and analyze data in various domains. In her CSHL presentation, Lauter focused on the results of her team's efforts to compute on encrypted genomic data. Among other things, they ran statistical tests such as the Pearson goodness-of-fit test, the Cochran-Armitage test for trend, and a linkage disequilibrium measure. Using one set of parameters, their approach computed these tests on genomic datasets in a second or less.
A second test involving larger parameters returned results in under a second for linkage disequilibrium, and in approximately 1.4 seconds and 3.6 seconds for the Pearson goodness-of-fit and Cochran-Armitage tests, respectively. Those numbers were independent of the actual size of the datasets, depending instead on the parameters the researchers selected for the computation. She also showed the results of applying the method to a sequence alignment problem using edit distance. It took 27 seconds to align a sequence of eight nucleotides, which suggests that PHE might not be the most effective encryption approach for this kind of analysis, Lauter said. A more effective alternative, she suggested, might be a multiparty computation technique.
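For reference, here is a plaintext Python version of one of those tests, the Cochran-Armitage test for trend, computed from per-genotype counts of cases and controls; the counts and the additive weights (0, 1, 2) are illustrative assumptions, not data from Lauter's experiments.

```python
def cochran_armitage(case_counts, control_counts, weights=(0, 1, 2)):
    """One-degree-of-freedom chi-squared trend statistic across genotypes.

    case_counts[i] / control_counts[i] hold the number of cases / controls
    with genotype i; weights (0, 1, 2) encode the additive genetic model.
    """
    r, s = sum(case_counts), sum(control_counts)
    n = r + s
    col = [a + b for a, b in zip(case_counts, control_counts)]
    t = sum(w * (a * s - b * r)
            for w, a, b in zip(weights, case_counts, control_counts))
    var = (r * s / n) * (n * sum(w * w * c for w, c in zip(weights, col))
                         - sum(w * c for w, c in zip(weights, col)) ** 2)
    return t * t / var

print(cochran_armitage((10, 25, 15), (30, 15, 5)))  # about 16.07
```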
A second presentation at the conference, focused on additive homomorphic encryption, described a protocol for encrypting sensitive research queries before sending them off to the relevant repositories. During the search, a user's query remains encrypted and the database returns results without ever seeing it in the clear. The work is a collaboration between researchers at Memorial Sloan Kettering Cancer Center and Japan's National Institute of Advanced Industrial Science and Technology (AIST). As detailed in the conference abstract and talk, the researchers have developed cryptographic protocols that allow drug development organizations to search for similar chemical compounds in vendor libraries, and that enable users to search genotype databases for variants and annotations of interest.
In her presentation, Kana Shimizu, a senior research scientist in AIST's Computational Biology Research Center and one of the developers of the encryption protocols, said that she and her colleagues tested their chemical compound search protocol on a single-core computer, using it to query a dataset of over 1.2 million compounds in the ChEMBL database, and compared it to a protocol for secure multiparty computation called Yao's garbled circuit. A single search query using their method took about two minutes, roughly 50,000 times faster than the other approach, and the size of the data transferred back from the server was much smaller, about 12,000 times smaller, she said. The team is making this particular protocol available in a tool that will be called ChemCrypt and plans to publish a paper on the method soon, Shimizu told BioInform in an email.
They have also developed pilot protocols that let users search for variants at particular genomic positions; a paper describing these in detail is planned for a later date. In one of the protocols, users send a beacon query, essentially the specific sequence of ACGTs they are looking for, and the server responds if it holds data matching the request, Shimizu explained. Another application lets users search for consecutive SNP matches: users send a query sequence to a database containing consecutive SNPs from many genomes, all aligned by genomic position, and the server responds with a list of matching sequences.
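Shimizu did not detail the cryptographic layer in her talk, but the query-and-response shape of the beacon protocol can be sketched in plaintext as a simple presence check; in the real protocol the query would be encrypted so that the server never sees it. The variant representation below is an assumption for illustration.

```python
# Server-side view of a beacon-style presence query, with the cryptography
# omitted. The (chromosome, position, allele) format is a made-up example.
database = {("chr1", 12345, "A"), ("chr1", 67890, "T"), ("chr2", 555, "G")}

def beacon_query(chrom, pos, allele):
    """Answer only whether some genome in the database carries this allele."""
    return (chrom, pos, allele) in database

print(beacon_query("chr1", 12345, "A"))  # True
print(beacon_query("chr1", 12345, "C"))  # False
```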