New Computing, Security Protocols Could Enable 'Genome Crowdsourcing' for GWAS

CHICAGO (GenomeWeb) – A computational method and security protocol devised by computer scientists and mathematicians from Massachusetts Institute of Technology and Stanford University promises to open the door to larger and more accurate genome-wide association studies.

In a research letter published this week in Nature Biotechnology, the researchers described how multiparty computation — where no single computational entity has access to complete sets of the original data — enabled what they dubbed "secure genome crowdsourcing."

This protocol allows people to "share their genomes in a blinded fashion, so people can't see the actual genome or get any information from the genome about them, about medical status, about potential problems in the future," explained corresponding author Bonnie Berger, head of the computation and biology group in the MIT Computer Science and Artificial Intelligence Laboratory. "It enables the study participants to donate their genomes in a completely, provably secure way."

The authors, including Berger, MIT graduate student Hyunghoon Cho, and Stanford grad student David Wu, were able to ensure security by employing a technique called secret sharing.

"Secret sharing allows multiple parties to collectively represent a private value that can be revealed if a prespecified number of parties combine their information, but remains hidden otherwise," according to the Nature Biotechnology letter. "Using this technique, private individuals can freely contribute their genomes to the computing parties in our GWAS protocol, without giving anyone access to the raw data."

This is achieved by assigning random numbers to multiple computing parties so that each can crunch part of the data without seeing the underlying genotype, called X for this example. The approach is a form of multiparty computation.

"Sharing all the results together would you give you a sum of all the Xs. Neither server would explicitly observe the value of any one of the Xs," Berger said. "So long as just one of the servers is trustworthy — and that server, mind you, never sees the underlying genotype data," the information cannot be stolen.

This example assumes that only addition is involved. "It turns out that we need to do multiplication for GWAS, and that's more complicated," Berger noted.
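
Before turning to multiplication, the addition-only case can be illustrated with a minimal Python sketch of additive secret sharing. It is not the authors' code: the modulus `PRIME`, the two-server setup, and the function names `share`, `reconstruct`, and `secure_sum` are assumptions made for illustration.

```python
import secrets

# Assumption for this sketch: all arithmetic is done modulo a fixed prime.
PRIME = 2**61 - 1

def share(x, n_parties=2):
    """Split x into n_parties shares that sum to x modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((x - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Combine all shares to recover the hidden value."""
    return sum(shares) % PRIME

def secure_sum(values, n_parties=2):
    """Each participant shares a value; each server adds only the shares it holds."""
    per_server = [0] * n_parties
    for x in values:
        for i, s in enumerate(share(x, n_parties)):
            per_server[i] = (per_server[i] + s) % PRIME
    # Only the combined partial sums reveal the total.
    return reconstruct(per_server), per_server

genotypes = [0, 1, 2, 1, 0, 2]  # hypothetical allele counts, one per participant
total, partial_sums = secure_sum(genotypes)
print(total)  # 6
```

Each server's partial sum looks like a random number on its own; only adding the two partial sums together yields the total, and no server ever holds an individual X.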

To address this, Cho, Wu, and Berger applied a cryptographic technique called Beaver triples, first described in a 1991 paper by computer scientist Donald Beaver. This involves assigning random numbers to three or more computing parties. The values are secretly shared, but together, the results reveal the desired product.
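
The textbook form of the Beaver-triple trick can be sketched as follows. This is a simplified single-process simulation, not the protocol from the letter: it assumes a trusted dealer who hands out shares of a random triple (a, b, c) with c = a·b, and the names `beaver_triple` and `beaver_multiply` are hypothetical.

```python
import secrets

PRIME = 2**61 - 1  # assumed modulus for this sketch

def share(x, n=2):
    """Split x into n additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

def beaver_triple(n=2):
    """Dealer step (assumed trusted): share random a, b and their product c."""
    a = secrets.randbelow(PRIME)
    b = secrets.randbelow(PRIME)
    return share(a, n), share(b, n), share((a * b) % PRIME, n)

def beaver_multiply(x_sh, y_sh):
    """Return shares of x*y given shares of x and y."""
    a_sh, b_sh, c_sh = beaver_triple(len(x_sh))
    # The parties open only the masked values d = x - a and e = y - b.
    d = reconstruct([(x - a) % PRIME for x, a in zip(x_sh, a_sh)])
    e = reconstruct([(y - b) % PRIME for y, b in zip(y_sh, b_sh)])
    # Local step: z_i = c_i + d*b_i + e*a_i, so the shares sum to ab + db + ea.
    z_sh = [(c + d * b + e * a) % PRIME for a, b, c in zip(a_sh, b_sh, c_sh)]
    # One party adds the public term d*e, completing (d + a)(e + b) = x*y.
    z_sh[0] = (z_sh[0] + d * e) % PRIME
    return z_sh

product_shares = beaver_multiply(share(6), share(7))
print(reconstruct(product_shares))  # 42
```

Because a and b are uniformly random and used once, the opened values d and e reveal nothing about x or y, yet the shares of the product can be computed locally.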

"Now, the problem is, when you're doing GWAS, you have a huge matrix. Every individual has a row of that matrix, and we need to generalize," Berger said. Her team came up with a method of generalizing collections of Beaver triples in a way that can handle matrices as large as 1 million SNPs by 1 million individuals.

"We added more complicated techniques such as another computer party where we can just compute one random seed and then separate servers can compute their random numbers from this seed so we don't have to pass around random numbers all the time," she explained. "We did all of these … clever algorithmic optimizations around the Beaver triples without having to pass so much data around."

They also applied a technique called randomized principal component analysis for population stratification to help control for different populations. "We're not really dealing with a 1 million-by-1 million matrix," Berger said. Instead, they are sampling from the matrix.
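
For readers unfamiliar with randomized PCA, the core idea can be sketched in plain NumPy, outside any secure protocol. The version below follows the standard random-projection recipe and is an assumption about the general technique, not the authors' implementation; the function name `randomized_pca` and its parameters are hypothetical.

```python
import numpy as np

def randomized_pca(X, k, oversample=10, seed=0):
    """Approximate the top-k principal components of X (individuals x SNPs)."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                      # center each SNP column
    # Project onto a small random subspace instead of working with the full matrix.
    omega = rng.standard_normal((Xc.shape[1], k + oversample))
    Y = Xc @ omega                               # individuals x (k + oversample)
    Q, _ = np.linalg.qr(Y)                       # orthonormal basis for the sketch
    B = Q.T @ Xc                                 # small (k + oversample) x SNPs matrix
    U_small, S, _ = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], S[:k]

# Hypothetical data: 200 individuals, 50 SNPs, driven by 3 hidden factors.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 50))
components, singular_values = randomized_pca(X, k=3)
print(components.shape)  # (200, 3)
```

The expensive decomposition is performed on the small matrix `B` rather than on the full genotype matrix, which is why the method scales to very large cohorts.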

"I think this is a game changer in a lot of ways," Berger said. "Previous cryptographic protocols could not scale, and they couldn't handle population stratification, which is so important for accuracy in genomic studies."

The MIT lab has released the source code for these algorithms. "We are happy to license the software to companies, but it's free for academic or nonprofit use," Berger said. "I would like the software to be used for the advancement of science."

Berger expressed a hope that this technology would unleash greater sharing of genetic information for ever-larger studies. "We want collaboration because it greatly increases the accuracy of the GWAS, and it's just so hard to get a hold of large-scale GWAS studies to inform biomedical investigation," she said.

"I foresee a distributed environment where individuals kind of donate their genomes to science, knowing that [the system] is provably secure," Berger said. This could eventually include consumer sequencing companies like 23andMe and Ancestry collecting blinded genotypes and phenotypes to make available to researchers or, better yet, such firm allowing their customers to designate whether they wanted to share blinded data.

"It opens the door to much more accurate and better GWAS studies," Berger said.