This article has been updated to correct inaccuracies in previously reported research affiliations. Although the Whitehead Institute is affiliated with MIT, it is an independent entity and therefore separate from the institution.
Researchers from the Whitehead Institute for Biomedical Research, Baylor College of Medicine, and Tel Aviv University have published a study in Science showing that it is possible to deduce the identities of participants in public sequencing projects from de-identified genetic material.
What’s perhaps most unsettling about the study is that the researchers used genetic and demographic information that is freely available from publicly accessible internet resources.
The paper explains that the researchers recovered people’s surnames from their genomic data by profiling short tandem repeats on the Y chromosome and searching for them in recreational genetic genealogy databases. Then, by correlating the surnames with other kinds of metadata such as age and state, they were able to “triangulate the identity of the target.”
The study has already prompted the National Human Genome Research Institute and the National Institute of General Medical Sciences, both of which reviewed the findings prior to publication, to relocate age information from the publicly accessible portion of the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research to a controlled access location.
Furthermore, officials at NIGMS and NHGRI published a separate article in the same issue of Science, in which they call for a re-examination of “current paradigms for managing the identifiability of genomic and other omic-type data.”
While the results do point to “the potential for breaches of privacy in genomics studies,” they in no way provide a reason to curtail efforts to share research data, nor should they deter individuals from participating in genomics studies or submitting their data to genealogy websites, according to Yaniv Erlich, a fellow at the Whitehead Institute and the lead on the study.
The answer isn’t to close public research databases or lock information behind firewalls, he told BioInform. These resources are “extremely important and public data sharing has a lot of benefits for the genetics community, for researchers, and for society.”
He said that the researchers’ intent was to “illuminate” the privacy questions that dog genomics projects in order to spark public discussion about how to secure genomic data and how to make sure that research participants understand the risks of making their personal genetic information public before they consent to participate in research studies.
It’s also important to present a “balanced view” to research participants that highlights both the risks and the benefits of participating in genomics research, he said.
Similarly, NHGRI and NIGMS officials called for a dialogue between research participants, researchers, clinicians, advocacy groups, and other stakeholders focused on balancing research participants’ privacy rights with the societal benefits that could be gained from better research enabled by data sharing.
What's in a Name?
Erlich said his interest in data security issues dates back to his undergraduate years when he worked for a computer security company that was tasked with checking the robustness of banking systems.
He told BioInform that the idea for this particular project came from a news article about a 15-year-old boy who successfully traced his biological father online after submitting a sample to the genetic genealogy service Family Tree DNA. Data from the young man’s Y chromosome matched two individuals in the database, both of whom had last names similar to his own.
“I thought … can we do the same thing from whole genome sequencing data?” Erlich said.
Using lobSTR, an algorithm that Erlich and his colleagues developed for profiling STRs, the researchers obtained Y-STR haplotypes from samples that were submitted to the French Center for the Study of Human Polymorphisms (CEPH) study and analyzed by the University of Utah.
Because the Y chromosome is transmitted from father to son, as are surnames, there is a strong correlation between surnames and the DNA on the Y chromosome. Recognizing this correlation, genealogists and genetic genealogy companies have established publicly accessible databases that allow users to search for matching records using Y-STR alleles. These results often contain surnames along with other data such as geographical and pedigree information.
Using a method known as "surname inference," the researchers were able to discover the family names of the men by submitting Y-STR haplotypes to genetic genealogy databases.
The team then traced the individuals using the recovered surnames by seeking matches with information pulled from internet sources, such as public record search engines, obituary archives, genealogy websites, and demographic metadata from the NIGMS Human Genetic Cell Repository.
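The two steps described above, surname inference followed by narrowing with demographic metadata, can be illustrated with a deliberately simplified sketch. It is not the researchers’ method or lobSTR: it assumes exact matching of repeat counts on shared markers, whereas a real search would have to account for mutations and score its confidence. The class names, function names, thresholds, and all data values are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass(frozen=True)
class GenealogyRecord:
    """Hypothetical genealogy database entry: a surname plus a Y-STR haplotype."""
    surname: str
    haplotype: Dict[str, int]  # marker name -> repeat count


@dataclass(frozen=True)
class PublicRecord:
    """Hypothetical public-record entry with the metadata named in the article."""
    name: str
    surname: str
    age: int
    state: str


def infer_surnames(query: Dict[str, int],
                   records: Iterable[GenealogyRecord],
                   min_shared: int = 3) -> Set[str]:
    """Return surnames whose haplotype agrees with the query on every shared marker.

    Assumption: exact agreement is required, and at least `min_shared` markers
    must be present in both haplotypes for a record to count.
    """
    candidates = set()
    for record in records:
        shared = query.keys() & record.haplotype.keys()
        if len(shared) < min_shared:
            continue
        if all(query[m] == record.haplotype[m] for m in shared):
            candidates.add(record.surname)
    return candidates


def triangulate(surnames: Set[str], age: int, state: str,
                people: Iterable[PublicRecord]) -> List[PublicRecord]:
    """Keep only the public records that match a candidate surname, the age, and the state."""
    return [p for p in people
            if p.surname in surnames and p.age == age and p.state == state]
```

In this toy version, a query haplotype is first reduced to a set of candidate surnames, and that set is then intersected with age and state to shrink the pool of possible individuals. The sketch also shows why moving age information to controlled access matters: without that field, the second step has much less to filter on.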
All told, the researchers were able to identify nearly 50 men and related women in the US, according to the paper.
The researchers do not intend to disclose the names of the individuals who were identified by the study.
Meanwhile, the NIH said it informed the primary investigator for the original CEPH study collection at the University of Utah about the work done by Erlich’s team and that he, in turn, had contacted his institutional review board.
In an email, Laura Lyman Rodriguez, the director of NHGRI’s division of policy, communications, and education, told BioInform that the university’s IRB would handle any “decisions related to the participants’ interests and any possible contact in this situation.”
The Risk of Exposure
Because the approach exploits paternally transmitted genetic characteristics, the team noted that genetic data from a single individual can reveal deep genealogical ties and result in the identification of a distantly related person who may have no acquaintance with the person who released that genetic data.
That means that "if, for example, your Uncle Dave submitted his DNA to a genetic genealogy database, you could be identified," Melissa Gymrek, a member of Erlich’s lab and the first author of the Science paper, said in a statement. "In fact, even your fourth cousin Patrick, whom you've never met, could identify you if his DNA is in the database, as long as he is paternally related to you," she added.
And this risk of unwanted exposure is only going to grow, Erlich and colleagues believe.
Currently there are “at least eight databases and numerous surname project websites that collectively contain hundreds of thousands of surname-haplotype records,” they wrote.
Adding fuel to the fire, “genetic genealogy enthusiasts add thousands of records to these databases every month,” the researchers wrote. Meanwhile, “the advent of third-generation sequencing platforms with longer reads will enable even higher coverage of Y-STR markers, further strengthening the ability to link haplotypes and surnames,” they said.
So What’s the Solution?
The researchers believe that addressing this issue will require clearer data sharing policies, better education for participants about the benefits and risks of genetic studies, and legislation that guides proper use of genetic data.
Commenting on the issue of participant education, Erlich noted that the re-consent documents provided to the CEPH study individuals — asking permission for their samples to be included in the HapMap and 1000 Genomes projects — were upfront about the risks of re-identification.
In fact, Erlich and his colleagues were able to use the data precisely because the study individuals had already consented to the use of their data in spite of these risks, he said.
But such disclosure isn’t the case in every research study that uses genomic information, he said.
When volunteers consent to participate in research studies, they should be given all the facts so that they can understand “where we are right now” in terms of data security and then they can decide whether to participate or not, he said.
In terms of legislation, Erlich and his colleagues believe that the conversation needs to move beyond just protecting people’s privacy and toward policies that will guard against the misuse of genetic data, which is one of the more dominant concerns.
It would also be beneficial, he said, to develop new algorithms that can protect data without placing undue burdens on data-sharing activities.
The researchers have tried a few methods of data protection. For example, they explain in the paper that they looked into masking Y-STRs but then abandoned that idea when they realized that it’s possible to “impute” the Y-STR haplotypes from SNPs that are also on the Y chromosome.
In fact, one project has already begun exploring the association between Y-SNPs and surnames, which could make it possible to bypass masked Y-STRs, the paper states.
Another option is restricting genetic genealogy information, but the researchers believe that is an impractical solution because the data are “scattered in multiple end-user websites and genealogy mailing lists.”
“We didn’t find a method, although we tried for a few months,” Erlich said. But “I am sure there are people in the community that maybe can do better than us.”
Meanwhile, the NIH is taking steps of its own, according to NHGRI’s Rodriguez.
“This issue has been an area that we and others at the NIH have been thinking about and monitoring quite actively for some time, both through the discussions and oversight activities associated with developing and implementing the current data-sharing policy for genome-wide association studies” as well as through “more focused conversations with the community,” she said.
“We will be sharing [both Science papers] with our colleagues in these efforts,” she said, “so that we can discuss the scientific findings with them in depth and the options for moving forward with the dialog that is proposed in our policy commentary.”