NEW YORK – Issues of data privacy and patient identifiability have concerned the genomics space for some time now, but new regulations and advancing technical capabilities could force proteomics researchers to confront these questions, as well.
While studies have shown that a person can be linked to their DNA using data on a relatively small number (around 20) of single nucleotide polymorphisms, this fact has been less relevant to proteomics, given the traditional difficulty of reliably identifying large numbers of single amino acid variants (SAAVs) in human samples and the fact that proteomics has undertaken far fewer large-scale clinical projects than genomics has.
In recent years, however, advances in mass spectrometry have improved researchers' ability to identify SAAVs while improvements in throughput and sensitivity have made large-scale clinical projects more feasible. At the same time, new regulations, most notably the EU's General Data Protection Regulation (GDPR), have strengthened requirements around data protection, including some clinical information like human genomic data that has not been irreversibly deidentified.
Given these factors, "I think we need to expand the discussion [around patient privacy] to proteomics, as well," said Gokhan Ertaylan, a principal investigator at the Flemish Institute for Technological Research (VITO). "And these discussions are currently lacking at the moment, to be perfectly honest."
Ertaylan and colleagues published a commentary in Genes in September looking at potential issues around patient privacy and sharing of clinical proteomics data.
In the article, the authors noted that proteomics information "is currently treated as non-personal data in the scientific community," including by organizations like the National Cancer Institute. This is because, historically, proteomic approaches were not fine-grained enough to identify different protein isoforms, including SAAVs and other genetic alterations, on a large scale.
For instance, antibody-based methods are generally not able to distinguish between proteins with single amino acid variations unless antibodies generated specifically for doing so are used. In the case of mass spectrometry-based approaches, traditional bottom-up experiments analyzed only small portions of the larger proteins being identified, meaning that they could not reliably detect variants or other alternative forms.
More recently, though, researchers have leveraged improvements in mass spec technology to enable de novo protein sequencing through which they are able to collect more complete amino acid sequence information for the proteins they are analyzing. In theory, this could lead to a situation like in genomics where supposedly anonymous clinical study participants can be identified by their protein variants.
Mahsa Shabani is an assistant professor of privacy law at Ghent University (UGent) who has studied the issue of genomic data and patient privacy, publishing a paper last year in EMBO Reports looking at how reidentifiability of genomic data might be handled under the GDPR.
"I think the approach that we took to study the reidentifiability of genomics data is also of interest to proteomics data, as well," she said.
Shabani said clinical proteomics researchers would also do well to consider how the data they generate could provide information about a subject's health status beyond the measures being looked at in a given study and how that data should be treated — much the way the genomics field has wrestled with questions around when and how to return incidental findings to study participants.
"I think we are moving from the point where genomics was viewed as a special case to seeing the importance of [caution around] processing omics data in general and making sure that you are aware of the privacy risk and you know that there are safeguards in place to protect the privacy of those individuals who the data belongs to," she said. "I see attention to this increasing."
"There's a general trend in the proteomics world that we want to have larger datasets, and so a lot of people are now working with big cohorts of patients," said Kurt Boonen, a researcher at the University of Antwerp and first author on the Genes editorial. "Also, the new mass specs are much more sensitive. So, combine that with the fact that we will be able to in the next several years do proteomics experiments on thousands of patients, and it makes the issue pretty pertinent."
Boonen said that proteomics researchers who regularly work with patient data are aware that patient privacy is a potential issue, but, he said, "the question is, who is going to solve it?"
He suggested that proteomics data repositories are an obvious choice to lead such work, but added that a broader effort by the larger proteomics community was probably needed.
"It's a lot of work for a repository" to tackle this issue, Boonen said. "So I think they are waiting for the proteomics community to propose an answer."
He added that another consideration is the need to standardize practices globally so that researchers can easily share data with one another throughout the world.
"Especially with Europe's strict rules [based in the GDPR], if we want to have collaborations with the US, it would be nice to have a solution that fits everyone so that we can work together without too many problems," he said.
Juan Antonio Vizcaino is the proteomics team leader at EMBL-EBI where he and his team are responsible for the PRIDE proteomics data repository.
Broadly speaking, Vizcaino said the proteomics field as a whole doesn't appear particularly worried about these questions.
"So far it is only a very, very small percentage of people who have expressed these concerns," he said.
Nonetheless, it is an issue that he and his colleagues have been thinking about. He said that he received his first inquiries about patient privacy and proteomics data a few years ago and that the implementation since then of the GDPR has made the topic especially relevant.
Last spring, the PRIDE team along with other proteomics researchers like Lennart Martens, group leader of the computational omics and systems biology group in the VIB-UGent Center for Medical Biotechnology, organized a workshop to discuss the field's current thinking on the topic.
The organizers followed that with an additional workshop in August, and Vizcaino said that he is in the process of putting together a white paper summarizing their findings.
He said that one of the challenges in addressing the issue is that there is very little research showing whether current proteomic technology could present problems with patient privacy.
"It is expected that because of the new technology this will become an issue and patients could become identifiable, but there is very little research on that," Vizcaino said. "I think that is one of the things that is needed, to have studies really assessing what is happening. Because at the moment everyone has a slightly different opinion, so we definitely need more scientific evidence."
Another issue Vizcaino said he had run into while putting together the white paper is that, of the relatively few people with expertise in both data privacy law and biological data, nearly all are focused on genomics.
"It has been very difficult to communicate the particularities of proteomics to people who are experts in this kind of policy or legislation," he said.
In the Genes editorial, Ertaylan, Boonen, and their co-authors provided an example of the sort of data privacy challenge that would impact proteomics more than genomics, noting that because experimental workflows in proteomics "are more prone to experimental errors when compared to genomics … open access to the unprocessed data is often required by scientific publishers to preserve the integrity of data quality."
Additionally, Vizcaino noted that while the EU has a shared set of guidelines in the GDPR, "every country is implementing this in a different manner," causing additional confusion. Some countries, like Sweden, are especially strict, he noted, while others are more relaxed. In fact, he said, last year PRIDE locked access to a dataset from a Swedish research group due to privacy concerns.
The group initially submitted their data to the repository but then learned from their university's data access committee that it could not be made publicly accessible, Vizcaino said.
"In the end, it can depend even on the individual university, beyond just the country," he said, adding that there was nothing particularly unique about the Swedish data set that distinguished it from other clinical proteomics data that PRIDE has made publicly accessible.
Vizcaino said that this was the only case he was aware of where PRIDE had removed a proteomics dataset due to data privacy concerns, but he noted that some groups doing clinical research might simply not submit their data in the first place.
"There could be a lot of clinical studies that don't ever get submitted because [the researchers] have these privacy concerns from the beginning," he said. "That definitely could be happening, but the number of cases is difficult to estimate."
Beyond proteomics, Vizcaino said he has received inquiries about whether patient privacy concerns might apply to clinical metabolomics data, as well.
As he noted, more research is required to determine the extent of the challenge, but in the meantime Vizcaino and his PRIDE team have begun working with colleagues at the European Genome-phenome Archive on alternative data submission practices for clinical proteomic data "in case this really becomes a huge issue," he said.
"I think this is going to be a topic that gives me a lot of headaches in the coming years," he said with a laugh.