NEW YORK (GenomeWeb) – The physical traits predicted from genome sequence data may be sufficient to identify anonymous individuals in the absence of other information, according to a study set to appear in the Proceedings of the National Academy of Sciences this week.
After looking for links between physical phenotypes and whole-genome sequence data for more than 1,000 individuals from a range of ancestral groups, researchers from the US and Singapore took a crack at predicting biometric traits based on genetic data with the help of a newly developed algorithm. In a group of de-identified individuals, they reported, the algorithm made it possible to identify a significant proportion of individuals based on predictions of three-dimensional facial structure, ethnicity, height, weight, and other traits.
"By associating de-identified genomic data with phenotypic measurements of the contributor, this work challenges current conceptions of genomic privacy," senior author Craig Venter, of Human Longevity and the J. Craig Venter Institute, and his co-authors wrote. "It has significant ethical and legal implications on personal privacy, the adequacy of informed consent, the viability and value of de-identification of data, the potential for police profiling, and more."
While links between genome sequences and physical features and/or disease are necessary for tapping into the potential benefits of personalized medicine and other genomic applications, the team explained, more refined trait prediction methods are also expected to heighten the need for data safeguards.
Indeed, the authors argued that current genetic privacy systems are "fragmented" and "may prove insufficient" as methods for matching sequences and allele patterns to individuals improve.
To get an idea of how far off phenotype-from-genotype predictions may be, Venter and his colleagues began by examining ties between whole-genome sequences and physical phenotypes using statistical models for estimating three-dimensional facial structure, voice features, biological age, height, weight, body mass index, eye color, skin color, baldness, and/or hair color in 1,061 individuals from African, European, Latino, East Asian, South Asian, and other ancestry groups.
When they applied individual prediction models and a consolidated learning model to the genome sequences, which were covered to 30-fold depth on average, the researchers found that genetically simple traits such as eye or skin color could be predicted quite accurately. But other traits could be teased out, too.
Based on somatic mutations, mosaic sex chromosome loss, and telomere shrinkage, for example, the team could begin making age estimates for the individuals. From SNP patterns in the genomes, meanwhile, the researchers made relatively robust height predictions, though weight and BMI were trickier to tease out. The genome sequence data also made it possible to make face shape and vocal feature predictions that, in turn, provided further clues about an individual's age, sex, and ancestry.
From there, the researchers came up with an approach for integrating the predictive information from genome sequences and used what they called a maximum entropy algorithm to try to match individuals to genome sequences via their predicted phenotypes.
In a randomly selected group of 100 individuals, they reported, it was possible to go from a genome sequence to a subgroup of 10 people containing the sequenced individual roughly 88 percent of the time, inching closer to individual identification from genomic data.
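To make the "subgroup of 10 out of 100" figure concrete, the sketch below shows one way such a top-k match rate could be computed. It is not the study's maximum entropy algorithm: it swaps in a plain nearest-neighbor ranking on synthetic trait vectors, and every name in it (`rank_candidates`, `top_k_hit_rate`, `simulate`, the noise level, the number of traits) is a hypothetical choice made for illustration only.

```python
import random


def rank_candidates(predicted, observed_pool):
    """Return pool indices ordered from most to least similar to `predicted`.

    Similarity is plain squared Euclidean distance between trait vectors,
    an assumption made for this sketch rather than the study's method.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return sorted(range(len(observed_pool)),
                  key=lambda i: sq_dist(predicted, observed_pool[i]))


def top_k_hit_rate(predictions, observations, k):
    """Fraction of individuals whose own observed traits rank in the top k.

    predictions[i] stands in for traits predicted from genome i;
    observations[i] stands in for the measured traits of the same person.
    """
    hits = 0
    for i, predicted in enumerate(predictions):
        if i in rank_candidates(predicted, observations)[:k]:
            hits += 1
    return hits / len(predictions)


def simulate(n=100, n_traits=5, noise=1.0, seed=0):
    """Build synthetic data: observed traits plus noisy 'genomic' predictions."""
    rng = random.Random(seed)
    observations = [[rng.gauss(0, 1) for _ in range(n_traits)]
                    for _ in range(n)]
    predictions = [[t + rng.gauss(0, noise) for t in obs]
                   for obs in observations]
    return predictions, observations
```

Calling `top_k_hit_rate(*simulate(), k=10)` gives the share of synthetic individuals who land in their own genome's 10 closest matches. For comparison, a subgroup of 10 drawn at random from 100 people would contain the right person only 10 percent of the time, which is the baseline against which the reported figure of roughly 88 percent should be read.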
That knowledge may be useful in the forensics context, the authors noted. Still, they cautioned that the results raise serious questions about the appropriate use and protection of genome sequences, which are not currently protected as identifying data under the US Health Insurance Portability and Accountability Act's Safe Harbor method for de-identifying patient information.
"If conducted for unethical purposes, this approach could compromise the privacy of individuals who contributed their genomes into a database," Venter and his colleagues concluded. "Although sharing of genomic data is invaluable for research, our results suggest that genomes cannot be considered fully de-identifiable and should be shared by using appropriate levels of security and due diligence."