NEW YORK (GenomeWeb) – Small sets of SNPs that lie close to one another in targeted high-throughput sequence data may serve as an untapped source of forensic information, according to researchers from Yale University School of Medicine and Thermo Fisher Scientific's Human Identification Group.
In a study that appeared online in Forensic Science International: Genetics earlier this month, the team provided proof-of-principle evidence that such microhaplotype loci, composed of two to four SNPs clustered within 200 bases of sequence or less, can be used not only to identify individuals, but also to discern their ancestry and relatedness to others.
In particular, the team highlighted a few dozen informative microhaplotype loci, or microhaps, that it tracked down and validated in samples from thousands of individuals from more than 50 human populations.
Compared to other SNP-based forensic strategies that have been proposed in the past, the microhap sequencing method makes it possible to get information on identity, ancestry, and kinship, while simultaneously seeing DNA mixtures that may occur within a given sample, the study's first author Kenneth Kidd, a Yale University genetics researcher, told In Sequence.
He and his colleagues are continuing to search for and refine a set of forensically useful microhaps in the hopes that targeted microhap sequencing may one day replace or complement existing forensic techniques that rely on short tandem repeat-based Combined DNA Index System (CODIS) markers.
For his part, Kidd is keen to see the development of targeted sequencing panels that include microhaplotype sites — alongside as many CODIS loci as possible — as a way of continuing to use the FBI's current database of offender CODIS marker profiles, while simultaneously moving towards a more SNP-centered identification scheme.
"The forensics community is emotionally locked into the tandem repeat because of the huge database of offenders," Kidd said.
Nevertheless, he noted that forensic analysis of CODIS markers in DNA from a person of interest often doesn't lead to hits with profiles in the database. Although CODIS loci can offer kinship clues, short tandem repeat polymorphisms typically don't provide much of a look at individuals' physical features or ancestry.
"The standard CODIS markers — and even the expanded set that's projected to be implemented soon — provide virtually no information on ancestry because all populations have most, if not all, of the alleles," Kidd said.
Moreover, he argued that the capillary electrophoresis technology usually used to assess short tandem repeat polymorphisms in the CODIS database is approaching obsolescence.
Anticipated technological shifts, coupled with the types of information that can be gleaned from SNP and/or sequence data, have prompted interest in standardized SNP-based identification methods.
Still, the notion of incorporating SNPs into forensic assessments and databases raises questions about the type of variants that should be considered and the most convenient, cost-effective methods for interrogating them in a forensics setting.
Initially, Kidd said, there was a thought of using arrays alongside traditional capillary electrophoresis experiments to generate supplementary SNP data while maintaining information at CODIS marker sites assessed by capillary electrophoresis.
But that idea is "almost an impossible sell," according to Kidd, due to the differences in equipment, training, and quality control considerations between array and capillary electrophoresis methods.
Instead, he advocates transitioning directly to a targeted sequencing method that provides information at CODIS sites, while introducing a newly established set of forensically informative variants.
"All of these markers can be incorporated into sequencing technology," Kidd said.
With the availability of high-throughput sequencing instruments that can produce reads in the 200-base pair range, he and his colleagues reasoned that it should be possible to see many of the short tandem repeat markers considered by CODIS. The same read lengths offer a look at SNP-based sources of identification data, too — particularly variants that make up very small haplotypes.
By finding sequences shorter than 200 bases that contain identity-, ancestry-, or phenotype-related variants, Kidd and his colleagues explained, it becomes possible to see these SNPs in their haplotype context using short-read, high-throughput sequence data.
Microhaplotype phasing in highly redundant short-read sequence data, in turn, makes it possible to pick out the presence of DNA from more than one individual, they argued, giving microhap sequencing an edge over array-based SNP profiling in forensics.
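The mixture argument rests on simple counting: a single diploid individual can carry at most two distinct haplotypes at any one locus, so three or more well-supported haplotypes among the reads point to more than one contributor. The following is a minimal sketch of that idea, not the team's method. The function names, the 5 percent read-support threshold, and the example reads are all hypothetical, and each read is assumed to have already been reduced to the string of alleles it carries at the locus's SNP positions.

```python
from collections import Counter


def supported_haplotypes(read_haplotypes, min_fraction=0.05):
    """Return haplotypes seen in at least min_fraction of reads at one locus.

    The threshold is an illustrative guard against sequencing errors,
    not a validated forensic cutoff.
    """
    counts = Counter(read_haplotypes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {hap: n for hap, n in counts.items() if n / total >= min_fraction}


def suggests_mixture(read_haplotypes, min_fraction=0.05):
    """True if more than two haplotypes pass the read-support threshold."""
    return len(supported_haplotypes(read_haplotypes, min_fraction)) > 2


# Hypothetical reads at one three-SNP microhap locus.
single_source = ["ACG"] * 48 + ["ATG"] * 50 + ["GTG"] * 2   # rare GTG treated as error
two_people = ["ACG"] * 40 + ["ATG"] * 40 + ["GTG"] * 20     # third haplotype well supported

print(suggests_mixture(single_source))
print(suggests_mixture(two_people))
```

A single locus can miss a mixture when the contributors happen to share haplotypes there, which is one reason a panel of many loci would be needed in practice.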
The team noted that future improvements to sequencing technology may make it possible to tease apart longer and more complex microhaplotypes and haplotypes as read lengths stretch out to allow alleles to be assigned to one DNA strand or another.
Kidd noted that, at the moment, whole-genome sequencing does not seem to be particularly well suited to forensic applications given the complications associated with assembling genomes.
In situations where whole genomes are available, it's possible that researchers may be able to extract other types of information, he explained, but that requires far more analysis and far more computational horsepower.
"I don't envision, given the quality and accuracy requirement in forensic applications, that it will go to whole-genome [sequencing]," Kidd said. Rather, he noted that there appears to be interest in establishing targeted sequencing panels that are specifically focused on ancestry applications.
For their FSI:Genetics study, the researchers sifted through related papers and existing sequence databases to look for informative microhaps made up of two or more SNPs contained in a couple hundred nucleotides of sequence.
That search led to hundreds of possible microhaplotype sites. The team subsequently ranked and started analyzing 50 of those candidate sites in a lymphoblastoid cell line collection produced using samples from more than 2,500 individuals from 54 populations.
In the process, the researchers verified 31 of these microhaplotype sites. Though they emphasized that the collection was not selected specifically with an eye to ancestry, the microhap set still provided some ancestry information along with insights into identity.
The group is continuing to sort through its candidate microhap collection to see if other loci provide more or less reliable information about identity and ancestry. It is also in the process of finding and validating even more microhaplotypes, with the idea that some of the sites will drop off as the markers are tested in increasing numbers of people from different populations.
Since work for the current study was completed, for example, the researchers uncovered additional microhaplotype sites containing as many as four alleles within fewer than 200 bases.
The study's authors are not currently collecting additional human samples. Rather, they hope to inspire interest in microhaplotype research by other groups, with an eye to eventually moving the forensics field in that direction.
On the forensics interpretation side, the team established a "forensic resource on genetics," or FROG, database several years ago as a prototype for using SNPs in forensic applications.
FROG taps into the group's "allele frequency database," also known as ALFRED, which tallies the frequency of alleles in various human populations, along with links to related literature, SNP databases, and molecular definitions.
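Population frequency tables of this kind are what turn an observed profile into a number. The sketch below shows one common way such frequencies could be used; it is an illustration, not a description of how FROG or ALFRED work internally. It assumes Hardy-Weinberg proportions within each population and independence between loci, and every locus name, population label, and frequency in it is made up.

```python
def genotype_frequency(hap_a, hap_b, freqs):
    """Expected genotype frequency at one locus under Hardy-Weinberg proportions."""
    p = freqs.get(hap_a, 0.0)
    q = freqs.get(hap_b, 0.0)
    return p * p if hap_a == hap_b else 2 * p * q


def profile_frequency(profile, population_freqs):
    """Multiply per-locus genotype frequencies, assuming independent loci."""
    result = 1.0
    for locus, (hap_a, hap_b) in profile.items():
        result *= genotype_frequency(hap_a, hap_b, population_freqs[locus])
    return result


# Hypothetical haplotype frequencies for two microhap loci in two populations.
frequencies = {
    "pop_A": {"mh1": {"AC": 0.5, "AT": 0.3, "GT": 0.2}, "mh2": {"CG": 0.6, "TG": 0.4}},
    "pop_B": {"mh1": {"AC": 0.1, "AT": 0.1, "GT": 0.8}, "mh2": {"CG": 0.5, "TG": 0.5}},
}

# Hypothetical phased profile: two haplotypes per locus.
profile = {"mh1": ("AC", "AT"), "mh2": ("CG", "CG")}

for population, freqs in frequencies.items():
    print(population, profile_frequency(profile, freqs))
```

A smaller profile frequency means a chance match is less likely in that population, and comparing the values across populations gives a rough sense of which ancestry the profile fits better. Loci whose haplotypes are about equally common everywhere contribute little to that comparison.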
That underlying resource has been largely unfunded and uncurated, Kidd said, though new funding earmarked for the site is expected to kick in sometime this year.
He noted that additional research, validation, and database development will be needed if microhaplotype sequencing or other SNP-based techniques are to take off in the forensics field in the US.
Though the current CODIS database is constantly updated, Kidd argued that it should theoretically be possible to freeze that resource in its existing state and begin transitioning over to a more SNP/microhaplotype-centered database to replace CODIS over the coming decades.
The cost of targeted sequencing of microhaps and/or CODIS loci in forensic samples is still up in the air, though the price tag is expected to depend on the number of sites considered and the sequencing technology available if or when the method is more widely adopted.
Kidd's team has no plans to file for patents related to the microhaplotype method or to pursue commercialization, though the group is collaborating with investigators from companies that may pursue products with forensic applications.