NEW YORK (GenomeWeb) – Scientists from consumer genomics firm 23andMe have for the first time described the Mountain View, Calif.-based company's Ancestry Composition tool in a paper.
Posted on bioRxiv, a Cold Spring Harbor Laboratory-hosted archive of life sciences-related preprints, the paper discusses 23andMe's three-tier bioinformatics model for parsing genotyping microarray data, resulting in the labeling of chromosome segments according to ancestral origin across 25 different populations worldwide.
Since the paper is in preprint stage, a 23andMe spokesperson declined to comment on its contents. The authors, however, argued that their approach offers a "flexible, robust, and easy-to-update" alternative to existing ancestral deconvolution tools.
"Our cross-validation experiments showed that Ancestry Composition achieves high precision and recall for populations separated by continental and subcontinental-scale distances," the authors noted in the paper.
The ability to identify the origin of an admixed individual's chromosome segments is paramount to consumer genomics firms looking to attract customers who are interested in their ancestral composition, especially in North America, where many people are of mixed ancestry.
Most of the major consumer genomics service providers, including Ancestry.com, Family Tree DNA, National Geographic's Genographic Project, and 23andMe, offer ancestral deconvolution, also known as biogeographical analysis, and continue to update their offerings to provide greater detail. All of these offerings are carried out using Illumina SNP genotyping microarrays.
Last year, Ancestry.com's AncestryDNA business upgraded its offering to provide ethnicity estimates based on 26 populations. The Provo, Utah-based firm had offered an ethnicity estimate based on 22 populations at the time the microarray-based service was first rolled out in 2012.
More recently, Gene by Gene's Family Tree DNA business in April upgraded the myOrigins biogeographical analysis portion of its Family Finder autosomal DNA testing service to relate its customers' ancestry to 18 different populations.
"I think that ancestry decomposition is definitely important to customers," Gene by Gene CSO David Mittelman told BioArray News this week. "We are committed to refining [myOrigins] as we get more data into our database," he said. "In addition to breaking down ancestry, we also overlay people's [matching relatives] geographically so they can see how their matches correlate," he noted.
What 23andMe's new paper offers, therefore, is additional insight into how one of the consumer genomics market's leading players carries out a core component of its service.
According to the paper, the company's approach relies on a modular, three-stage pipeline to identify the ancestral origin of chromosomal segments in admixed individuals, assuming that the microarray genotyping data has first been phased. First, preliminary ancestry assignments are obtained via support vector machines using a string kernel. Next, these assignments are processed using an internally developed autoregressive pair hidden Markov model that corrects misassignments and phasing errors. Finally, the pipeline employs isotonic regression to calibrate the assignments globally, which controls overall false positive and false negative error rates, according to the authors.
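The second and third stages can be illustrated with a minimal Python sketch. It is not 23andMe's implementation: a plain first-order hidden Markov model decoded with the Viterbi algorithm stands in for the paper's autoregressive pair HMM, phasing errors are not modeled, and the function names, the population labels, and the `p_correct` and `p_stay` parameters are all assumptions chosen for illustration. The isotonic step uses the standard pool-adjacent-violators procedure.

```python
import math


def viterbi_smooth(window_labels, populations, p_correct=0.8, p_stay=0.95):
    """Smooth noisy per-window ancestry labels with a plain first-order HMM.

    Simplified stand-in for the paper's autoregressive pair HMM.
    Requires at least two populations; p_correct and p_stay are assumed values.
    """
    n_pop = len(populations)

    def log_emit(state, observed):
        # The upstream classifier is right with probability p_correct,
        # otherwise it errs uniformly over the other populations.
        if state == observed:
            return math.log(p_correct)
        return math.log((1.0 - p_correct) / (n_pop - 1))

    def log_trans(prev, cur):
        # Ancestry usually persists from one window to the next.
        if prev == cur:
            return math.log(p_stay)
        return math.log((1.0 - p_stay) / (n_pop - 1))

    # scores[s] = best log-probability of any path ending in state s.
    scores = {s: math.log(1.0 / n_pop) + log_emit(s, window_labels[0])
              for s in populations}
    backpointers = []
    for observed in window_labels[1:]:
        new_scores, pointers = {}, {}
        for cur in populations:
            best_prev = max(populations,
                            key=lambda prev: scores[prev] + log_trans(prev, cur))
            new_scores[cur] = (scores[best_prev] + log_trans(best_prev, cur)
                               + log_emit(cur, observed))
            pointers[cur] = best_prev
        scores = new_scores
        backpointers.append(pointers)

    # Trace the best path back from the final window.
    state = max(populations, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def isotonic_fit(confidences, outcomes):
    """Pool-adjacent-violators: non-decreasing fit of outcome on confidence.

    outcomes are 1 (assignment was correct) or 0 (incorrect).
    Returns the sorted confidences and their calibrated values.
    """
    pairs = sorted(zip(confidences, outcomes))
    blocks = []  # each block holds [sum of outcomes, count]
    for _, outcome in pairs:
        blocks.append([float(outcome), 1])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [c for c, _ in pairs], fitted
```

With these assumed parameters, a single "AFR" window surrounded by five "EUR" windows on either side is smoothed away, while a run of five consecutive "AFR" windows is kept as a genuine switch. The isotonic fit maps raw confidences to observed accuracy without assuming any particular curve shape, only that higher confidence should not mean lower accuracy.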
23andMe developed its approach to overcome perceived limitations of existing approaches, such as HAPMIX and LAMP-ANC. These tools can resolve ancestry from a small number of populations with large genetic distances, such as European, West African, and Amerindian populations, but cannot provide the kind of subcontinental-level detail that interests ancestry testing customers, especially European-Americans.
Indeed, "while it is relatively uncommon for an individual to have ancestors from more than two or three populations separated by continental-scale distances, it does not appear to be uncommon for individuals to have genetic contributions from multiple populations separated by subcontinental-scale distances," the authors noted.
As part of the development of Ancestry Composition, 23andMe compiled a reference panel of almost 10,000 individuals of homogeneous ancestry, derived from a combination of several publicly available datasets and more than 8,000 individuals reporting four grandparents with the same country of origin from its database of participants.
In ensuing cross-validation experiments, the company reported that Ancestry Composition achieved "high precision and recall for labeling chromosomal segments" across more than 25 different populations worldwide.
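For readers unfamiliar with the two metrics, the sketch below shows how per-population precision and recall could be computed from labeled segments. The function name and the example labels are hypothetical and are not taken from the paper. Precision is the share of segments assigned to a population that truly belong to it; recall is the share of that population's true segments that the classifier recovered.

```python
def precision_recall(true_labels, predicted_labels, population):
    """Return (precision, recall) for one population label."""
    true_positives = sum(
        1 for truth, guess in zip(true_labels, predicted_labels)
        if truth == population and guess == population
    )
    predicted_total = sum(1 for guess in predicted_labels if guess == population)
    actual_total = sum(1 for truth in true_labels if truth == population)
    # Guard against division by zero when a label never appears.
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / actual_total if actual_total else 0.0
    return precision, recall


# Hypothetical example: four segments of known origin and their predictions.
truth = ["Finnish", "Finnish", "Iberian", "Iberian"]
guess = ["Finnish", "Iberian", "Iberian", "Iberian"]
finnish_precision, finnish_recall = precision_recall(truth, guess, "Finnish")
```

In this made-up example the only segment called Finnish really is Finnish, so precision is 1.0, but only one of the two truly Finnish segments was found, so recall is 0.5.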
At the same time, there appear to be some limits to ancestral deconvolution at the subcontinental level. The authors noted that while their classifier could distinguish Northern European from Southern European and Eastern European, and, in some circumstances, regional populations such as British and Irish, Scandinavian, and Iberian, it still exhibited poor coverage in a French-German reference population.
The authors hypothesized that either the classification was erroneously not made, or that some haplotypes are "more cosmopolitan" than others, meaning that geographically isolated populations, such as Finns or Ashkenazi Jews, may have a higher proportion of region-specific haplotypes, while highly connected central European populations may have relatively few such private haplotypes.
The authors said that it therefore may never be possible to deconvolute some ancestries by origin because "some proportion of haplotypes are fundamentally shared between populations."
In the paper, the authors claim that Ancestry Composition is the first such approach to use a recalibration method. They further claim this allows the reassignment of undetermined subcontinental populations to larger groupings, so that if a segment is classified equally into three distinct European subpopulations, the recalibration method will reassign the haplotype to a "higher level population" such as Northern European, or even just European.
Without a recalibration method, the segments would remain unassigned or be assigned to the first matching subpopulation.
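The fallback behavior described here can be sketched as a walk up a population hierarchy. The example below is a simplified illustration, not the paper's recalibration method: it uses a fixed confidence threshold instead of isotonic regression, and the hierarchy, the threshold of 0.8, and the function name `assign_label` are assumptions.

```python
# Hypothetical three-level hierarchy: subpopulation -> region -> continent.
PARENT = {
    "British and Irish": "Northern European",
    "Scandinavian": "Northern European",
    "Iberian": "Southern European",
    "Northern European": "European",
    "Southern European": "European",
}


def assign_label(leaf_probs, parent=PARENT, threshold=0.8):
    """Pick the most specific population whose total probability
    reaches the threshold; return None if no level qualifies."""
    mass = {}   # probability accumulated at each node of the hierarchy
    depth = {}  # larger value means a more specific population
    for leaf, prob in leaf_probs.items():
        # Collect the leaf and all of its ancestors.
        chain, node = [], leaf
        while node is not None:
            chain.append(node)
            node = parent.get(node)
        for position, name in enumerate(chain):
            mass[name] = mass.get(name, 0.0) + prob
            depth[name] = len(chain) - position
    candidates = [name for name, total in mass.items() if total >= threshold]
    if not candidates:
        return None
    # Prefer the deepest qualifying level, then the highest probability.
    return max(candidates, key=lambda name: (depth[name], mass[name]))
```

Under these assumptions, a segment split evenly across British and Irish, Scandinavian, and Iberian would be labeled simply European, while one split between British and Irish and Scandinavian with little Iberian signal would be labeled Northern European.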