Customers of Ancestry Testing Services Turning to DIY Analysis Tools to Reanalyze Raw Array Data

Ancestry.com recently fulfilled an earlier pledge to make available to its customers the raw microarray data generated as part of its AncestryDNA genetic genealogy service.

As such, AncestryDNA's customers join clients of other firms and organizations that offer microarray-based ancestry testing services, such as 23andMe, Family Tree DNA, and National Geographic, in being able to reanalyze their array data using freely available online tools. It's a trend that some experts see as an "unambiguously good thing" that puts data analysis in the hands of so-called "citizen scientists," but that others believe merits caution with regard to data interpretation and privacy protection.

Ancestry.com introduced AncestryDNA last year. The service uses the Illumina HumanOmniExpress BeadChip platform to assess each sample across 700,000 markers. However, the launch was not without controversy. Some criticized the firm for not making the raw array data immediately available for customer download, though an Ancestry.com spokesperson told BioArray News at the time that it intended to introduce such a feature by the spring of 2013 and had always planned to do so (BAN 10/23/2012).

True to its word, the company began making this raw data available sometime in the past few weeks. Stephen Baloglu, Ancestry.com's director of product marketing, disclosed the raw DNA download feature in a March 24 post on the family history company's blog.

An Ancestry.com spokesperson told BioArray News this week that the company decided to make raw data available to customers because of ethical considerations. "We made the raw data available to its owners because we believe our customers should have access to their own genetic data," the spokesperson said.

At the same time, other companies and organizations that offer microarray-based genetic genealogy services also allow customers to download their data so that they can reanalyze it using other tools available online.

For example, the website GedMatch allows customers to upload their array data and parse it through four different admixture proportion analysis tools: Eurogenes, HarappaWorld, Dodecad, and the MLDP Project. Each tool recalculates a sample's admixture proportions, and can also paint chromosomes according to projected ancestral regional origin.
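As a rough illustration of what recalculating admixture proportions can involve, the sketch below estimates one person's mixing proportions by expectation-maximization while holding the reference allele frequencies fixed. It is a toy under stated assumptions, not the method used by any of the calculators named above: the function name `estimate_admixture`, the population labels PopA through PopD, the genotypes, and every frequency are invented for illustration.

```python
def estimate_admixture(genotypes, panel, iterations=200):
    """Estimate admixture proportions by EM with fixed reference frequencies.

    genotypes: list of 0/1/2 counts of a chosen allele, one per marker.
    panel: dict mapping population name -> list of that allele's frequency
           per marker (hypothetical values, strictly between 0 and 1).
    """
    names = list(panel)
    # Start from equal proportions for every reference population.
    q = {name: 1.0 / len(names) for name in names}
    for _ in range(iterations):
        totals = {name: 0.0 for name in names}
        for i, g in enumerate(genotypes):
            # A marker carries g copies of the allele and 2 - g of the other.
            for copies, has_allele in ((g, True), (2 - g, False)):
                if copies == 0:
                    continue
                weights = {
                    name: q[name]
                    * (panel[name][i] if has_allele else 1 - panel[name][i])
                    for name in names
                }
                norm = sum(weights.values())
                # E-step: share each allele copy among the populations.
                for name in names:
                    totals[name] += copies * weights[name] / norm
        # M-step: proportions are the average share over all allele copies.
        q = {name: totals[name] / (2 * len(genotypes)) for name in names}
    return q


# Invented genotypes for one person at six markers.
genotypes = [2, 2, 0, 0, 1, 1]

# Two invented reference panels describing the same six markers.
panel_one = {
    "PopA": [0.9, 0.9, 0.1, 0.1, 0.5, 0.5],
    "PopB": [0.1, 0.1, 0.9, 0.9, 0.5, 0.5],
}
panel_two = {
    "PopC": [0.95, 0.95, 0.5, 0.5, 0.5, 0.5],
    "PopD": [0.5, 0.5, 0.05, 0.05, 0.5, 0.5],
}

result_one = estimate_admixture(genotypes, panel_one)
result_two = estimate_admixture(genotypes, panel_two)
print({name: round(value, 3) for name, value in result_one.items()})
print({name: round(value, 3) for name, value in result_two.items()})
```

With these made-up numbers, the same genotypes come out as almost entirely PopA against the first panel and as an even split between PopC and PopD against the second, which is one way to see why different calculators can report different ancestry breakdowns for the same person.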

Using a tool called Dodecad K7b, for example, a user can have each of his chromosomes grouped according to seven regions of origin based on his genetic profile: South Asian, West Asian, Siberian, African, Southern, Atlantic-Baltic, and East Asian. Following analysis, the user may be told that, for instance, a third of the segments on chromosome 7 are inherited from East Asian ancestors, a result that may or may not jibe with Ancestry.com's, 23andMe's, Family Tree DNA's, or National Geographic's interpretation of his ancestry composition.
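To make the "a third of chromosome 7" kind of statement concrete, here is a minimal sketch of how painted segments could be summarized per chromosome. The segment format, coordinates, and the helper `region_shares` are hypothetical and do not reflect the output format of GedMatch or Dodecad; shares are measured by segment length, which is only one possible convention.

```python
from collections import defaultdict

# Hypothetical painted segments: (chromosome, start_bp, end_bp, region).
# Coordinates and assignments are invented for illustration only.
segments = [
    ("7", 0, 50_000_000, "East Asian"),
    ("7", 50_000_000, 120_000_000, "Atlantic-Baltic"),
    ("7", 120_000_000, 150_000_000, "West Asian"),
]


def region_shares(segments, chromosome):
    """Return each region's share of the painted length on one chromosome."""
    lengths = defaultdict(int)
    for chrom, start, end, region in segments:
        if chrom == chromosome:
            lengths[region] += end - start
    total = sum(lengths.values())
    if total == 0:
        # Nothing painted on this chromosome.
        return {}
    return {region: length / total for region, length in lengths.items()}


shares = region_shares(segments, "7")
for region, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{region}: {share:.1%}")
```

In this invented example, the East Asian segment covers 50 of the 150 painted megabases, so it accounts for one third of chromosome 7.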

While each of these do-it-yourself calculators was created by developers with computational biology or computer science backgrounds, their free availability means that people who download their array data from ancestry testing services, and may have little or no scientific background, are now free to perform their own array data analysis. And that is an "unambiguously good thing," according to one calculator developer.

"For the users it is good because they can obtain different assessments of their ancestry, so they learn to be skeptical of extraordinary or unexpected claims of any particular test, and also to be more convinced of results that recur across many different tests," said Dienekes Pontikos, author of the blog Dienekes' Anthropology World.

Dienekes Pontikos is the pseudonym of a Greek computational biologist who developed the Dodecad admixture calculators, which are freely available for download as part of the Dodecad Ancestry Project, and can be run using the statistical software program R.

But while such tools seem to put the power of data interpretation in the hands of the customers, rather than the providers, Pontikos maintains that the ability for average customers to do their own data analysis using tools such as Dodecad is also good for companies that offer such services.

"For the creators it is good because of both the motivation to improve their tools driven by competition with other test creators, and also the feedback they get from users of their tests," Pontikos told BioArray News this week.

He also said that the availability of such tools is "good for science," because a "plurality of eyes," meaning test creators and users, examine genetic data "trying to detect interesting patterns in them that might be missed by more narrowly-focused research."

As more people are involving themselves in analyzing genetic data about human ancestry, Pontikos said a "whole ecosystem of ideas" has materialized as "people try to fit their results into a broader pattern of human history." The result, he noted, is a second tier of discovery led by amateurs, one that is "less structured and more noisy in terms of ideas that don't pan out," but that is also "more dynamic, fast-paced, and democratic," and "complementary to academic research."


Doug McDonald is a professor of chemistry at the University of Illinois. For the past few years, he has offered biogeographical ancestry, or BGA, analysis to ancestry testing customers who wish to have their array data reanalyzed. Customers send McDonald a raw array data file downloaded from the provider and he responds to them with data, plots, and analyses about their possible ancestry. To date, McDonald said he has reanalyzed about 5,500 submissions from individuals requesting BGA.

McDonald told BioArray News that he became involved in BGA through his role as data curator for the Clan Donald Y-Chromosome Project. As part of that effort, he developed a variety of mathematical methods to sort the data.

Given this experience, McDonald said that DIY array analysis tools are "very much worth people's time" and that "at least some" are accurate. "DIYDodecad itself, including both the overall assessment and the one which paints the chromosomes with percentages of all the various populations, is exceedingly good," said McDonald.

At the same time, McDonald said that it sometimes takes him three attempts to explain results to those who submit their data to him, as BGA does not always meet their expectations. Some expect their results to match, say, the nationalities of their four grandparents, and are perplexed when results show links to other ancestral populations, which may in turn reflect an individual's deep ancestry.

"When looking at BGA [calculators], you must understand whether they use modern comparison populations or back-calculated ancient ones," said McDonald. "If the latter, you must expect a complicated mixture for most but not all people," he said. For example, a person could test as "90 percent English and 10 percent Mideastern and actually be Dutch" on one calculator, while another would peg the person as half German and half English. Depending on the reference data used by the calculator, both results could be accurate. "People really get bothered by this," he said.

Razib Khan, author of Discover magazine's Gene Expression blog, agreed that the "big downside" to do-it-yourself array data analysis is that "people don't interpret the results, they take them at face value."

Khan told BioArray News that while the services "always put the caveat that K=5 is not necessarily five populations," meaning that the population subgroupings used by a calculator are not definitive, "people often are confused" by their results.

"The relatedness estimates in these programs are not imprecise," said Khan, "but the dimensions used to compute the relatedness often get reified."

Another issue is that more people are seeking to reanalyze their own array data. While first adopters may have had some scientific background, the growing popularity of genetic ancestry testing services has led to a new wave of data interpretation by individuals with little or no scientific experience, according to Roberta Estes, founder of the consulting firm DNAExplain.

Founded in 2004, DNAExplain provides analysis and interpretation of genetic genealogy DNA test results in a "plain English, understandable fashion" to customers, according to its website. Estes told BioArray News that one of the "real benefits" of the emerging trend of do-it-yourself analysis is that "each person brings with them their unique perspective, education, skills and knowledge of their ancestry."

Estes herself has 30 years of experience with information systems and computer science. "Lots of other people are the same in fundamental ways," she said. "We are very fortunate that many of the front lines of the citizen science push are doctors and scientists in other specialties."

At the same time, she noted that a "second wave has started" of individuals conducting analyses who may have "little or no scientific background." On one hand, because of this, Estes said that she has seen "a lot of misunderstanding" among this second wave of amateur analysts. On the other, she has witnessed "a lot of growth and learning."

"Do some people use these tools to simply reinforce their long-held beliefs, regardless of the results? Of course, but most of the people want and seek the truth, whatever it is, of their recent and deep ancestry, and it is held in their DNA," said Estes.

In addition to using free tools like those offered through GedMatch, Estes said that there are other resources, such as her blog, DNAExplained, for people to educate themselves about analyzing their own array data.

Encouraging 'Citizen Science'

Some providers of ancestry testing services expressed similar concern about the use of DIY array data-analysis tools.

"When comparing AncestryDNA or other results to ones from online calculators, it is important to remember that across the various services there are different methods and data sets to analyze the data, so you should expect results to be different," the Ancestry.com spokesperson said. The spokesperson also urged AncestryDNA clients to be aware of the lack of privacy and security on these sites.

"AncestryDNA has spent a significant amount of time and money to successfully protect our customer's data whereas other entities may not have," the spokesperson said. That being said, the same spokesperson seemed to view the trend of reanalyzing one's array data in a positive light.

"The industry is in the early days of utilizing DNA technology and analyzing genetic material," the spokesperson said, adding that the situation with regard to biogeographical analysis today is "probably not any different" than the "early days" of software and IT development, where there were "citizen technologists doing interesting things and playing a part in the evolution of technology."

It’s a perspective shared by Spencer Wells, who leads NatGeo's Genographic Project.

"I think that the democratization of one's data analysis — for instance, through the tools on GedMatch — is great," Wells told BioArray News. "People should be empowered to play a role in the process of scientific discovery, and of course they are motivated to do this in order to find out more about themselves," he said.

While Wells said that he hasn't experimented with the available online calculators very much and "can't vouch for their accuracy," he believes that it's a "cool way for people to understand their data in more detail," and that "getting more people involved in the scientific process is definitely a good thing."

In fact, he said that NatGeo is "excited about the citizen science possibilities" for its array-based Geno 2.0 service, and intends to "build additional functionality" to facilitate greater interaction. "You just need to explain the limitations of the analyses so people don't draw erroneous conclusions," he said.

Catherine Afarian, a spokesperson for 23andMe, said that the Mountain View, Calif.-based direct-to-consumer genetics firm "encourages citizen science" as "you never know where the next big discovery or breakthrough will come from."

Afarian told BioArray News that 23andMe has tried to be "transparent" about its methodologies so that individuals can understand its reports. For its Ancestry Composition feature, she said the firm believes it has created "new technical standards" for ancestry analysis.

Like Wells, she said that 23andMe "hasn't evaluated every online tool out there and it's likely the quality of these tools ranges widely."

Still, the company is hoping to support those interested in their ancestry who would like to reanalyze their own array data. Afarian noted that 23andMe recently opened its application programming interface with the hope that "more people will build fun and fascinating new tools that help people explore and better understand their own DNA."

And Bennett Greenspan, CEO of Family Tree DNA, was similarly encouraging of customers analyzing their own data.

"The genie is out of the bottle," Greenspan told BioArray News. "These citizen scientists are smart, they don't have an agenda, and I think this is going to be a broader and broader trend," he said. Greenspan said that his "only caveat" is that those who choose to analyze their own data using independent tools "need to understand that interpretation based on different algorithms will be slightly different."

Family Tree DNA's parent company, Gene By Gene, began offering exome sequencing services last year via another subsidiary called DNA DTC. While Greenspan said that individuals analyzing their own exome data will "undoubtedly" occur in the future, the company has not yet received any exome sequencing orders connected to an interest in genetic genealogy. He also noted that DNA DTC does not offer biogeographical ancestry analysis using exome sequencing data.