The Genographic Project, a five-year venture led by IBM and National Geographic that kicked off in 2005, has released a massive database of standardized human mitochondrial DNA. This will serve as the foundation for the remainder of the project, which aims to study human genetic lineages by genotyping the mtDNA of hundreds of thousands of subjects.
The database, described in the June issue of PLoS Genetics, includes genotypes for 78,590 public participants and is the largest resource of its kind ever compiled, according to the study authors.
In addition, the project is releasing a software tool that it developed for classifying haplotypes into haplogroups. The tool demonstrates “superior performance over rule-based approaches, given a sufficiently large reference database,” according to the study authors.
The project is in the midst of sequencing and classifying thousands of samples contributed by participants who have bought a $99 cheek swab kit. Male samples are analyzed for a combination of male-specific Y-chromosome short tandem repeat loci and SNPs, while female samples are subjected to mtDNA genotyping, which includes sequencing of the first hypervariable segment of mtDNA, HVS-I. Female samples are also typed using a panel of 22 coding-region biallelic sites.
Participants can choose whether to donate anonymous genotyping results to the Genographic research database. Of the 78,590 mtDNA samples analyzed so far, 21,141 are “consented” and available to the broader research community.
Saharon Rosset, an IBM research scientist and co-author on the PLoS Genetics paper, told BioInform this week that the data for the paper “was actually collected in a legacy database … [which has] now been integrated.”
The primary goal of the paper, he said, “is to understand, from a scientific perspective, what is the best way to analyze mitochondria and classify it?”
Rosset said that the large genetic database helped give rise to the new haplogroup classification software.
“One contribution of this paper is the new classification methodology that makes use of the fact that we have a database of unprecedented size and use to improve the accuracy of the classification of new samples,” Rosset said.
According to the paper, the method that Rosset and colleagues developed, when used with the database, “has been shown to assign more mtDNA genomes to their correct [haplogroup] than prediction methods based on the classic set of HVS-I motifs.”
In addition, the authors applied this nearest-neighbor methodology “to published databases that are external to the Genographic Project and from various populations.” In this analysis, prediction scores ranged from 77.9 percent for a non-West Eurasian database to 93.8 percent for a database of a Western European population. These results are consistent with the authors’ expectations, they wrote, because the reference database developed for the Genographic Project is heavily weighted toward participants in the US and Western Europe.
“We expect that the best prediction scores will currently be obtained in samples of West Eurasian ancestry … and that the predictions will gradually improve for other populations as the Genographic Project progresses and worldwide samples are obtained and included in the reference database,” the authors wrote.
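For readers unfamiliar with the general idea, a nearest-neighbor classifier assigns a new sample the label of the most similar entries in a reference database. The Python sketch below illustrates only that generic technique and is not the Genographic Project's code. It assumes each sample is represented as a set of HVS-I positions that differ from a reference sequence, and it measures similarity by counting positions present in one sample but not the other. The function names, the distance measure, the haplogroup labels (`hg_alpha`, `hg_beta`), and the toy reference entries are all hypothetical and were invented for illustration.

```python
from collections import Counter


def distance(a, b):
    """Count the variant positions found in one sample but not the other."""
    return len(a ^ b)


def predict_haplogroup(sample, reference):
    """Return (label, distance) for the closest reference entries.

    `reference` is a list of (variant_set, haplogroup_label) pairs.
    If several entries tie for the smallest distance, the label that
    occurs most often among them is returned.
    """
    best = min(distance(sample, variants) for variants, _ in reference)
    votes = Counter(
        label
        for variants, label in reference
        if distance(sample, variants) == best
    )
    label, _ = votes.most_common(1)[0]
    return label, best


# Toy reference database: invented positions and labels, not real data.
REFERENCE = [
    (frozenset({16069, 16126}), "hg_alpha"),
    (frozenset({16069, 16126, 16145}), "hg_alpha"),
    (frozenset({16223, 16278}), "hg_beta"),
    (frozenset({16223, 16278, 16390}), "hg_beta"),
]

# A new sample carrying one variant that no reference entry has.
new_sample = frozenset({16069, 16126, 16311})
print(predict_haplogroup(new_sample, REFERENCE))
```

A sketch like this also shows why the size and makeup of the reference database matter: a sample from a population with few reference entries will tend to sit farther from its nearest neighbor, which makes the assigned label less reliable.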
Code for the haplogroup prediction tool is available from the Genographic Project website in two forms: as independent code that can be used with any reference database, and as an interface that allows users to upload samples to the Genographic reference database and analyze them with the software.
“Anyone can use our methodology against their database to compare and, hopefully, improve the genetic testing they did, say, with some commercial company,” Rosset said.
Aiming for 100,000 Samples
The Genographic Project’s goal is to obtain around 100,000 samples over the five-year period. On the public side, Rosset said they were targeting more than 100,000 cumulatively for both mitochondrial and Y-chromosome samples, but “we are now already [at] 200,000.” So they plan to accept as many samples as they can within the next few years until the project’s completion.
Rosset told BioInform that 200,000 kits have been sold, with an unspecified number of Y-chromosome samples having been returned. However, he said the return rate is in the same range as that of the mitochondrial samples.
To facilitate this process, participants have been filling out questionnaires on their ancestry, including ethnicity. The web site also contains an extensive library with information culled from the 10 different research centers around the world that solicit input from indigenous groups in their regions.
Rosset said that the project has developed a virtually foolproof system to ensure that participants can maintain anonymity.
“A ‘participation code’ is inside [the kit], and [this is on] stickers, which are the only way to be identified in the project,” he said. “We have an extremely careful wet lab process.”
Some are less confident about the project’s scientific motivation and guarantee of anonymity, however. George Patrinos, a cell biologist and geneticist at Erasmus University Medical Center in the Netherlands who is not involved in the project, told BioInform via e-mail that he’s “not convinced that this project is based on a non-profit model,” noting that $99 “is a considerably high price for genotyping.”
In addition, he said, “I am not convinced that anonymity is maintained since the participants provide data on their genealogy.”
Asked if the technology underlying the Genographic Project is capable of meeting the project’s goal of genotyping 100,000 people, Patrinos was skeptical.
“The technology has to be eventually replaced to address high-throughput and low-cost genotyping,” he said, though he added that he didn’t know if even that would reduce participation costs.
— Bernadette Toner contributed to this article.