NEW YORK (GenomeWeb) – Researchers led by computer scientist Yaniv Erlich have analyzed giant crowd-sourced family trees with up to 13 million members, generated from 86 million public profiles from a genealogy website. Based on the birth and death dates of millions of relative pairs, they estimated the genetic component of longevity to be lower than what other studies have reported, a finding that some experts dispute.
The data also offer fascinating insights into the migration of families over hundreds of years, and may aid other studies that combine family trees with DNA data.
"For sure, the next logical step is to start to overlay autosomal DNA data for those individuals who were consented for research," said Erlich, who is the CSO of MyHeritage, a genealogy company based in Israel that owns Geni.com and provides DNA ancestry testing. He also holds academic appointments at Columbia University and the New York Genome Center.
The study, which he started prior to joining MyHeritage, has been seven years in the making and appears in Science today, though his team published a preprint of the paper on the bioRxiv server a year ago.
Atul Butte, a professor at the University of California, San Francisco, and director of its Institute for Computational Health Sciences, said he is impressed by the work. "Being able to build a pedigree of 13 million individuals is particularly amazing, given that number is larger than 68 percent of the countries in the world," he said.
For their analysis, the researchers focused on Geni.com, a genealogy website that allows users to upload their family tree and connect it with others if there is a shared relative. According to its website, Geni has close to 120 million individuals in its database, including both people alive today and their ancestors. MyHeritage acquired Geni, which was co-founded by former PayPal COO David Sacks, in 2012 for an undisclosed amount.
After obtaining permission, Erlich's team downloaded more than 86 million publicly available profiles from the site, first in 2011 and then again in 2015. The next step was to organize and clean up the family trees, he explained, removing, for example, instances where an individual appeared to be both the father and the son of someone, or where a person was shown as having more than two parents.
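The two clean-up checks mentioned above (a person with more than two parents, and a person who is both ancestor and descendant of someone) can be sketched roughly as follows. This is a minimal illustration, not the team's actual pipeline: the function name `find_pedigree_conflicts` and the `(child, parent)` input format are assumptions made for the example.

```python
from collections import defaultdict

def find_pedigree_conflicts(parent_links):
    """Flag suspicious profiles in a list of (child, parent) pairs.

    Returns two sets: profiles with more than two recorded parents, and
    profiles lying on an ancestry loop found by a depth-first search.
    """
    parents = defaultdict(set)
    for child, parent in parent_links:
        parents[child].add(parent)

    too_many_parents = {c for c, ps in parents.items() if len(ps) > 2}

    # Depth-first search along child -> parent edges with three-state marking:
    # 0 = unvisited, 1 = on the current path, 2 = fully explored.
    state = defaultdict(int)
    in_loop = set()
    for start in list(parents):
        if state[start] != 0:
            continue
        state[start] = 1
        path = [start]
        stack = [(start, iter(parents[start]))]
        while stack:
            node, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                state[node] = 2
                stack.pop()
                path.pop()
            elif state[nxt] == 1:
                # nxt is already an ancestor on the current path: a loop.
                in_loop.update(path[path.index(nxt):])
            elif state[nxt] == 0:
                state[nxt] = 1
                path.append(nxt)
                stack.append((nxt, iter(parents.get(nxt, ()))))
    return too_many_parents, in_loop

# Hypothetical example: "a" and "b" are each other's parent; "c" has three parents.
links = [("a", "b"), ("b", "a"), ("c", "d"), ("c", "e"), ("c", "f")]
too_many, loops = find_pedigree_conflicts(links)
```

How flagged profiles are then resolved (dropping an edge, a profile, or a whole subtree) is a separate decision that the sketch does not address.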
This resulted in 5.3 million separate family trees, the biggest of which includes 13 million people, all connected by ancestry or marriage. On average, that tree spans 11 generations between the founders and the last descendant. "We have some profiles going back to the 1400s," Erlich said. "You would be amazed to see that in each family, there is a genealogist who is really into documenting these family trees extensively."
To verify the accuracy of their crowd-sourced family tree with genetic data, the researchers obtained mitochondrial DNA data for 211 lineages, and Y-STR haplotype data for 27 lineages from users who had shared their DNA test results publicly on websites such as Ysearch.org or Mitosearch.org. They then compared pairs of relatives in their tree, finding that the non-maternity and non-paternity rates were similar to those of previous studies. "Taken together, these results demonstrate that millions of genealogists can collaborate in order to produce high-quality population-scale family trees," they wrote.
For their analysis of the heritability of longevity, they looked at the age at death of related individuals in their database and defined longevity as the difference between a person's age at death and the expected lifespan.
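As a rough illustration of that definition, the sketch below computes longevity as age at death minus an expected lifespan. The article does not say how the expected lifespan was derived, so the lookup by birth decade and sex is an assumption, and the table values are made-up placeholders, not figures from the study.

```python
# Placeholder expected lifespans in years, keyed by (birth decade, sex).
# These numbers are invented for the example and are NOT from the study.
EXPECTED_LIFESPAN = {
    (1850, "F"): 62.0,
    (1850, "M"): 60.0,
}

def longevity_score(birth_year, sex, age_at_death, table=EXPECTED_LIFESPAN):
    """Return age at death minus the expected lifespan for the person's group."""
    decade = birth_year // 10 * 10
    return age_at_death - table[(decade, sex)]

# A woman born in 1853 who died at 70 outlived the placeholder expectation.
score = longevity_score(1853, "F", 70.0)
```

Under this definition a positive score means a person outlived the expectation for their group and a negative score means they died earlier, which is why critics describe it as variability around age at death rather than extreme long life.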
Previous studies in various populations, including Mormons, Danes, French-Canadians, and the Amish, all conducted before 2002, had come up with various estimates for the inherited part of longevity, on the order of 15 to 30 percent. However, those studies only involved up to 80,000 individuals each.
In various study designs that involved millions of relative pairs, Erlich and his team were able to "decompose the components of variance of longevity, so we could assess the heritability of longevity much more precisely," he said. That way, they were able to tease out how much additivity, dominance, shared household environment, and epistasis contribute to inherited longevity, and found that most of the effects were additive. That additive component, about 16 percent, is considerably smaller than the 25 percent often cited in the literature, the researchers noted.
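The decomposition described here follows the textbook quantitative-genetics form, written out below for orientation. The symbols are assumptions chosen for this illustration and may not match the paper's notation: V_P is the total phenotypic variance, and V_A, V_D, V_I, V_C, and V_E are the additive, dominance, epistatic, shared-household, and residual components. The numbers are the ones quoted in this article.

```latex
\begin{align}
V_P &= V_A + V_D + V_I + V_C + V_E \\
h^2 &= \frac{V_A}{V_P} \approx 0.16, \qquad
\frac{V_D}{V_P} \approx 0.04, \qquad
\frac{V_I}{V_P} \approx 0
\end{align}
```

Here h^2 is the narrow-sense (additive) heritability, the quantity being compared with the 25 percent figure often cited in the literature.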
"There is a long-lasting debate in the field about whether epistasis contributes to the architecture of complex traits," Erlich said, and whether this could explain why genome-wide association studies have so far not discovered many variants connected to longevity. "At least for longevity, despite the massive dataset and the different types of analysis, we couldn't find any signs of epistasis in our data. But we did find 4 percent dominance."
Overall, the researchers concluded, their results "indicate that previous studies are likely to have over-estimated the heritability of longevity. As such, we should lower our expectations about our ability to predict longevity from genomic data and presumably to identify causal genetic variants."
Longevity, Erlich suggested, may have to do more with luck and circumstance than genetics. "There are no guarantees," he said. "Maybe your parents lived long, but it doesn't mean you are going to live long, and also the opposite, if your parents died younger, it doesn't mean you are doomed."
However, not everyone agrees with this finding. According to Paola Sebastiani, a professor of biostatistics at Boston University who has been studying human longevity, "their conclusion about the low heritability of longevity is too broad and not backed up by the data."
She said that the team's definition of longevity is not 'long life' in an absolute sense but 'variability around age of death' in a broad sense, and that defining longevity so loosely can influence the results. She likened this to measuring the variation of blood glucose levels to study the genetics of diabetes, when only extreme blood glucose levels above a certain threshold define diabetes.
In an article published two years ago in the Journals of Gerontology, she said, her team "advocated for using a sound definition of longevity based on survival probabilities of well-defined cohort tables" and showed that "the heritability of longevity depends on how extreme the survival probabilities used to define longevity are." For example, heritability was low if they looked at those living longer than 95 percent of the population, but it was much higher when they looked at the extremes, individuals living longer than 99.9 percent or 99.99 percent of the population.
"As we look at more and more extreme definitions of longevity and sufficiently large sample sizes, the yield of genetic findings increases," she said. In genome-wide association studies published last year, for example, her group uncovered new rare variants associated with extreme longevity.
While Erlich's analysis of the Geni data appears to be the first of this scale, harnessing genealogy data from millions of individuals to study the heritability of longevity, AncestryDNA has also said it is working on such a study.
In 2015, the company, a subsidiary of Ancestry.com, said it had teamed up with Calico, which focuses on longevity research and therapeutics, to study the genetics of human lifespan using data from millions of public family trees and Ancestry's database of more than a million genetic samples. Specifically, the firms said they were planning to "investigate the role of genetics and its influences in families experiencing unusual longevity using Ancestry's proprietary databases, tools, and algorithms," and that Calico would seek to develop therapeutics based on the results. A spokesperson for Ancestry declined to comment on the status of that research.
Erlich said his team, with permission from MyHeritage, is offering researchers access to its Geni pedigree and demographic data in a de-identified format via a website, FamiLinx.org. "We want to enable the community … to see where other people are taking this study," he said, adding that they also have surveys for participants available that are modeled after questionnaires used by the UK Biobank.
In addition, through an API, Geni participants can link their Geni profile openly to participate in research that combines their pedigree with other types of data, such as genomics information. Erlich said that DNA.Land, for example, a website he founded in 2015 that lets users contribute their own genome data for research and helps them interpret it, has been using this mechanism for the last three years, overlaying users' genomes with their family trees. DNA.Land is independent of MyHeritage, and lets users upload DNA data from 23andMe, Ancestry, and Family Tree DNA.
Researchers can learn a lot from marrying genomic data with family tree data, he said, for example about relatedness of distant relatives. "We have theoretical models about what fourth cousins should look like in terms of identity-by-descent," he said, and using those data, "you can actually test that." MyHeritage has been doing this already, he said, allowing it to improve its relative-matching pipeline.
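The kind of theoretical expectation Erlich refers to can be illustrated with the standard textbook calculation of the expected fraction of the autosomal genome shared identical-by-descent between cousins. The sketch assumes full cousins (descended from the same ancestral couple) and no inbreeding; the function name is made up for the example, and this is not MyHeritage's matching pipeline.

```python
def expected_ibd_fraction(cousin_degree):
    """Expected fraction of the genome shared IBD by full k-th cousins.

    Each of the two shared ancestors is separated from the cousins by
    2 * (k + 1) meioses in total, and each meiosis halves the expected
    sharing, so the expectation is 2 * (1/2) ** (2 * (k + 1)).
    """
    meioses = 2 * (cousin_degree + 1)
    return 2 * 0.5 ** meioses

first_cousins = expected_ibd_fraction(1)   # 0.125, i.e. 12.5 percent
fourth_cousins = expected_ibd_fraction(4)  # 1/512, roughly 0.2 percent
```

Actual sharing between distant cousins varies widely around this expectation, and many true fourth cousins share no detectable segment at all, which is why pairing pedigrees with genotype data is useful for testing such models.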