Reference Bias in Ancient DNA Data Pervasive, But Effects Unclear


NEW YORK (GenomeWeb) — A number of studies of ancient human DNA could be influenced by reference bias, though the effects of that bias aren't clear, according to a new analysis.

When ancient DNA sequencing reads — which are typically fragmented — are aligned against the human reference genome, there's a tendency for alleles that are present in that reference to become overrepresented and alternative alleles to be overlooked in the ancient alignment.

Two researchers from Uppsala University analyzed reference bias in ancient human DNA studies and reported in a preprint posted to BioRxiv earlier this month that reference bias was pervasive. However, while this bias could affect downstream analyses, the researchers were unable to tease out any patterns in its effects.

Still, they and others said the finding underscores the need to be aware of reference bias.

"It is something that could actually be having quite a broad effect throughout all of these ancient DNA studies," Anna Gosling, a postdoctoral researcher at the University of Otago who studies genetic variation among modern and ancient Pacific populations, told GenomeWeb.

As Gosling noted, the human reference genome only captures a portion of the genetic variation present among modern humans, and ancient populations are expected to have even higher levels of diversity. These diverse ancient alleles are therefore less likely to be found in a modern reference genome, which contributes to reference bias.

Additionally, the fragmented samples and low coverage common in ancient DNA studies, as well as the use of randomly sampled alleles and pseudo-haploid data, may further amplify the effect. Reference bias has a particular influence on reads of 30 to 50 base pairs, the size range that researchers often assume represents authentic ancient DNA, Gosling noted.

In their study, the Uppsala researchers examined the prevalence of reference bias in published ancient DNA datasets.

They focused their analysis on SNPs at sites known to be polymorphic among modern human populations and then investigated ones thought to be heterozygous within published medium-to-high-coverage ancient human and hominin genomes, including Neanderthal and Denisovan datasets.

If a site is heterozygous, the researchers reasoned, reads from that individual should carry the reference and alternative alleles in equal numbers. However, they found that when they mapped their ancient genomes to the human reference, the average proportion of alternative alleles was less than the expected 50 percent for all the anatomically modern humans they investigated, suggesting reference bias.
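The reasoning behind this check can be illustrated with a minimal Python sketch. It is not the researchers' code: the function name `mean_alt_fraction` and the read counts are hypothetical, chosen only to show how an average alternative-allele fraction below 0.5 at presumed-heterozygous sites would point to reference bias.

```python
def mean_alt_fraction(sites):
    """Average share of alternative-allele reads across heterozygous sites.

    `sites` is an iterable of (ref_reads, alt_reads) pairs, one pair per
    site assumed to be heterozygous. Sites with no reads are skipped.
    """
    fractions = [alt / (ref + alt) for ref, alt in sites if ref + alt > 0]
    if not fractions:
        raise ValueError("no covered sites")
    return sum(fractions) / len(fractions)


# Hypothetical read counts (reference, alternative), invented for illustration.
toy_sites = [(12, 10), (9, 9), (15, 11), (8, 7)]
observed = mean_alt_fraction(toy_sites)

# Without bias the expectation is 0.5; a lower value suggests that reads
# carrying the reference allele are overrepresented.
print(f"mean alternative-allele fraction: {observed:.3f}")
```

In this toy input the average falls a few percentage points short of 0.5, which is the kind of shortfall the analysis treats as a sign of reference bias.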

They noted that when stricter filters for mapping quality were applied to the data, the reference bias was slightly stronger, though they cautioned that not using a filter could introduce other errors, such as those from microbial contaminants.

Still, they reported that when no mapping filter was used, the Neanderthal and Denisovan genomes in their analyses exhibited a bias toward the alternative allele. This, they said, hints that these hominins may carry variation within their genomes that is not captured by the modern human reference genome.

The researchers noted that their analysis also indicated that the strength of the reference bias might differ across various regions of the genome.

The human reference genome is a mosaic of individuals from different ancestral backgrounds, noted Krishna Veeramah, an assistant professor at Stony Brook University. "We probably need a better idea of … if the effect is going to be different based on if the position you are looking at tends to be in the areas that are [from an African-American or European individual]," he said.

The researchers also found that, as they expected, shorter fragments experienced a stronger reference bias than longer ones.
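One way to see why fragment length could matter is a toy Python model. Everything in it is an assumption made for illustration, not the preprint's method: it supposes that a read maps only if its mismatches to the reference stay within a fixed percentage of its length, that damage-related mismatches range from zero to two per read, and that a read carrying the alternative allele has one extra mismatch. The function name `mapped_alt_fraction` and the 5 percent threshold are hypothetical.

```python
def mapped_alt_fraction(read_length, max_mismatch_percent=5,
                        damage_counts=(0, 1, 2)):
    """Share of alternative-allele reads among reads that pass the filter.

    Toy assumption: equal numbers of reference- and alternative-allele
    reads exist for each damage count, and a read maps only if its total
    mismatches do not exceed the allowed count for its length.
    """
    allowed = read_length * max_mismatch_percent // 100
    # Reference-allele reads mismatch only at damaged positions.
    ref_mapped = sum(1 for d in damage_counts if d <= allowed)
    # Alternative-allele reads carry one additional mismatch at the SNP.
    alt_mapped = sum(1 for d in damage_counts if d + 1 <= allowed)
    total = ref_mapped + alt_mapped
    if total == 0:
        raise ValueError("no reads pass the mismatch filter")
    return alt_mapped / total


short_fraction = mapped_alt_fraction(30)   # short fragment
long_fraction = mapped_alt_fraction(100)   # long fragment
print(short_fraction, long_fraction)
```

Under these assumptions the short reads lose more of their alternative-allele copies than the long reads do, because a single extra mismatch uses up a larger share of a short read's mismatch allowance.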

This means, Gosling said, that in estimates of population diversity based on heterozygosity found in samples with shorter reads, "you might be getting a lot less diversity in your ancient samples than there actually might have been."

While the Uppsala researchers found that reference bias could influence downstream analyses, the effects they found were not consistent. "They were both having an effect on what they were able to show, but the effects weren't consistent," Gosling said. "So that's interesting."

To gauge how reference bias influences estimates of population affinity, the Uppsala researchers generated four different versions of genotypes for the Scandinavian Mesolithic hunter-gatherer sf12 (one with short reads, one with long reads, one with pseudo-haploid calls, and one with diploid calls) and used D statistics to test for affinities between those and modern populations, represented by Simons Genome Diversity Project data and genotyped Human Origins population data. In general, they noted a deviation from what they expected, which suggested an effect of reference bias on these estimates.
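For readers unfamiliar with D statistics, the following Python sketch shows one common formulation based on counting ABBA and BABA site patterns in pseudo-haploid data. It is a simplified illustration, not the pipeline used in the preprint: the function name `d_statistic`, the sign convention, and the toy genotypes are assumptions, and real analyses also estimate standard errors, typically with a block jackknife.

```python
def d_statistic(sites):
    """D = (ABBA - BABA) / (ABBA + BABA) over biallelic sites.

    Each site is a tuple (p1, p2, p3, outgroup) of single alleles coded
    0 or 1, one allele per sample (pseudo-haploid calls).
    ABBA: p2 and p3 share an allele that p1 and the outgroup lack.
    BABA: p1 and p3 share an allele that p2 and the outgroup lack.
    """
    abba = baba = 0
    for p1, p2, p3, out in sites:
        if p1 == out and p2 == p3 and p1 != p2:
            abba += 1
        elif p2 == out and p1 == p3 and p1 != p2:
            baba += 1
    if abba + baba == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / (abba + baba)


# Hypothetical alleles, invented for illustration.
toy_genotypes = [
    (0, 1, 1, 0),  # ABBA
    (0, 1, 1, 0),  # ABBA
    (1, 0, 1, 0),  # BABA
    (0, 0, 1, 0),  # uninformative
    (1, 1, 1, 0),  # uninformative
]
d_value = d_statistic(toy_genotypes)
print(f"D = {d_value:.3f}")
```

With this sign convention, a positive D means the second sample shares more alleles with the third than the first sample does. If reference bias shifts which allele is called in the ancient sample, the ABBA and BABA counts shift too, which is how the bias could reach affinity estimates.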

But the researchers reported that the direction of the bias differed between the Simons Genome Diversity Project and genotyped Human Origins population data. This indicated to them that different reference data have varying influences on bias. Based on this, they said they could not conclude that ancient DNA papers have been systematically biased in some direction. Instead, they said the bias appears to be dataset- and test-specific.

"It's really hard to conclude — and they don't really conclude — what the effect is," Veeramah said.

Still, he added that any differences due to reference bias in published analyses are likely small. It's not, he said, as if researchers would suddenly learn that Neanderthal introgression into modern humans did not occur. "It's, I think, pretty clear that it happened," he said.

Gosling concurred. She predicted that mitigating reference bias would enable researchers to piece together a more nuanced picture of variation.

A few different strategies might be able to account for the effect of reference bias. The Uppsala team showed that two post-mapping filtering approaches, which involve modifying reads or introducing a third allele type, could reduce, though not eliminate, reference bias.

Gosling noted that these approaches might not be best for everyday use, as researchers in various labs would then be relying on different references, making comparisons across labs difficult. Instead, she said new methods are needed.

"Until we get more computational people onto this to figure out some new mapping methodologies, I think it's going to be very difficult to actually get around," she said. "It's very difficult, obviously, to quantify how much of an effect this might be having on the analysis we're doing when we don't know what variation there was there to start with."

Veeramah suggested simpler approaches: mapping to the chimpanzee genome, as the first Neanderthal paper did, or using a reference that contains both alleles. He noted that other groups have been accounting for reference bias in a number of ways, such as constructing ancestor genomes when working with humans and Neanderthals. Researchers studying non-human organisms such as Drosophila have also been grappling with this issue, he added, and have been relying on iterative mapping approaches.

But all of these, he said, add time to the process, and it's unclear whether the reference bias effect is more important than other issues, such as SNP ascertainment bias.

Whatever approaches are used to account for reference bias, Veeramah and Gosling both said it is important that they be adopted community-wide so that analyses can be replicated.

"It's a pretty good paper for making ancient DNA specialists start thinking about what some of our assumptions have been," Gosling added.