Pathogen Search Underscores Need for Validation Steps When Sequencing Ancient Metagenomes


Authors of a recent study in BMC Research Notes have concluded that archeological control testing and targeted verification testing are key to avoiding false-positive pathogen identification when doing metagenomic sequencing on ancient disease samples.

As part of an effort to find pathogens behind a 16th century plague that decimated members of native Mixtec populations in and around Oaxaca, Mexico, researchers from the US, Switzerland, and Mexico did metagenomic sequencing on human remains from a well-known plague graveyard in the region.

An analysis of that sequence data hinted at the presence of a few plausible pathogen suspects, they explained. But when very similar sequences turned up in remains from a pre-plague graveyard and in soil samples, the team concluded that the apparent pathogen sequences were just as likely to be false-positive identifications as authentic plague perpetrators.

By sharing their results, the investigators hope to highlight the potential perils of relying on metagenomic sequence data alone to investigate ancient infections, the study's first author Michael Campana told In Sequence. "We thought this would be useful as a case study for other people trying to do this work."

At the time the study was done, Campana was a post-doctoral researcher in corresponding author Noreen Tuross' human evolutionary biology lab at the Harvard University Peabody Museum. He is currently based at the University of Zurich.

As part of a discussion of their own findings, he and his colleagues also provided examples of previous studies that they believe may benefit from follow-up analyses such as control-sample sequencing and/or targeted sequencing of suspected pathogens.

Generally speaking, Campana argued that researchers who find a plausible pathogen or other promising results via metagenomic sequencing of ancient samples would benefit from using targeted-enrichment methods to confirm the presence of the candidate microbe and more fully characterize it with sequencing.

"My basic recommendation is if you do some sort of metagenomic approach and you get a result you think is promising, go back with a capture-based approach and get the whole genome," Campana said. "Because then you can get really high-quality data."

Campana and his co-authors mentioned a Journal of Applied Genetics study published last spring — work that identified the malaria parasite Plasmodium falciparum and toxoplasmosis culprit Toxoplasma gondii in Egyptian mummies — as an example of metagenomics-based research that may require further follow-up validation.

In an email message to In Sequence, that study's first author Rabab Khairat, a human genetics researcher affiliated with the University of Tübingen and Egypt's National Research Centre, said that she and her co-authors confirmed the presence of P. falciparum in the mummy samples in unpublished experiments using PCR and Sanger sequencing-based tests for two membrane antigens that are highly specific to that malaria pathogen.

Those results were included in a paper published in PLOS One by members of the team at around the same time, Khairat added, noting that the P. falciparum pathogen was specifically detected in mummy specimens obtained from warm climates.

"We confirmed the result of non-ubiquitary eukaryotic pathogens by PCR to highly specific targets," she noted, "and [by] analyzing the mapping to the reference sequences."

Moreover, Khairat and colleague Markus Ball, also with the University of Tübingen, offered critiques of approaches used in the Mexican plague study, arguing that the Helicos sequencing method used is not suited for metagenomic sequencing due to the short reads it produces.

They also questioned the significance of the apparent false-positive pathogen identifications described, since BLAST searches were done after metagenomic sequence mapping and involved somewhat scant and non-specific matches to the reference sequences for potential pathogens.

Nevertheless, Khairat agreed that follow-up analyses are necessary to confirm metagenomic sequence-based findings in general. "The right concept would be to BLAST first the sequences with high significance parameters and [check] the results for possible pathogens," she said in an email. "If there are sequences matching a specific pathogen, the mapping should be used to confirm or withdraw these finding[s] by checking if the pathogenic genome is covered equally."
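The coverage check Khairat describes can be illustrated with a minimal Python sketch. It assumes reads have already been mapped to a candidate reference and that only their start positions are at hand. The function name, the window size, and the toy coordinates are hypothetical, and this is not the method used in either study. The idea is that reads from a genuinely present organism should spread across its genome, while spurious matches tend to pile up in a few conserved regions.

```python
from statistics import mean, pstdev


def coverage_summary(read_starts, read_length, genome_length, window=1000):
    """Return (breadth, cv) for reads mapped to one reference.

    breadth: fraction of reference positions covered by at least one read.
    cv: coefficient of variation of mean depth across fixed-size windows;
        lower values indicate more even coverage.
    All inputs are illustrative; positions are 0-based.
    """
    depth = [0] * genome_length
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_length)):
            depth[pos] += 1

    breadth = sum(1 for d in depth if d > 0) / genome_length
    window_means = [
        mean(depth[i:i + window]) for i in range(0, genome_length, window)
    ]
    average = mean(window_means)
    cv = pstdev(window_means) / average if average > 0 else float("inf")
    return breadth, cv


# Toy comparison on a made-up 10,000-base reference with 30-base reads.
even_reads = list(range(0, 10000, 10))            # spread along the reference
clustered_reads = [i % 100 for i in range(1000)]  # piled into one short region

print("even:", coverage_summary(even_reads, 30, 10000))
print("clustered:", coverage_summary(clustered_reads, 30, 10000))
```

Both toy read sets contain the same number of reads, so read counts alone would not separate them. Breadth and evenness do, which is the point of using the mapping to "confirm or withdraw" a BLAST-based hit.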

For their newly described analysis, Campana and his co-authors focused their efforts on trying to find the causative agent behind a deadly plague that affected Native Mexicans in the mid-1500s.

The disease, known as "huey cocoliztli," or "The Great Pestilence," was not believed to be present in the region prior to Spanish colonization, the study's authors explained. According to available historical records, the disease had far deadlier effects in native populations than in colonial settlers, suggesting it was new to the region.

As part of an ongoing collaboration with Mexico's National Institute of Anthropology and History (the Instituto Nacional de Antropologia e Historia), senior author Tuross and her team had access to samples from a site known as Teposcolula Yucundaa.

"It is interesting anthropologically to begin with, because we didn't know what the disease was," Campana said.

He noted that a few researchers had attempted to search for specific pathogens using older technologies such as PCR. But the current study marked the first time high-throughput metagenomic sequencing had been applied to this particular plague pathogen mystery.

To that end, the researchers sequenced DNA isolated from the femoral bones of a dozen deceased individuals buried at a graveyard called Grand Plaza — which contains a previously described "plague pit" for individuals who succumbed to the 1540 outbreak of huey cocoliztli — or at a burial site known as the Churchyard that was used prior to the outbreak. They also sequenced DNA from soil samples collected near the ancient remains.

"The only thing we did in terms of trying to identify false-positives was to sequence a lot of environmental controls and samples that came from people who we don't think had the disease," Campana said.

The team did metagenomic sequencing on pooled samples from each site using the Helicos HeliScope, grouping together four samples from the Churchyard control site and two sets of four samples apiece from the Grand Plaza site.

"Bulking the samples increased the likelihood of detecting the pathogen since only a fraction of the infected individuals are expected to have endogenous disease DNA preserved," the study authors noted.
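The reasoning behind bulking can be made concrete with a small Python sketch. It assumes each individual independently has some probability of carrying preserved pathogen DNA. That probability is not reported in the article, so the 0.25 used below is purely hypothetical, as is the function name. The sketch also ignores the fact that pooling dilutes each individual's DNA.

```python
def pooled_detection_probability(per_sample_probability, pool_size):
    """Chance that at least one sample in a pool carries preserved pathogen DNA.

    Assumes samples are independent and share the same per-sample
    probability, which is a simplification made for illustration.
    """
    if not 0.0 <= per_sample_probability <= 1.0:
        raise ValueError("per_sample_probability must be between 0 and 1")
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return 1.0 - (1.0 - per_sample_probability) ** pool_size


# Hypothetical per-sample probability; pools of four match the study design.
for size in (1, 4):
    print(size, pooled_detection_probability(0.25, size))
```

Under these assumptions the chance of a pool containing at least one informative sample rises with pool size, in line with the authors' stated rationale for grouping four samples together.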

Subsets of the bulk samples were subsequently sheared or treated with a phosphatase enzyme that's been used to increase HeliScope sequence yields in past studies.

From there, the researchers prepared libraries from the untreated, sheared, or phosphatase-treated bulk samples and sequenced them by single-molecule sequencing on the HeliScope.

The group was interested in generating Helicos data as a possible means of circumventing the previously described sequence biases associated with Illumina sequencing, Campana noted, adding that protocols associated with the Helicos instrument are relatively simple and straightforward.

As it turned out, the Helicos instruments produced metagenomic sequence reads that were very short (in most cases fewer than 30 base pairs apiece), making it tricky to identify organisms in the mix due to noise. Those profiles partly reflect the short-read nature of the platform itself, which generates reads of up to 60 bases using some settings, Campana noted.

But read lengths also appeared to be diminished due to the presence of ancient DNA. The study's authors noted that the instrument appeared to have problems reading through sequence degradations involving certain forms of uracil — a possibility that still needs to be formally explored.

To complement that data, the team used Illumina's HiSeq 2500 to sequence DNA extracts from six of the ancient bones (five from the Grand Plaza plague burial site and one from the Churchyard site). It also did Illumina metagenomic sequencing on DNA from a single soil sample.

The Illumina sequences were much longer than those generated with the HeliScope, the researchers reported, though the proportion of endogenous DNA was somewhat lower in the Illumina dataset.

The group used an Illumina protocol that was a bit different from those typically used for ancient samples, Campana noted. "We used a robot because we wanted to see if it would work, basically."

When the researchers ran their sequence data through metagenome analysis software called MEGAN — comparing the metagenomic sequence reads to sequences from as many organisms as they could get their hands on — they found a few potential suspects in the Grand Plaza plague samples.

In particular, analysis of metagenomic sequence from the Grand Plaza sample unearthed reads resembling sequences found in the pneumonic plague pathogen Yersinia pestis and species in the Rickettsia genus, which can cause rickettsiosis.

Both culprits could theoretically cause conditions that broadly fit with symptoms documented by Spanish priests during the huey cocoliztli outbreak.

Even so, Campana said, it was still difficult to make predictions about the disease, both because of issues with the translation of some of those records and because the priests did not have medical training.

"The Spanish priests weren't trained medical doctors," he noted. "And even if they were, they wouldn't have recorded the symptoms [as we would today]."

Even more troubling was the fact that some of the most suspicious sequences detected in samples from the plague victims also appeared in control samples from the pre-plague cemetery and/or in soil samples, perhaps reflecting the fact that the potential pathogens in question tend to share sequence similarities with their more harmless soil relatives.

Indeed, when they looked more closely at control samples, the investigators saw the same sequences there, arguing against the authenticity of the potential pathogen associations.

And as far as metagenomic sequence differences between the plague samples and the controls? There were "none that I would trust between the two," Campana said.
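The control comparison described here amounts to a set operation, sketched below in Python. The taxon labels and sample names are placeholders rather than the study's data, and the function is a hypothetical illustration, not the team's actual pipeline. Any candidate that also turns up in a pre-plague burial or a soil control is flagged as a likely false positive.

```python
def flag_candidates(plague_hits, control_hits):
    """Split taxa seen in plague samples by whether any control also has them.

    plague_hits: set of taxon labels detected in the plague samples.
    control_hits: dict mapping a control name to the set of taxa found in it.
    Returns (shared, unique) as sorted lists.
    """
    seen_in_controls = set().union(*control_hits.values())
    shared = sorted(plague_hits & seen_in_controls)
    unique = sorted(plague_hits - seen_in_controls)
    return shared, unique


# Placeholder labels only; these are not results from the study.
plague = {"taxon_A", "taxon_B", "taxon_C"}
controls = {
    "pre_plague_burial": {"taxon_A"},
    "soil": {"taxon_B", "taxon_D"},
}

shared, unique = flag_candidates(plague, controls)
print("also in controls:", shared)
print("plague samples only:", unique)
```

Only taxa in the second list would be worth carrying forward to targeted follow-up. In the study itself, Campana reported no differences he would trust between plague samples and controls.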

Based on their findings so far, the study's authors argued that similar metagenomic studies on ancient samples would benefit not only from sequencing multiple controls, but also from doing targeted enrichment-based sequencing to determine whether candidate pathogens are truly present in a given sample.

"Our results demonstrate that false positives are a serious problem for analyses identifying molecules via alignment against reference genomes," they wrote, "and for analyses that omit sequencing archaeological controls."

"Nevertheless," they added, "capture of complete species-specific diagnostic sequences and genomes may be a viable method for isolating and verifying ancient pathogen DNA in the absence of these controls."

So far, the researchers have not gone back to do targeted sequencing on huey cocoliztli samples to see if they can nab full-length sequences for potential pathogens, though they are considering strategies for doing such analyses.