Bacterial Classification Could Be Compromised as RefSeq Database Grows


NEW YORK (GenomeWeb) – A new analysis suggests some approaches for classifying bacteria from metagenomic sequences may become less precise as the collection of publicly available sequences in the RefSeq database grows.

Recognizing the ways in which the database can impact bacterial identification, in turn, is expected to guide future metagenomic interpretation methods, explained Todd Treangen, a computer science researcher at Rice University.

As they reported online last week in Genome Biology, Treangen and colleagues from the University of Maryland, the National Human Genome Research Institute, and Rice University compared RefSeq size over time, using simulated and real data to explore the consequences of this database size on lowest common ancestor (LCA) bacterial classification done with k-mer-based computational methods such as Kraken.

When the researchers looked at simulated data for 10 bacterial genomes originally used to validate Kraken, along with authentic metagenomic sequence data from fecal microbiomes and oral microbiomes assessed for efforts such as the Human Microbiome Project, they saw fewer species-level classifications with k-mer and LCA-based identification as the RefSeq database ballooned in size.

Instead, the team demonstrated, the microbes predicted from metagenomic sequences tended to edge up to broader taxonomic levels — the genus or family rather than species level, for example — as the RefSeq collection grew, leading to species-level ambiguity.

That pattern is not altogether surprising in some respects, given the massive influx of bacterial sequences submitted to RefSeq, Treangen said. A given string of 20 bases (a.k.a. a "20-mer") was likely to be quite uncommon across the bacterial collection back in the early days of the database, he explained, making it relatively straightforward to pin that sequence to a specific species.

Now, though, the same k-mer may turn up in multiple microbes from relatively far-flung branches of the bacterial tree, since RefSeq contains sequences from many bacterial species and from individual species that have been sequenced over and over — for example, in the food safety setting.

When these k-mers appear in distantly related microbes, bacterial classification gets pushed up the taxonomic tree to the most recent common ancestor with the k-mer sequence.
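The effect described above can be illustrated with a minimal sketch. It is not Kraken's implementation: the taxonomy, taxon names, sequences, and the short k of 4 (rather than the 20-mer mentioned above) are all hypothetical, chosen only to show how a k-mer's label moves up the tree as more genomes containing it are added to a reference set.

```python
# Hypothetical toy taxonomy: child -> parent. All names are illustrative only.
PARENT = {
    "species_A": "genus_X",
    "species_B": "genus_X",
    "species_C": "genus_Y",
    "genus_X": "family_F",
    "genus_Y": "family_F",
    "family_F": "root",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, inclusive."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lca(a, b):
    """Return the lowest common ancestor of two taxa in the toy tree."""
    ancestors_of_a = set(lineage(a))
    for node in lineage(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("taxa share no common ancestor")


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(genomes, k):
    """Map each k-mer to the LCA of every taxon whose genome contains it."""
    index = {}
    for taxon, seq in genomes.items():
        for kmer in kmers(seq, k):
            index[kmer] = lca(index[kmer], taxon) if kmer in index else taxon
    return index


K = 4  # assumption: a short k keeps the made-up sequences readable

# Three snapshots of a growing, made-up reference database.
early = {"species_A": "ACGTACGGTC"}
later = {**early, "species_B": "TTACGTAA"}
latest = {**later, "species_C": "GGACGTCC"}

print(build_index(early, K)["ACGT"])   # species_A
print(build_index(later, K)["ACGT"])   # genus_X
print(build_index(latest, K)["ACGT"])  # family_F
```

In this toy example, the same k-mer is labeled at the species level in the earliest snapshot, at the genus level once a second species from the same genus carries it, and at the family level once a species from another genus does.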

And microbial placement problems appeared to persist even when researchers turned to a Bayesian-based computational method called Bracken, which is designed to correct for such over-generalizations. They noted that Bracken could bring classifications closer to the species level, though the criteria used to help refine the classification also appeared prone to false positives or biased species identifications when dealing with new microbes.

Together, the analyses of these complementary classification methods highlight areas that require further consideration, new approaches, and additional algorithms to navigate this trade-off between bacterial classification specificity and species prediction accuracy.

"The two extremes are there," Treangen said. "If there is ambiguity, you may need a new approach or additional data to confirm the presence of one of these species."

While these trade-offs may be recognized by some researchers who routinely do metagenomic sequencing and analyses, the team hopes to highlight the potential pitfalls and limitations of specific bacterial classification tools for the broader research community, particularly as the popularity of metagenomic sequencing continues to grow.

"I think for most people the results will probably be surprising," said Mihai Pop, a computer science researcher and interim director of the University of Maryland Institute for Advanced Computer Studies. Pop was not an author on the Genome Biology paper, but is credited with providing its authors with early feedback and discussion on the project.

"If you are deep in the development of methods for the field, you would probably at least have the intuition that this might happen," he noted. "But for the typical user, this would be a very surprising result."

Based on their findings, Treangen and his co-authors concluded that "alternative approaches to traditional k-mer-based LCA identification methods … will be required to maximize the benefit of longer reads coupled with ever-increasing reference sequence databases and improve sequence classification accuracy."

The work falls within a broader effort to benchmark metagenomics methods in the context of various inputs or parameters, he said, noting that the possible classification impacts of the database itself had not been investigated fully in the past.

The authors cautioned that "while we only evaluated Kraken and Bracken in this study, the challenges of RefSeq database growth stretch beyond k-mer-based classification methods and are likely to affect other LCA-based approaches."

Though it is comparatively simple to evaluate such effects in RefSeq, since older versions of the database can still be accessed, there are concerns that such issues may exist in other reference sequence databases as well.

"We need to address this as a community, on some level," Pop said, adding that the results "point to how important the databases are and also the question of how we can get databases that will give us the information we need and not confuse us."

He predicted that investigators will need to address the problem on both the computational and database sides of the equation. For example, the current results hint that at least some of the sequence data being deposited to RefSeq are misannotated, contributing to poor specificity in Kraken-based bacterial identifications.

"The traditional mode in the field had been that you had human annotators who would carefully curate every sequence and this is clearly no longer possible," Pop said. "We have to figure out ways of doing this in an automated fashion or, at the very least, identify where mistakes could be in an automated fashion."

In the meantime, Pop noted, he typically addresses the issue by using such automated classification methods as a first pass rather than a final arbiter of bacterial identity, digging into the evidence behind an apparent annotation.

At Rice University, Treangen's team is working on a method to estimate, after the fact, whether classification calls for a given metagenomic sequence dataset are particularly aggressive or conservative, making it possible to be on the lookout for false positives or taxonomically ambiguous identifications, respectively.

The researchers are also exploring strategies for maximizing the information that can be gleaned from databases such as RefSeq while minimizing the type of redundancy that contributes to overly general identifications.

"Imagine that you can design a database that is stable over time," Treangen said, "where you could modify the database in a certain way — or run something on an earlier or later version — that would produce a much more stable result."