NEW YORK (GenomeWeb) – Scientists have devised a software-driven system to identify and characterize bacterial populations in environmental and clinical shotgun metagenomic samples that they say reduces false positives.
Led by Patrick Chain, scientists at the Los Alamos National Laboratory have combined bacterial genome databases and a search algorithm to create a system that can reveal the constituents of metagenomic samples through a process dubbed Genomic Origins Through Taxonomic Challenge (GOTTCHA). The system uses unique reads obtained through next-generation sequencing to resolve the taxonomy of bacterial species in metagenomes at any level from class all the way down to bacterial strain. They recently published their method in Nucleic Acids Research.
"It was developed around the idea of trying to eliminate false positive hits or identification, which we observed was quite problematic of other tools we tried," Chain told GenomeWeb. "We hope that this is applicable to anything. We've tried it with soil, air filter, and water samples, and human-derived clinical samples. It's amenable to any type of metagenomic effort."
The algorithm works by first identifying unique sequences that can be used to identify bacteria at every taxonomic level and creating databases for each level. It winnows the level database by excluding regions of the genome that would confuse a tool searching for similarities and alignments.
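The idea of level-specific unique sequences can be illustrated with a toy sketch. It is not GOTTCHA's implementation: it assumes fixed-length k-mers as the unit of comparison, ignores reverse complements, and uses made-up genome and taxon names. It keeps, for each taxon at a chosen level, only the k-mers that no other taxon at that level contains.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_signatures(genomes, taxon_of, k):
    """Map each taxon to the k-mers found in no other taxon at this level.

    genomes:  dict of genome id -> sequence (hypothetical toy data)
    taxon_of: dict of genome id -> taxon label at the level of interest
    """
    owners = defaultdict(set)  # k-mer -> set of taxa that contain it
    for genome_id, seq in genomes.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(taxon_of[genome_id])

    unique = defaultdict(set)
    for kmer, taxa in owners.items():
        if len(taxa) == 1:  # shared k-mers are dropped from the database
            unique[next(iter(taxa))].add(kmer)
    return dict(unique)


# Hypothetical genomes and labels, for illustration only.
genomes = {"g1": "ACGTACGG", "g2": "ACGTTTGA", "g3": "TTGACCCA"}
species = {"g1": "sp_A", "g2": "sp_B", "g3": "sp_C"}
genus = {"g1": "gen_X", "g2": "gen_X", "g3": "gen_Y"}

species_db = unique_signatures(genomes, species, k=4)
genus_db = unique_signatures(genomes, genus, k=4)
```

In this toy data the k-mer ACGT occurs in two species of the same genus, so it is excluded from the species-level database but remains usable as a genus-level identifier.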
"Most other tools work by targeting conserved genes from the global database and using a last common ancestor approach for classification based on hits to those specific genes," Chain explained. This type of tool will assemble a genome and compare it to a reference database in order to find a taxonomic match. "Because identical stretches are removed from the database, we have made a tremendous improvement in the reduction of false positive identifications."
There's still plenty of genome left to use for comparison, though. Even at the strain level, the most specific, about 85 to 90 percent of the average bacterial genome can be part of a unique identifier.
Processing a metagenomic sample begins with trimming and splitting the shotgun NGS reads, which are then mapped to reference genome databases that contain only the unique sequences. "You need to put it in a bin that you already know of," Chain explained.
To find that bin, the software looks for a Sequence Alignment/Map (SAM) file to parse, although the authors wrote that GOTTCHA can use other aligner output formats and data types.
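As a rough illustration of what parsing such a file involves, the sketch below reads SAM text lines and reports where each mapped read lands on its reference. It relies only on the standard SAM column layout (FLAG in column 2, RNAME in column 3, 1-based POS in column 4, CIGAR in column 6); the function name and sample records are hypothetical, and GOTTCHA's own parser may work differently.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def aligned_intervals(sam_lines):
    """Yield (reference, start, end) per mapped read, 0-based and half-open."""
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        if flag & 0x4 or rname == "*" or cigar == "*":  # unmapped read
            continue
        ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar)
                      if op in REF_CONSUMING)
        start = pos - 1  # SAM positions are 1-based
        yield rname, start, start + ref_len


# Hypothetical records, for illustration only.
sam_lines = [
    "@SQ\tSN:sigA\tLN:100",
    "r1\t0\tsigA\t11\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t0\tsigA\t16\t60\t5M2D5M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
intervals = list(aligned_intervals(sam_lines))
```

The resulting intervals are the raw material for the coverage calculations described later in the article.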
After the search algorithm aligns the reads to the sequences in the taxonomic level databases, GOTTCHA reports matches for only those genomes that have sufficient coverage to make a characterization.
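A minimal sketch of such a coverage filter follows. The article does not give GOTTCHA's actual cutoff or formula, so the threshold, function names, and data here are assumptions: overlapping read alignments are merged, the covered fraction of each genome's unique sequence is computed, and genomes below the cutoff are left out of the report.

```python
def covered_bases(intervals):
    """Count bases covered by at least one interval (0-based, half-open)."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start  # close the previous merged run
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # extend the current merged run
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def report_hits(hits, signature_lengths, min_breadth):
    """Keep genomes whose unique sequence is covered to at least min_breadth."""
    report = {}
    for genome, intervals in hits.items():
        breadth = covered_bases(intervals) / signature_lengths[genome]
        if breadth >= min_breadth:
            report[genome] = breadth
    return report


# Hypothetical alignments and an assumed cutoff, for illustration only.
hits = {"genome_A": [(0, 50), (40, 90)], "genome_B": [(10, 15)]}
signature_lengths = {"genome_A": 100, "genome_B": 100}
reported = report_hits(hits, signature_lengths, min_breadth=0.5)
```

Here genome_B attracts a read but is dropped, because a single short hit covers too little of its unique sequence to support a call.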
The tools and databases are provided through the open-source code repository GitHub under a public license. The GOTTCHA databases are each about 1 gigabyte in size and can be accessed via FTP, with instructions from the GitHub site. The program can be run on a laptop and does not need a large server or supercomputing cluster.
To test GOTTCHA, the scientists examined its performance in classifying nearly 2,000 draft genomes, including 1,027 novel strains, 658 novel species, 150 novel genera, 10 novel families, and four novel classes. Indicative of GOTTCHA's success, the authors wrote that the method was able to properly place the genomes of the novel strains into the correct bacterial species 92 percent of the time.
Not only do the unique sequence databases seem to reduce false positives, they increase the flexibility of the method, an important quality in a rapidly advancing field, according to the study authors.
Metagenomics has advanced to the point where creating the databases for metagenome analyses has become the primary challenge, rather than the methods to search them, Chain said. "The databases, once created, are almost immediately obsolete," he said. But GOTTCHA may be able to keep up with the furious pace of progress in the field.
"For all gene-specific classification tools, you normally have to redo a full search for inclusivity and exclusivity to retrain or reconstruct your database," he said. But Chain and his colleagues were working on a way to incrementally augment GOTTCHA's databases without having to re-compute everything every time they wanted to update a database.
Unique sequences can be reassigned as identifiers based on new information. For example, if a newly discovered bacterial species is found to share a sequence with a different genus of bacteria that GOTTCHA had previously used to identify that genus, that sequence could be reassigned to be an identifier for a higher-level taxonomy that includes both the existing genus and the new species.
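The reassignment described above amounts to moving a sequence up to the lowest taxon that contains both organisms. The sketch below shows that step under simplifying assumptions: lineages are stored as root-to-leaf lists, and all taxon names, the example sequence, and the function names are hypothetical rather than taken from GOTTCHA.

```python
def lowest_common_taxon(lineage_a, lineage_b):
    """Return the deepest (rank, name) shared by two root-to-leaf lineages."""
    shared = None
    for a, b in zip(lineage_a, lineage_b):
        if a != b:
            break
        shared = a
    return shared


def reassign(signature, owners, lineages, old_taxon, new_taxon):
    """Move a signature to the lowest taxon containing both organisms.

    If the two lineages share nothing, the signature is dropped entirely.
    """
    common = lowest_common_taxon(lineages[old_taxon], lineages[new_taxon])
    if common is None:
        owners.pop(signature, None)
    else:
        owners[signature] = common
    return common


# Hypothetical taxonomy and signature table, for illustration only.
lineages = {
    "genus_X": [("family", "fam_1"), ("genus", "genus_X")],
    "new_species": [("family", "fam_1"), ("genus", "genus_Y"),
                    ("species", "new_species")],
}
owners = {"ACGTTGCA": ("genus", "genus_X")}
new_owner = reassign("ACGTTGCA", owners, lineages, "genus_X", "new_species")
```

Only the affected entry changes, which is the appeal of an incremental update over rebuilding every database from scratch.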
GOTTCHA's ability to characterize complex microbial communities like soil and water samples is yet another advantage for the read-based approach.
"For more complex samples, occasionally you don't get good assembly, because of the sheer complexity of community," Chain said. "You need to sequence just too much to get any decent amount of assembly, so a read-based approach may be more practical." He added that assembly can be complicated by the fact that it takes a lot of memory, and a small lab might not have the necessary computational resources.
And it's not just complexity, but the relative abundance of the different microbial strains that makes metagenomic characterization difficult. "Most tools will do a good job on anything that's highly abundant. Our tool will do quite well even with organisms of low abundance," Chain said.
"We perform calculations for the amount of each genome that is covered by reads as well as the average depth of coverage and use these to help determine relative abundance instead of relying on number of reads or hits," he said.
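The difference between counting hits and using depth of coverage can be shown with a small example. The exact formula GOTTCHA uses is not given here, so this sketch assumes one simple definition: depth is the number of mapped bases divided by the number of covered bases, and relative abundance is each genome's share of the summed depth. All names and numbers are hypothetical.

```python
def relative_abundance(mapped_bases, covered_bases):
    """Estimate abundance from average depth rather than raw hit counts.

    mapped_bases:  dict of genome -> total read bases aligned to it
    covered_bases: dict of genome -> bases of unique sequence hit at least once
    """
    depth = {
        genome: mapped_bases[genome] / covered_bases[genome]
        for genome in mapped_bases
        if covered_bases.get(genome)  # skip genomes with no covered bases
    }
    total = sum(depth.values())
    return {genome: d / total for genome, d in depth.items()}


# Hypothetical totals, for illustration only.
mapped = {"genome_A": 9000, "genome_B": 3000}
covered = {"genome_A": 900, "genome_B": 100}
abundance = relative_abundance(mapped, covered)
```

In this example genome_A attracts three times as many read bases as genome_B, yet its average depth is lower, so the depth-based estimate ranks genome_B as the more abundant organism.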
The scientists tested GOTTCHA's performance against existing methods on several real and spiked samples, including soil, air filter, and human stool samples. It "consistently produced superior classification and relative abundance predictions compared to the three other classifiers considered," the authors wrote. The other classifiers were MetaPhlAn, mOTUs, and Kraken, which, like GOTTCHA, are all shotgun metagenome sequencing-based methods. Even at very low concentrations, GOTTCHA was able to identify bacterial organisms in the spiked samples, Chain said.
The paper also said that GOTTCHA can identify viruses in metagenomes, something the other methods couldn't do.
Though GOTTCHA may have outperformed its peers, it still had limitations. "The only things we can report with ultra high confidence are the contrived samples," Chain said, which would make the tool useful only for research, at least for the time being.
But, with improvements, that high-confidence identification could lead scientists to use GOTTCHA in many other applications. The authors suggested it could be used in environmental biosurveillance, agriculture and water quality monitoring, bioreactor yield monitoring, and even clinical diagnostics.
"We are working on a version that's a lot more hardened, that's written in more concise programming language to try and address the field of diagnostics, which is kind of the golden target right now," Chain said. "But there's still a lot more work that needs to be done in that respect."