A team at the Department of Energy’s Joint Genome Institute has launched a web server that allows researchers to gauge how well sequence-analysis software tools handle metagenomic data sets.
The team is also in the process of creating a similar benchmarking resource for metagenomic data generated with 454 Life Sciences’ sequencing platform.
The server, called Fidelity of Analysis of Metagenomic Samples, or FAMeS, stems from a project that JGI initiated to guide its own metagenomic data analysis efforts.
Nikos Kyrpides, head of JGI’s Genome Biology Program, told BioInform that his group has been involved in the still-nascent field of metagenomics for around two years, and has analyzed data sets culled from environments as diverse as the gutless worm, enhanced biological phosphorus removal sludge, and the termite gut.
“By doing that over the last two years, we realized that … we could not really identify or quantify the accuracy level of the methods we had for processing the data,” he said. “The reason for that is that all of the methods that have been used until recently were originally developed for processing isolate genomes.”
With that in mind, and with the knowledge that many more metagenomic data sets are coming online, Kyrpides and colleagues turned to the JGI’s vast storehouse of sequenced microbial genomes to create three simulated data sets of varying complexity. They used these to evaluate the performance of several commonly used sequence assembly, gene prediction, and phylogenetic binning methods.
The JGI team randomly selected 113 isolate genomes available through its Integrated Microbial Genomes database, and “mixed” the original sequencing reads to recreate a metagenomic data set.
Kyrpides said that a key element of the project was the creation of three separate simulated data sets, which varied in their makeup in order to mirror the spectrum of potential variation in microbial communities.
The low-complexity data set, called simLC, contains one primary population, as in sludge, and is therefore relatively easy to assemble, Kyrpides said. The high-complexity set, simHC, has no dominant population, as in soil, in which all species are generally distributed equally. The moderate-complexity set, simMC, includes more than one dominant population and numerous low-abundance species.
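As a rough illustration of how the three community profiles differ, the following Python sketch draws reads from per-genome read pools according to a relative-abundance profile. The function names and the specific weights (50, 20, 15, 10) are assumptions chosen for illustration only; they are not the abundances the JGI team used, which are described in the Nature Methods paper.

```python
import random


def abundance_profile(n_genomes, kind):
    """Return hypothetical relative abundances (summing to 1) for a community.

    The weights below are illustrative assumptions, not the published values.
    """
    if n_genomes < 3:
        raise ValueError("need at least 3 genomes for these toy profiles")
    if kind == "simLC":
        # One dominant population, many rare ones (sludge-like).
        weights = [50.0] + [1.0] * (n_genomes - 1)
    elif kind == "simMC":
        # More than one dominant population plus low-abundance species.
        weights = [20.0, 15.0, 10.0] + [1.0] * (n_genomes - 3)
    elif kind == "simHC":
        # No dominant population (soil-like): all genomes equally abundant.
        weights = [1.0] * n_genomes
    else:
        raise ValueError(f"unknown profile: {kind}")
    total = sum(weights)
    return [w / total for w in weights]


def sample_reads(reads_by_genome, profile, n_reads, seed=0):
    """Mix reads from isolate genomes into one simulated data set.

    reads_by_genome maps a genome name to its list of reads. Each sampled
    read keeps its source genome as a label, which is what makes the
    simulated set usable as a benchmark with known answers.
    """
    rng = random.Random(seed)
    genomes = list(reads_by_genome)
    picks = rng.choices(genomes, weights=profile, k=n_reads)
    return [(g, rng.choice(reads_by_genome[g])) for g in picks]
```

Because every simulated read carries the name of the genome it came from, the output of an assembler, gene caller, or binning tool can later be checked against a known answer.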
Kyrpides and colleagues described the data sets and the results of their benchmarking study in the May issue of Nature Methods. In the study, they assessed a series of sequence-analysis methods in use at the JGI: the assembly algorithms Jazz, Arachne, and Phrap; and gene-prediction tools Fgenesb and a combination of Critica and Glimmer.
In addition, they evaluated three phylogenetic binning methods: a sequence similarity-based approach called Blast hit distribution; an oligonucleotide frequency method called kmers; and PhyloPythia, a new algorithm developed by IBM Research’s bioinformatics and pattern discovery group.
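For readers unfamiliar with oligonucleotide frequency methods, the general idea is to summarize each sequence fragment by how often each short word (k-mer) occurs in it, on the premise that fragments from the same organism tend to have similar word usage. The Python sketch below computes such a frequency vector. It is a generic illustration with an assumed function name and an assumed default of k=4; the article does not describe how the JGI's kmers tool or PhyloPythia are implemented.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return the relative frequency of every DNA k-mer in seq.

    Windows containing characters other than A, C, G, T (for example N)
    are ignored. The result always has 4**k entries.
    """
    seq = seq.upper()
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in all_kmers)
    if total == 0:
        return {m: 0.0 for m in all_kmers}
    return {m: counts[m] / total for m in all_kmers}
```

A binning method built on this idea would then group or classify fragments by comparing their frequency vectors. Short fragments yield few windows and therefore noisy vectors.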
While noting that “the worst and best is relative, and everything depends on what you’re looking for,” Kyrpides said that the evaluation pointed to some clear weaknesses in some of the methods.
In the case of sequence assembly, for example, “Phrap is the most greedy assembler. It will try to assemble everything, so the error rate is very high,” Kyrpides said. “On the other hand, Arachne and Jazz are much more conservative, so they have much lower coverage on the assembly of the data, but what they do is very accurate.”
For gene prediction, Kyrpides said that the team was actually surprised at the results, in which Critica/Glimmer “was really bad, even though it was the one that did the gene prediction on the original data sets.” Fgenesb, developed by bioinformatics software firm Softberry, “was really much, much better at all levels,” he said.
The phylogenetic binning step brought another surprise, Kyrpides said. Both methods in common use at JGI – Blast hit distribution and kmers – performed very poorly.
“We realized that the number of errors we get in binning with oligonucleotide frequency is really huge … and we saw that PhyloPythia had a huge advantage,” he said.
As a result of these findings, Kyrpides said the JGI team has standardized its metagenomic analysis pipeline around Arachne, Fgenesb, and PhyloPythia, though he noted that this arrangement is subject to change as new and better methods come online.
Indeed, the JGI team is currently using the FAMeS data sets to evaluate several additional methods, such as the AMOS modular assembler developed by Steve Salzberg’s group at the University of Maryland, and the GeneMark gene-prediction software developed by Mark Borodovsky at Georgia Tech.
Kyrpides noted that the JGI really has no interest in developing new methods to grapple with metagenomic data, but expects that FAMeS will be a useful resource for “groups out there with a history of developing assemblers and gene callers.”
“We hope this will provide a tool for the community for testing new methods, and trying to figure out if a new method really performs better compared to the ones we have on our website,” he said.
In addition, FAMeS should be of interest to genomic analysis groups like the JGI, which rely on public-domain methods to handle metagenomic data. Being able to benchmark a given method on a “gold standard” that is similar in complexity to a given experimental data set gives researchers valuable information about the methods at their disposal, he said. “By doing all that, not only can we find out exactly what is the accuracy of each of the methods, but can also pinpoint the [weaknesses] of the method,” which guides researchers in modifying the parameters of the method or the algorithm itself in order to increase the accuracy, he said.
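To make the "gold standard" idea concrete, here is a minimal Python sketch of scoring a binning result against known source genomes. The function name and the two metrics are assumptions for illustration; the measures actually reported in the Nature Methods study may be defined differently.

```python
def score_binning(truth, predicted):
    """Compare predicted bins with the known source of each fragment.

    truth maps fragment id -> true source label.
    predicted maps fragment id -> assigned label, or None if unassigned.
    Fragments missing from predicted count as unassigned.
    """
    assigned = {
        frag: label
        for frag, label in predicted.items()
        if label is not None and frag in truth
    }
    correct = sum(1 for frag, label in assigned.items() if truth[frag] == label)
    n = len(truth)
    return {
        # How much of the data the method was willing to place.
        "assigned_fraction": len(assigned) / n if n else 0.0,
        # How often it was right when it did place a fragment.
        "accuracy_of_assigned": correct / len(assigned) if assigned else 0.0,
    }
```

Reporting the two numbers separately reflects the trade-off Kyrpides describes for assemblers: a method can cover more of the data at the cost of more errors, or cover less and be more accurate.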
Kyrpides said that his group plans to create a similar server based on a simulated metagenomic data set built with a combination of Sanger and 454 reads sequenced at the JGI.
“It’s well known that the bulk of new metagenomic data will be coming from 454,” he said. “It is very cheap, a lot of universities are already using it, and there is a great deal of 454 data coming from different groups for environmental sequencing.”
Kyrpides said that this project is just getting underway, however, so there are no results to discuss yet.