Researchers have developed a new software program to analyze large metagenomic data sets — including data obtained from next-generation sequencers — in order to identify the species they contain.
The package, one of the first metagenomics software tools to analyze shorter reads from next-generation sequencing platforms, could place the emerging discipline of environmental sequencing within reach of more researchers — and possibly drive the adoption of new sequencing technologies in the field.
“It is poised towards a single investigator … and you don’t have to be an expert bioinformatician to be able to analyze metagenomic data,” said Stephan Schuster, one of the developers of the software and an early adopter of 454 Life Sciences’ sequencing technology. The software, he pointed out, can analyze any type of DNA sequence data, ranging from 35-base reads to Sanger reads and partially assembled sequences.
An international research team, led by Schuster, a professor at Pennsylvania State University, and Daniel Huson, a researcher at the Center for Bioinformatics at Tübingen University in Germany, developed the program, called MEGAN for Metagenome Analyzer, and described it in a paper published online last week in Genome Research.
To use MEGAN, users must first compare their metagenomic data set against a database of known sequences using a comparison tool such as Blast. MEGAN then takes the output of this comparison and assigns taxon IDs to the species names it contains, using a database such as the National Center for Biotechnology Information taxonomy. It then displays the results as a phylogenetic tree.
Users only need to use the comparison tool once and can then adjust the stringency of the alignment in MEGAN. “You only need to compute once and can play with the different settings later on a notebook,” Schuster said.
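The "compute once, adjust later" workflow can be illustrated with a minimal Python sketch. It is not MEGAN's code. It assumes the Blast hits were saved in the common 12-column tab-separated layout, with the bit score in the last column, and it uses a hypothetical `min_bitscore` cutoff as the stringency setting. The article does not say which thresholds MEGAN exposes, and the toy hits below are invented for illustration.

```python
import csv
import io
from collections import namedtuple

# One alignment between a read and a database sequence.
Hit = namedtuple("Hit", "read subject bitscore")


def parse_tabular_hits(handle):
    """Parse 12-column tab-separated Blast output into Hit records."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hits.append(Hit(read=row[0], subject=row[1], bitscore=float(row[11])))
    return hits


def filter_hits(hits, min_bitscore):
    """Keep only hits at or above the chosen stringency (hypothetical cutoff)."""
    return [h for h in hits if h.bitscore >= min_bitscore]


# Invented example output standing in for a saved Blast result file.
TOY_OUTPUT = "\n".join([
    "read1\tsubjA\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-40\t180.0",
    "read1\tsubjB\t85.0\t100\t15\t0\t1\t100\t1\t100\t1e-10\t60.0",
    "read2\tsubjC\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-05\t40.0",
])

# The expensive comparison is parsed once...
hits = parse_tabular_hits(io.StringIO(TOY_OUTPUT))

# ...and the stringency can then be changed as often as needed.
for threshold in (35.0, 100.0):
    kept = filter_hits(hits, threshold)
    print(f"min_bitscore={threshold}: {len(kept)} of {len(hits)} hits kept")
```

Because the filter runs on the stored hits, trying a new threshold costs a fraction of a second rather than a new database search.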
The software — originally called Genome Taxonomy Browser, but later renamed to avoid confusion with another program — came out of a sequencing project of a mammoth bone, published by Schuster’s group last year, that used 454’s GS 20 sequencer. About half of the DNA sequence reads the researchers obtained in that project came from contaminating species, making it essentially a metagenomics data set. In response, the scientists developed MEGAN.
In order to test the utility of MEGAN for short DNA reads obtained from different next-generation sequencers, the researchers simulated 5,000 random reads of different lengths from two known genomes, those of Escherichia coli K12 and Bdellovibrio bacteriovorus HD100. They chose 35-base, 100-base, 200-base, and 800-base reads — roughly corresponding to the output of Solexa’s Genome Analyzer, 454’s Genome Sequencer 20, 454’s Genome Sequencer FLX, and conventional Sanger sequencing, respectively.
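A read simulation of this kind can be sketched in a few lines of Python. The article does not describe how the researchers drew their reads, so the sketch below makes its own assumptions: start positions are uniform, reads come from the forward strand only, and no sequencing errors are added. A random toy sequence stands in for a real genome, and `simulate_reads` is an illustrative name, not a function from MEGAN.

```python
import random


def simulate_reads(genome, read_length, n_reads, seed=0):
    """Draw n_reads substrings of read_length at uniform random positions."""
    if read_length > len(genome):
        raise ValueError("read_length exceeds genome length")
    rng = random.Random(seed)
    last_start = len(genome) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # inclusive on both ends
        reads.append(genome[start:start + read_length])
    return reads


# Toy stand-in for a genome; a real test would load a sequence file instead.
_rng = random.Random(42)
toy_genome = "".join(_rng.choice("ACGT") for _ in range(10_000))

# Read lengths tested in the study, 5,000 reads each.
for length in (35, 100, 200, 800):
    reads = simulate_reads(toy_genome, length, n_reads=5000)
    print(f"{length}-base reads: {len(reads)} simulated")
```

Each simulated set would then be run through the same Blast-and-assign pipeline, so that assignments can be checked against the known source genome.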
“The answer is, you can even identify a species from 35 base pairs, but you do so with a very low efficiency,” said Schuster. He sees 200-base reads as a good compromise, “because it gives you a higher confidence for the individual sequence tag, but at the same time, you maintain the advantage of the low cost of next-generation sequencing,” he said.
Other methods to analyze metagenomic data have mostly relied on screening specific phylogenetic markers and generating partial genome assemblies of Sanger reads, according to Schuster. But those approaches have shortcomings like cloning biases and sequencing biases, while next-generation sequencing, which generates reads at random, is more bias-free, he said. “This is why it was so important that we come up with a new way of making phylogenies for metagenomes that can use random reads,” he said.
“The big difference of [our method] is, it is using the power of statistics to get to very similar results that ultradeep sequencing has previously been used for.”
Using MEGAN to analyze a small amount of data can also yield some of the same results as in-depth metagenomic sequencing projects, but much more cheaply, the researchers believe. To prove their point, they re-analyzed 10,000 DNA reads from the Venter Institute’s Sargasso Sea project, which generated almost 2 million Sanger reads, and were able to identify the 16 taxa present in the samples from that smaller dataset. “We simply chose this example to demonstrate that a very good overlook of all present taxa does not need very expensive metagenomics projects,” Schuster said.
Users and potential users of the software agree that MEGAN should allow many researchers to perform their own metagenomic analyses. However, some questioned whether Blast is the best sequence-comparison algorithm to be used as the first step.
“It’s well known that Blast is not an accurate means of phylogenetic assignment,” said Susannah Green Tringe, a researcher at the DOE’s Joint Genome Institute, in an e-mail message. Tringe has used MEGAN to “let collaborators know what we’re seeing in their data at the initial QC stage” and found it “very handy.”
Despite Blast’s shortcomings, Tringe said, the algorithm “is one of the only tools we have for analyzing individual sequence reads and, on the whole, the range of Blast hits is usually a useful piece of information when analyzing metagenomic data.”
Other labs have written their own software to analyze the output of Blast searches phylogenetically, she said, “but MEGAN presents the data in a far more informative and interactive format.”
Jonathan Eisen, a professor at the University of California, Davis, agreed that MEGAN should be “a useful tool for many people doing metagenomics analysis.” However, like Tringe, he criticized the use of Blast. “Blast is a way of measuring sequence similarity,” he said, but “similarity does not always equal relatedness.” Eisen said he prefers to assign sequence reads from metagenomics data sets to organisms using phylogenetic analysis instead.
Daniel Huson, MEGAN’s co-developer, responded that “by design, a read will only give rise to a specific taxon identification if the set of sequences that it matches comes from a closely related group of species. Matches that do not reflect relatedness usually do not follow this pattern and thus will produce unspecific taxon identifications, leading to false negatives rather than false positives.”
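One way to get the behavior Huson describes is to place each read at the lowest common ancestor of all the taxa it matches. The article does not name the algorithm, so the Python sketch below illustrates that idea under stated assumptions and is not MEGAN's implementation. The `PARENT` table is a deliberately simplified toy taxonomy with several ranks omitted.

```python
# Toy child -> parent taxonomy (simplified; intermediate ranks omitted).
PARENT = {
    "Escherichia coli": "Enterobacteriaceae",
    "Salmonella enterica": "Enterobacteriaceae",
    "Enterobacteriaceae": "Gammaproteobacteria",
    "Gammaproteobacteria": "Proteobacteria",
    "Bdellovibrio bacteriovorus": "Deltaproteobacteria",
    "Deltaproteobacteria": "Proteobacteria",
    "Proteobacteria": "Bacteria",
    "Bacteria": "root",
}


def lineage(taxon):
    """Return the path from a taxon up to the root, most specific first."""
    if taxon != "root" and taxon not in PARENT:
        raise KeyError(f"unknown taxon: {taxon}")
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(taxa):
    """Return the most specific node shared by the lineages of all taxa."""
    taxa = list(taxa)
    if not taxa:
        raise ValueError("need at least one taxon")
    common = lineage(taxa[0])
    for taxon in taxa[1:]:
        other = set(lineage(taxon))
        # Keep the specific-to-root order while dropping unshared nodes.
        common = [node for node in common if node in other]
    return common[0]


# Matches within a closely related group give a specific assignment.
print(lowest_common_ancestor(["Escherichia coli", "Salmonella enterica"]))
# Matches scattered across distant groups give an unspecific one.
print(lowest_common_ancestor(["Escherichia coli", "Bdellovibrio bacteriovorus"]))
```

In this scheme a read with scattered, unrelated matches drifts toward the root instead of being pinned to a wrong species, which is the "false negatives rather than false positives" trade-off Huson describes.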
Schuster said that later versions of the software will address functional categories, in addition to taxa. He is also planning to interface MEGAN with functional databases such as COG, Pfam, GO, and others, as well as with the Ribosomal Database Project, “so that in the end, it will be able to compare metagenomic data, 16S data, and functional data and make this into a combined analysis.”
MEGAN is available free of charge to academic users, who can download the software after registering. Corporate users need to obtain a commercial license.