It wasn’t all that long ago that the idea of taking a shovelful of dirt back to the lab and sequencing every scrap of accessible DNA in it for an overview of the soil community was considered completely novel. (And, as you may recall, somewhat ridiculous.)
Fast-forward a few years (past Eddy Rubin impressing researchers with an early community sequencing story about an acid mine drainage site, past Craig Venter announcing that he would sequence the Sargasso Sea and then sail around the world, collecting more samples everywhere he went) and metagenomics is practically de rigueur. From the bacteria in someone’s mouth to seawater to the viral community in feces, if it’s a group of organisms cohabitating in a particular environment, there’s a good chance someone has already taken a peek at it.
The growing popularity of the metagenomics field was only hurried along by the advent of cheaper sequencing technology. Early papers from scientists using 454’s next-gen sequencers reported the generation of reads upon reads of community sequence.
But the slew of sequences and rising interest level have a price, and that price is measured in headaches by bioinformaticists. Metagenomics brings a whole new series of data analysis challenges that have yet to be fully resolved: when you’re sequencing little bits of countless genomes, how do you make sense of the reads? When is it good science to count an unknown genome sequence as novel? And on what grounds can you call a match between the bits of DNA you’ve sequenced and similar sequences in GenBank?
Not to fear, though. Bioinformatics experts and their metagenomics brethren have the problem well in hand, and new ideas for how to manage, share, and analyze this kind of data are perking right along. In fact, things are happening so rapidly in this community that last month alone saw major advances for three of the leading analysis efforts: CAMERA at the University of California, San Diego; IMG and IMG/M from the Joint Genome Institute; and MEGAN, a new algorithm published by a collaborative team led by Stephan Schuster and Daniel Huson.
CAMERA
“There is so much data it really takes a lot of compute power,” says Eric Delwart, an associate professor at UC San Francisco, whose own field of interest is viral metagenomics. That’s no exaggeration. CAMERA, or the Community Cyberinfrastructure for Advanced Marine Microbial Research and Analysis, is based at UCSD in large part because of the impressive computing resources the university could bring to bear on the problem.
The database, hosted at UCSD’s California Institute for Telecommunications and Information Technology, or Calit2, runs on a 512-CPU cluster with about 200 terabytes of dedicated storage. A separate server handles the actual analysis work.
CAMERA began a little more than a year ago, when the Gordon and Betty Moore Foundation awarded $24.5 million in a seven-year grant to UCSD and the J. Craig Venter Institute to develop and implement an informatics infrastructure that would allow public access to the growing mass of metagenomics data, as well as tools to analyze it. To help with the speed of moving around all that data, the entire system was designed to utilize the OptIPuter, an optical network funded by the National Science Foundation that allows researchers to connect at speeds up to a hundred times faster than standard Internet connections.
Last month, the scientists in charge of the program went live with the first production version of CAMERA, which today houses metagenomic sequence data, related environmental parameters and other associated data, cross analysis of samples, and more. “If a scientist queries our database for a particular set of sequence data, he or she would also get back all the metadata associated with each metagenomic sequence read,” according to Paul Gilna, CAMERA’s executive director. “This is a very useful feature because the metadata could provide clues to understanding differences between microbial specimens, especially if you are comparing microbes that live in very different ocean environments.”
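To make Gilna’s point concrete, here is a minimal sketch of a query that hands back each matching read together with the metadata for the sample it came from. It is an illustration only, not CAMERA’s schema or interface: the `Read` and `SampleMetadata` records, their field names, the `query_reads` helper, and every value in the sample data are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class SampleMetadata:
    """Hypothetical environmental parameters recorded for one sample."""
    sample_id: str
    site: str
    depth_m: float
    temperature_c: float


@dataclass(frozen=True)
class Read:
    """Hypothetical sequence read, linked to its sample by sample_id."""
    read_id: str
    sample_id: str
    sequence: str


def query_reads(
    reads: List[Read],
    metadata_by_sample: Dict[str, SampleMetadata],
    predicate: Callable[[Read], bool],
) -> List[Tuple[Read, SampleMetadata]]:
    """Return every read that passes the predicate, paired with its sample metadata."""
    return [
        (read, metadata_by_sample[read.sample_id])
        for read in reads
        if predicate(read)
    ]


# Made-up sample data, for illustration only.
metadata_by_sample = {
    "S1": SampleMetadata("S1", "surface site", depth_m=5.0, temperature_c=24.0),
    "S2": SampleMetadata("S2", "deep site", depth_m=500.0, temperature_c=7.0),
}
reads = [
    Read("r1", "S1", "ACGTTGCA"),
    Read("r2", "S2", "GGGCATTA"),
    Read("r3", "S2", "ACGTACGT"),
]

# Ask for reads that start with "ACGT"; each result carries its metadata.
results = query_reads(reads, metadata_by_sample, lambda r: r.sequence.startswith("ACGT"))
for read, meta in results:
    print(read.read_id, meta.site, meta.depth_m, meta.temperature_c)
```

With the metadata attached to every hit, two reads with similar sequence can be compared by where they were collected, which is the kind of clue Gilna describes.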
The metagenomics data comes largely from the Venter Institute’s Global Ocean Sampling Expedition, and also includes data from Ed DeLong at MIT — who generated information from the Hawaii Ocean Time Series Station ALOHA — as well as from viral metagenomic studies by Forest Rohwer’s lab at San Diego State University. The CAMERA keepers have pledged to continue to add metagenomic data to the repository as the scientific community requests it.
IMG & IMG/M
If any team needed to build its own metagenomics analysis platform, it would have to be JGI. The DOE-funded institute has contributed many microbial genome sequences to public databases and has seen firsthand the need to make sense of the growing complexity within those repositories.
Two years ago, JGI launched IMG, its integrated microbial genomes system, which stores reference isolate genomes sequenced by the institute and other organizations that choose to make them publicly accessible. Last year, the JGI team followed up on the success of that system with IMG/M, a management and analysis system specifically geared toward metagenomic data. “IMG/M arose from our interest in making it easier for users to access and analyze their data,” Eddy Rubin, director of the institute, told Genome Technology when IMG/M launched. The system allows users to analyze data by navigating samples based on ecotype, disease, phenotype, and relevance. Both efforts were the result of JGI collaborations with the Lawrence Berkeley National Laboratory’s Biological Data Management and Technology Center.
A year later, JGI scientists were right on schedule with an upgrade to the system. In January, they released the IMG/M update, and last month they followed that with version 2.1 of the original IMG platform, which now contains 2,782 genomes. The latest IMG/M includes metagenomic data from studies of human distal gut, obese and lean mouse gut, a gutless marine worm, and more. Available analysis methods include help with gene prediction, assembly, and binning, according to JGI.
MEGAN
Both CAMERA and IMG/M are designed to run on their home supercomputers, so one of the major advantages Stephan Schuster sees with his team’s technology is that anyone can download it to a desktop and run it in their own lab. “We give the user more autonomy than the other sites do,” he says.
Schuster, a Pennsylvania State University researcher who collaborated with the University of Tübingen’s Daniel Huson, offers the community a program called MEGAN, designed specifically for metagenomic analysis. The project began about a year and a half ago, when Schuster’s lab was working on an ancient DNA sequencing effort for the woolly mammoth genome using 454 technology. The goal was to sequence the mammoth, but Schuster realized early on that only about half of the reads his team was getting from the DNA sample belonged to the mammoth genome. Wondering where the rest of the DNA stowaways came from, “I thought, why don’t we treat this as a metagenome project?” he recalls, thinking it could reveal information about organisms from the same time period as the mammoth.
The challenge for Schuster and his colleagues was dealing with completely random data, where they weren’t afforded such luxuries as relying on 16S markers or typical phylogenetic trees. “We came up with a new approach where we would use an existing phylogeny or species tree,” he says, with other taxonomic information added in along the way. “We do not look at phylogenetic distances,” he adds. Instead, sequences are binned on a purely statistical basis with the closest known organism as tracked in the NCBI database, and in a gross simplification, the bin with the most sequences wins. Users can work through the MEGAN interface to explore the analysis in depth. “It’s almost like flying with Google Earth through the different taxa,” Schuster says.
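The “gross simplification” above can be sketched in a few lines of code. This is only an illustration of the article’s summary (put each read in the bin of its closest known organism, then see which bins are fullest). It is not MEGAN’s code, and the published method is more involved. The function names, the input format (a score per candidate taxon for each read), and all of the scores and taxon assignments in the sample data are assumptions invented for this example.

```python
from typing import Dict, List, Tuple

# Each read maps to a list of (taxon, similarity score) hits, as might come
# from comparing the read against a reference database. Hypothetical format.
Hits = Dict[str, List[Tuple[str, float]]]


def bin_reads_by_best_hit(hits: Hits) -> Dict[str, List[str]]:
    """Place each read in the bin of its highest-scoring taxon.

    Reads with no hits at all go into an "unassigned" bin.
    """
    bins: Dict[str, List[str]] = {}
    for read_id, read_hits in hits.items():
        if not read_hits:
            bins.setdefault("unassigned", []).append(read_id)
            continue
        best_taxon, _score = max(read_hits, key=lambda hit: hit[1])
        bins.setdefault(best_taxon, []).append(read_id)
    return bins


def rank_bins(bins: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Order bins by read count, largest first (ties broken by name)."""
    counts = [(taxon, len(read_ids)) for taxon, read_ids in bins.items()]
    return sorted(counts, key=lambda item: (-item[1], item[0]))


# Made-up hits, for illustration only.
example_hits: Hits = {
    "read1": [("Loxodonta africana", 98.0), ("Escherichia coli", 41.0)],
    "read2": [("Loxodonta africana", 88.0)],
    "read3": [("Pseudomonas", 75.0), ("Escherichia coli", 60.0)],
    "read4": [],
}

example_bins = bin_reads_by_best_hit(example_hits)
ranking = rank_bins(example_bins)
for taxon, count in ranking:
    print(f"{taxon}: {count} read(s)")
```

Note that nothing here measures phylogenetic distance: a read is simply counted toward whichever known organism it resembles most, in keeping with Schuster’s description.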
Apparently there’s demand for the tool. MEGAN, which was described in last month’s print edition of Genome Research, draws requests from interested scientists almost every day, Schuster says. Still, he adds, “selling this new concept of metagenome analysis wasn’t always easy. People still had this idea of phylogenetic markers and phylogenetic distances in mind.”
Schuster already has plans to improve MEGAN, starting with enabling comparisons on 16S markers. Also, he’d like to add functionality for comparing multiple data sets at a time within the program (currently, data sets have to be exported and compared elsewhere).
“This is the first time in science that we can sequence faster than we can analyze,” Schuster says. “As [metagenomics] is still a very young field, what we do here right now are just the baby steps. In the upcoming months and years we hope to develop this to become even more sophisticated.”