This week, the California Institute for Telecommunications and Information Technology (Calit2) christened the first large-scale public bioinformatics resource specifically designed to support metagenomics data.
The database, called CAMERA (Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis), was launched concurrently with the publication of a trio of papers in PLoS Biology describing the largest metagenomics dataset collected to date — 7.7 million sequencing reads covering 6.3 billion base pairs — from the J. Craig Venter Institute’s Global Ocean Sampling expedition.
Calit2 began developing the resource just over a year ago, supported by a $24.5 million grant from the Gordon and Betty Moore Foundation [BioInform01-20-06]. Development partners include the JCVI, the University of California, San Diego, and the Scripps Genome Center.
CAMERA 1.0 includes data from the JCVI GOS expedition, as well as a set of around 177,000 sequencing reads generated on the 454 Life Sciences platform in a survey of marine viral organisms conducted by San Diego State University. It also includes a “vertical profile” of marine microbial communities sequenced at different ocean depths by the Massachusetts Institute of Technology.
“Our long-term goal and hope is that CAMERA basically becomes a community watering hole for metagenomics research,” Paul Gilna, executive director of CAMERA, told BioInform. “So members of the scientific community wishing to do research on metagenomics data would come to CAMERA and gain access to the data, and over time more and more tools to help them analyze that data. While members of the scientific community who are generating metagenomics data, we would hope, would also come to the same watering hole to share their data with the rest of the community, and as part of that we plan to have data upload capabilities,” he added.
Short-term, Gilna said CAMERA’s priorities include growing its set of metadata “so that it’s as complete as possible,” and adding data from other public metagenomic repositories to enable “one stop shopping” for metagenomics researchers.
Other goals for the near term include improved visualization, alignment, and phylogenetic analysis tools, said Gilna.
Though the sequence data from metagenomics studies will also be deposited in GenBank, Gilna said that CAMERA was designed from the ground up to meet the unique requirements of metagenomics data analysis.
“The real focus is on becoming the best resource we can be for the field of metagenomics,” he said. “We’re very dedicated to that particular field and we’re looking to push that field forward, whereas GenBank has to serve many different fields.”
Gilna said that in addition to the basic sequence data, CAMERA offers researchers access to the metadata that describes the geospatial or environmental conditions associated with particular samples — an important aspect of metagenomics analysis that sets it apart from traditional sequence analysis, in which this contextual information is of little value.
In addition, CAMERA offers “a considerable amount of pre-computed data analysis” that is expected to save researchers a great deal of time and CPU-hours. Analyzing metagenomics data, Gilna said, “takes a significant amount of compute power.”
CAMERA has a dedicated 512-CPU, 5-teraflop cluster with 200 terabytes of storage, and also has access to the resources at the San Diego Supercomputer Center, as well as the nationwide, National Science Foundation-supported TeraGrid, Gilna said.
The CAMERA project is also collaborating with the NSF-funded OptIPuter (Optical networking, Internet Protocol) project, a distributed computational backbone based on optical networking that is expected to enable scientists to transmit terabyte- to petabyte-scale data sets in real time.
One component of the OptIPuter project should address the challenge of visualizing huge amounts of metagenomics data. So-called OptIPortals, composed of multiple LCD displays, can scale up to hundreds of millions of pixels in order to present large data sets on a wall-sized display.
According to Calit2, metagenomics OptIPortals are already deployed at JCVI and UCSD and are currently being installed at metagenomics labs at the University of Washington, San Diego State University, MIT, and elsewhere.
Gilna said that the CAMERA development team hopes to eventually provide a suite of bioinformatics tools specifically designed for metagenomics data analysis, but noted that the current version of the resource is “limited to standard genomics-analysis tools” such as BLAST searches against metagenomics data sets.
One tool that CAMERA does include is a new comparative genomics tool developed by the JCVI research team to analyze the GOS data. The method, called “fragment recruitment plots,” graphically depicts the similarity between the sampled microbial populations and reference microbial genomes.
Only around 30 percent of the GOS reads were fully aligned, or “recruited,” to the 584 available reference genomes, according to the PLoS Biology paper describing the method, while the remainder aligned with low identity and over only a portion of the read.
In order to represent the full range of biochemical diversity of the data, the JCVI team developed a graphical tool that shows where each read, or fragment, aligns with a reference genome along the horizontal axis and then plots its degree of similarity to the reference genome along the vertical axis.
The plotted reads are color-coded according to the samples to which they belong, “thus indirectly representing various forms of metadata (geographic, environmental, and laboratory variables),” the authors wrote.
“While simple in nature, the resulting plots can be extremely informative due to the volume of data being presented,” they wrote, adding that the approach “is one of the first tools to make extensive use of the metadata collected during a metagenomic sequencing project.”
According to the JCVI researchers, “the usefulness of [fragment recruitment plots] and related approaches will only grow as the robust collection of metadata becomes routine and the variables that are most relevant to microbial communities are further elucidated.”
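For readers who want a concrete picture of the plot described above, the following is a minimal sketch of how fragment recruitment coordinates might be computed. It is not the JCVI implementation: the `Alignment` record, the `recruitment_points` and `assign_colors` functions, the 50 percent identity cutoff, the color palette, and the three example alignments are all hypothetical and chosen only for illustration. Each aligned read becomes a point whose horizontal position is where it lands on the reference genome, whose vertical position is its percent identity, and whose color is set by the sample it came from.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """One read aligned to a reference genome (hypothetical record layout)."""
    read_id: str
    sample: str       # sampling site the read came from
    ref_start: int    # start of the alignment on the reference (0-based)
    ref_end: int      # end of the alignment on the reference (exclusive)
    matches: int      # identical bases within the alignment
    aligned_len: int  # total length of the alignment


def recruitment_points(alignments, min_identity=50.0):
    """Return (x, y, sample) tuples for plotting.

    x is the midpoint of the alignment on the reference genome,
    y is the percent identity of the alignment.
    min_identity is an assumed cutoff, not a value from the article.
    """
    points = []
    for a in alignments:
        identity = 100.0 * a.matches / a.aligned_len
        if identity < min_identity:
            continue  # too dissimilar to count as recruited in this sketch
        x = (a.ref_start + a.ref_end) / 2
        points.append((x, identity, a.sample))
    return points


def assign_colors(points, palette=("red", "blue", "green", "orange")):
    """Map each sample name to a color so points can be color-coded by sample."""
    samples = sorted({sample for _, _, sample in points})
    return {s: palette[i % len(palette)] for i, s in enumerate(samples)}


# Made-up example data: three reads from two sampling sites.
alignments = [
    Alignment("r1", "site_A", 100, 900, 760, 800),
    Alignment("r2", "site_B", 5000, 5800, 560, 800),
    Alignment("r3", "site_A", 12000, 12400, 160, 400),
]

points = recruitment_points(alignments)
colors = assign_colors(points)

for x, identity, sample in points:
    print(f"x={x:.0f}  identity={identity:.1f}%  color={colors[sample]}")
```

The resulting points could be handed to any scatter-plot routine. Because color is keyed to the sample, any metadata recorded per sample (location, depth, temperature) is carried into the plot indirectly, which is the property the authors highlight.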