The emerging field of metagenomics got a boost from the Gordon and Betty Moore Foundation last week when it awarded $24.5 million to the University of California, San Diego, and the J. Craig Venter Institute to build a publicly available informatics infrastructure to help store, analyze, visualize, and disseminate the massive amounts of data gleaned from environmental sequencing.
The award is fairly significant, even by the standards of the Moore Foundation, which has bestowed a total of $895 million since it was established in late 2000 by Intel co-founder Gordon Moore and his wife. In 2004, the most recent year for which data are available, the foundation awarded $25.8 million across its entire Marine Microbiology initiative, which included 14 projects. Nearly half of the Marine Microbiology funding that year, $13.2 million, went to the Venter Institute: a $4.2-million grant to fund the institute's Sorcerer II marine microbial sampling expedition, and a $9-million grant to sequence the genomes of 130 marine microbes (https://research.venterinstitute.org/moore/).
Now, the foundation will help bring all that data together and put it in a form that will be accessible to the broader research community. The seven-year grant will support a project called the Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA), which will include hardware, software, and data resources related to marine metagenomics.
The UCSD division of the California Institute for Telecommunications and Information Technology (Calit2) will lead the project, along with the Venter Institute and UCSD's Center for Earth Observations and Applications at the Scripps Institution of Oceanography. Other partners include UCSD's San Diego Supercomputer Center, the Scripps Genome Center, and the National Biomedical Computation Resource at UCSD.
The CAMERA project will present challenges in data integration, visualization, and communications infrastructure, Peter Arzberger, director of the NBCR, told BioInform. Arzberger said that in addition to advancing the understanding of marine ecosystems, which is the primary goal of the effort, the collaborative aspects of the underlying IT infrastructure will "revolutionize how we work with data."
The backbone of the system will be the so-called OptIPuter optical network, a project funded by the National Science Foundation and led by Calit2 director Larry Smarr, who also serves as principal investigator on the CAMERA project. NSF kicked off the OptIPuter project in 2002 with a five-year, $13.5-million grant. "Linking Venter Institute to Calit2 will be the first persistent application" of the network, Smarr said in a statement. The system will eventually enable other scientists to plug their compute clusters into the CAMERA infrastructure.
OptIPuter is expected to offer a hundred-fold increase over current connectivity standards, meaning that "distance is no longer a bottleneck" for collaborative projects involving large amounts of data, Arzberger said.
On the hardware side, CAMERA will have a dedicated cluster of approximately 1,000 processors and several hundred terabytes of storage, and will also be plugged into the NSF's TeraGrid distributed computing infrastructure.
Arzberger was unable to provide further details on the plans for the cluster, other than to say it will be on the scale of "easily a teraflop or beyond."
Putting the Metadata in Metagenomics
But hardware and connectivity are only part of the informatics underpinnings for the CAMERA project. The effort also poses some formidable challenges for data analysis, visualization, and integration.
"I think some of the interesting things are going to come when we can ask questions about all the analysis that goes on at all the spatial levels that we'll be able to bring into this," Arzberger said. In addition to microbial sequence data, the CAMERA database will include metadata associated with entire microbial communities as well as metadata describing where the samples were collected. The project will also incorporate satellite images corresponding to specific locales in order to provide a broader environmental context for the microbial samples.
While all of the sequence data from the project will be deposited in GenBank, "it's the environmental data that are coupled with the actual [microbial] communities that I think is going to be the new component of what we have," Arzberger said. "So you can begin to ask questions not only about which microbes hang out with others, but where they are located. And if you're looking at the protein-coding part, what is the distribution of this protein-coding part throughout the world? So then you can begin to ask, 'Well, what are the environmental factors? Will these things thrive or die in the presence or absence of this factor?'"
Some analysis and annotation tools that will become part of the system are already in use at the Venter Institute. Terry Gaasterland's lab at the Scripps Genome Center is also providing some annotation tools, Arzberger said.
Once some initial data become available to the broader community, Arzberger said he expects to see more external tool development. For example, he said, "My guess is that this resource is going to be of use to evolutionary biologists as well, and we'll be able to look at communities of things evolving. What are the right algorithms for looking at community evolution? I think there will be a number of analysis tools [developed] that way."
In addition, he said, "because of the complexity and the density of the data, you wouldn't want to look at it in your PC screen, because you just couldn't tease it apart." One way that the CAMERA grantees expect to tackle this problem is through large-scale tiled display walls. Scripps and Calit2 currently have these in place, and "the Venter Institute is going to get one as part of the award," Arzberger said. Researchers will be able to use these movie screen-sized displays to interact with the data in a way that is just not possible on a PC, he said, "so you can really focus in on one particular area, but then you can see where it fits in the entire context."
Arzberger was unable to provide specific development milestones for the seven-year project, but he said that "something very concrete" should be available to the broader research community within six months.
Bernadette Toner