The National Institutes of Health this week awarded more than $14 million to informatics projects intended to support the Human Microbiome Project, which aims to take a genomic tally of microbes that live in the body and on the skin.
The size of the grants, which comprise three-quarters of the total funding that NIH provided in its first round of HMP awards, underscores the agency’s view that computational tools are sorely needed to help the project make headway in its metagenomic survey of the body’s microbial flora in health and disease.
NIH awarded $9.9 million over five years to establish a data-analysis and -coordination center, or DACC, for the project, which will be directed by Owen White of the University of Maryland School of Medicine.
In addition, Mihai Pop of the Center for Bioinformatics and Computational Biology at the University of Maryland was awarded $780,000 over three years to develop assembly and analysis software for the large-scale metagenomics initiative; Daniel Haft of the J. Craig Venter Institute was awarded $1.6 million over three years to develop methods for characterizing proteins in metagenomic data sets; Robin Knight of the University of Colorado at Boulder was awarded $1.1 million over three years to develop tools for studying the dynamics of microbial communities; and Yuzhen Ye of Indiana University, Bloomington, was awarded $770,000 over three years to develop methods for fragment assembly diversity analysis for the human microbiome.
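As a quick arithmetic check, the five awards listed above can be summed and compared with the "more than $14 million" headline figure. The sketch below uses only the dollar amounts reported in this article. The implied overall first-round total is an inference from the "three-quarters" statement and is an assumption, not a figure NIH reported; the function names are illustrative.

```python
# Award amounts as reported in the article, in millions of US dollars.
awards_musd = {
    "DACC (White)": 9.9,
    "Pop": 0.78,
    "Haft": 1.6,
    "Knight": 1.1,
    "Ye": 0.77,
}


def total_awards(awards):
    """Sum the informatics awards, rounded to the nearest $10,000."""
    return round(sum(awards.values()), 2)


def implied_round_total(informatics_total, share=0.75):
    """Estimate the full first-round total if informatics is `share` of it.

    Assumption: 'three-quarters' is treated as exactly 0.75.
    """
    return round(informatics_total / share, 1)


informatics_total = total_awards(awards_musd)
print(f"Informatics awards: ${informatics_total} million")
print(f"Implied first-round total: about ${implied_round_total(informatics_total)} million")
```

The listed awards sum to roughly $14.15 million, which is consistent with the headline figure.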
As the HMP goes forward, many of the sequences generated will belong to microbes that have not yet been identified because they have never been cultured in a laboratory. The project will therefore start with a pilot phase to sequence a set of reference microbial genomes in five areas: the digestive tract, mouth, skin, nose, and vagina. In this phase, approximately 600 microbial genomes will be collected and sequenced; added to the microbial genomes already available, they will bring the reference set to a total of 1,000.
Alan Krensky, director of the Office of Portfolio Analysis and Strategic Initiatives, which oversees the NIH Roadmap for Medical Research, said in a statement that new tools and technologies are “central” to the goals of the project. “An exceptional amount of information will be generated by this project and we need robust technologies and analytical tools that are equal to the task,” he said.
Since the microbiome project is slated to be a community resource, all data it generates will be deposited in DACC as well as other public databases, including those supported by the National Center for Biotechnology Information, Vivien Bonazzi, program director in genome informatics at the National Human Genome Research Institute’s Division of Extramural Research, told BioInform.
In her view, tool development must accompany this large project in order to create a reference set of genome sequences from microbial communities, and enable analysis of the variation and the interactions of the microbiome on and in the body.
Bonazzi currently oversees an “interim” DACC housed at NIH, which includes data from sequencing centers at JCVI, Baylor College of Medicine, Washington University, and the Broad Institute. The interim center will migrate to White’s lab at the University of Maryland in the next few weeks, she said.
The DACC will ensure that data goes to the public repositories and also that the annotation pipeline in the sequencing centers moves forward according to agreed-upon methods. “They also need to be able to display that [data] to the not so bioinformatically informed person, someone who doesn’t use those tools all the time,” she said. The DACC may also offer the ability to run Blast searches on many sequences in the database, she said.
White said he is assembling a DACC team with collaborative expertise in genome annotation, 16S analysis, and metagenomic sample analysis. “Like everything in informatics, if you have got someone who can code very well, but they don’t understand the biology, [so] what can they code? And a biologist who can’t code won’t be able to get stuff up on a website,” said Bonazzi.
The open source tools under development are not just intended for the members of the HMP but for the entire scientific community, she said.
Not only will the HMP generate large volumes of data, but much of this information will be new. While a “reasonably large” number of bacterial genomes have been sequenced, “there are a lot more of them we don’t know much about at all,” she said.
The reference genomes are set to be fully sequenced and assembled so that later samples, many of which will not be completely sequenced, can be compared to them. “You want to be able [to] match it up with a picture you already have,” she said.
Some of the methods used to compare common genes within microbial families, such as pattern-matching of 16S ribosomal RNA for phylogenetic analysis, are “still in infancy,” said Bonazzi, and effective tools to analyze these sequences are lacking.
Assembly tools have been based on clean samples of identified species that were sequenced with Sanger methods. “The informatics tools were designed to handle that kind of data,” she said. “You are going to have much shorter fragments from the new sequencing technologies, you must assemble [those] and you have mixed populations of data, pieces of DNA from various microbes.”
Bonazzi explained that assembling short reads across a complex sample that contains many organisms and different strains can lead to “synthetic chimeras” and “spurious data” which do not reflect the biological sample. The awardees are now looking at “how do you think about this problem from a perspective of new sequencing technologies?” she said.
For example, new methods of binning prior to assembly can help researchers decide how to assemble the genomes. “It can reduce a level of complexity,” she said.
New assemblers and gene prediction tools are needed that can work with complex samples and low-coverage sequence rather than a constructed, assembled genome, she said. “A lot of the time we are not going to have full genomes … so you want to have gene finding tools that can work with genes with holes in them.”
Bonazzi said it will also be necessary to integrate phylogenetic profiling of bacterial families with group-based analysis of gene fragments, so that scientists will be able to classify sequences from a common species and find patterns of genomic signatures in samples.
“How can we take the old methods and move them into the 21st century?” she asked. Extending old tools and scaling them may not always work. “A lot of these tools will break, that is common, since they are only designed for smaller [data] volume,” she said.
Visualization and data analysis are also going to play a role in metagenomics. “It’s pretty abstract if you have these gnarly statistical hairballs that only a few people know how to use,” she said.
Overall, she and her colleagues hope to ensure that funding for computational tools development, including assembly, analysis, data mining, and visualization, is not an afterthought in metagenomics.
Across NIH, particularly at NHGRI, “they have realized we need a very strong marriage between biology and computing,” she said. “As we generate a tremendous amount of data, you have to think about the tools you need concomitantly with the data you generate. Putting it into an Excel spreadsheet doesn’t work.”
For the HMP, the tools must be adapted for mixed microbial populations to allow scientists to query the data in the context of the microbial diversity in their samples and across samples. This integration will allow researchers to correlate their findings to possible indicators of wellness and various types of disease states.
“Mihai Pop and some of the other researchers are examples of people who have been working in this area who now have the opportunity, through funding from NIH to do some really good work, rather than end up having to do that on the side,” she said.
‘Zero’ Software for Assembly
Pop told BioInform that some of the current tools available to genomics researchers are partially applicable to metagenomics. For example, he said that several web-based applications and databases, such as Greengenes and the Ribosomal Database Project, exist to compare 16S ribosomal RNA sequences extracted from environmental samples.
However, metagenomics presents its own set of challenges. “It has been easier to sequence one gene relatively deeply from an environment rather than to sequence everything that is present in an environment,” he said.
Assembly of metagenomics samples is a particular challenge he plans to take on. “This is an area where there is very little in terms of software. As far as I know there is zero software for doing the assembly,” he said. “Everybody uses tools that were developed for single organisms.”
Using software developed for single-organism sequencing and assembly for a population hides some important data. “It essentially hides the fact that there is population structure, that there are two different strains of the same organism that initially looked the same,” he said.
Currently available assemblers collapse the data into a mosaic structure, a consensus chromosome, he said. “A lot of the interesting biological phenomena get hidden and it is actually pretty hard to go back into the data and extract what is interesting.”
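The effect Pop describes can be illustrated with a toy example. The sketch below is not how any particular assembler works; it assumes the reads are already aligned into equal-length rows (with "-" marking uncovered positions) and come from two hypothetical strains that differ at one base. A majority-vote consensus collapses them into one sequence, while the per-column base counts still show the mixed site. The function names and the 0.2 minor-allele cutoff are illustrative assumptions.

```python
from collections import Counter


def column_counts(aligned_reads):
    """Count bases in each alignment column, ignoring '-' (uncovered)."""
    length = len(aligned_reads[0])
    return [
        Counter(read[i] for read in aligned_reads if read[i] != "-")
        for i in range(length)
    ]


def consensus(counts):
    """Majority-vote consensus; 'N' where no read covers the column."""
    return "".join(c.most_common(1)[0][0] if c else "N" for c in counts)


def polymorphic_sites(counts, min_minor_fraction=0.2):
    """Report columns where a second base reaches the given fraction."""
    sites = []
    for pos, c in enumerate(counts):
        total = sum(c.values())
        ranked = c.most_common(2)
        if len(ranked) == 2 and ranked[1][1] / total >= min_minor_fraction:
            sites.append((pos, dict(c)))
    return sites


# Hypothetical pre-aligned reads from two strains differing at column 3.
reads = [
    "ACGTAC--",
    "ACGTACGA",
    "--GAACGA",
    "ACGAAC--",
    "--GTACGA",
]

counts = column_counts(reads)
print(consensus(counts))          # one collapsed sequence
print(polymorphic_sites(counts))  # the site the consensus hides
```

Here the consensus reports a single base at column 3 even though two of the five reads disagree, which is the kind of population structure a single-organism consensus hides.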
In field studies, Pop said, scientists have encountered cases in which organisms had identical 16S sequences but fulfilled utterly different functions and there was indication these may have even been separate species. “So the 16S is an approximation,” he said.
Genome structures within a microbial community differ in subtle ways, with phage inserts or mutations that affect population dynamics, he said. For studies of antibiotic resistance in bacteria, new tools that expose these genome structures may show to what degree bacteria harbor genes that confer resistance. “The new assembly tools will give you the information to ask these questions,” he said.
Pop and his colleagues previously developed AMOS, an open source framework for genome assembly and gene finding algorithms. Now, with this new grant, he plans to extend the framework to metagenomics.
For example, AMOScmp, a comparative sequence assembler that maps shotgun reads to a reference sequence, might come into play. Given the reference genomes to be generated in the Human Microbiome Project, shotgun data from an environmental sample can be mapped to a reference genome. What is missing is an evaluation of the comparison, he said.
“The key here is that we want to find out what are the differences between the organisms in the environment and the organisms in these reference sequences,” he said. “That is something that is missing in general.”
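The general idea of mapping reads to a reference and then listing the differences can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AMOScmp's actual algorithm: it uses exact k-mer seeds and ungapped comparison only, so it reports substitutions but cannot detect insertions or deletions. The reference, the reads, and all function names are hypothetical.

```python
def build_kmer_index(reference, k):
    """Map every k-mer in the reference to its start positions."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Return (ref_start, mismatches) for the best ungapped placement.

    Each mismatch is (reference_position, reference_base, read_base).
    Returns None if no placement has at most `max_mismatches`.
    """
    best = None
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for ref_pos in index.get(seed, []):
            start = ref_pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = [
                (start + j, window[j], read[j])
                for j in range(len(read))
                if window[j] != read[j]
            ]
            if len(mismatches) <= max_mismatches and (
                best is None or len(mismatches) < len(best[1])
            ):
                best = (start, mismatches)
    return best


# Hypothetical reference and reads, for illustration only.
reference = "ATGGCGTACGTTAGCCTAGGATCCA"
k = 5
index = build_kmer_index(reference, k)

for read in ["GCGTACGTTAGC", "TAGCCTTGGATC", "TTTTTTTTTTTT"]:
    print(read, map_read(read, reference, index, k))
```

The first read matches the reference exactly, the second maps with a single substitution, and the third does not map at all. Reads in that last category are the ones that would need a de novo approach.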
Scientists may only want to home in on a low level of genetic variation. “It’s a dirty little secret in microbial genomics,” he said, that while there are many sequenced microbial genomes currently held in databases, “the reality is that there are thousands of variants of each strain and the one in the database is just one that managed to grow under certain conditions.”
Currently, AMOScmp displays sequence differences but stops short of analyzing them. “We don’t have a good way of explaining what the differences are, so that is something we are adding,” he said.
For example, one strain of E. coli may have small insertions. The current version of AMOScmp might miss them, depending on which reference genome was used. In its next iteration, the software will be able to give users a more detailed analysis. “It will be [able] to detect inserted sections and [tell users] that based on all these other reads, we think these other reads also match in this region: it’s a combination of a comparative approach and a de novo approach for assembly,” Pop said.
That feature also leads to another facet of his planned work, which is gene prediction. This project will yield plenty of sequences that do not map to any reference. “The question is, can you group them together somehow?” he said, adding that assembly might help discover how to group them.
“We [are] trying a variety of extensions to assembly algorithms to be able to look at population structures,” he said.
Other tools that he wants to extend and shape for metagenomics include Minimus, an assembly pipeline for small datasets that his team used for the flu genome; and Bambus, a scaffolding program for orienting contigs.
Overall, Pop explained, he would like to be able to offer scientists a comprehensive set of metagenomics software tools. If a researcher gives his team a set of reads from a given environment, the team would be able to run AMOScmp and find any known organisms that are present and build assemblies of those.
“For the sequences for which we don’t have reference genomes we will build a de novo assembly of those sequences and then for all of these we will run gene finders to find where the genes are and then provide the biologists with a list: Here are a set of contigs in our data, here are genes, and here are regions that indicate that there is genomic variation between organisms that are very closely related to each other.”
He also said he wants to develop binning algorithms that look at contigs and determine which are likely to have come from the same organism. Among the detectable similarities are, for example, indications of phage insertions, he said. Another option might be to screen the findings against known antibiotic resistance factors.
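The article does not say how Pop's binning algorithms would work. One widely used family of approaches groups contigs by oligonucleotide composition, and the sketch below illustrates that idea only. It is an assumption for illustration, not his method: it compares dinucleotide frequency profiles and greedily assigns each contig to the first bin whose founding contig is within a distance threshold. The contig sequences, the threshold of 0.2, and the function names are all hypothetical.

```python
import math
from itertools import product


def kmer_profile(seq, k=2):
    """Return the normalized frequency of every k-mer over A, C, G, T."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[kmer] / total for kmer in kmers]


def distance(p, q):
    """Euclidean distance between two composition profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def bin_contigs(contigs, threshold=0.2, k=2):
    """Greedy binning: join the first bin whose founder is close enough."""
    bins = []  # each entry is (founder_profile, [contig names])
    for name, seq in contigs.items():
        profile = kmer_profile(seq, k)
        for founder, members in bins:
            if distance(profile, founder) <= threshold:
                members.append(name)
                break
        else:
            bins.append((profile, [name]))
    return [members for _, members in bins]


# Hypothetical toy contigs: two AT-rich and two GC-rich.
contigs = {
    "contig_a": "ATATTTAAATATTAAATTTATA",
    "contig_c": "GCGCCGGGCCGCGGCCGCGCGG",
    "contig_b": "TTATAAATATTTAATATAAATT",
    "contig_d": "CCGCGGGCGCCGGCGCGGCCGC",
}

print(bin_contigs(contigs))
```

Real contigs are far longer and real communities far less cleanly separated, so practical binning typically uses longer k-mers and additional signals such as read coverage.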
For a few years now, he said, the need for new tools was known but only recently has the metagenomics climate shifted. “There have been enough publications that people are realizing the potential of the field and they are realizing you need to develop all the software,” Pop said. “A huge chunk of the environment can be sequenced and it is affordable.”
Other Metagenomics Tools
As the metagenomics data storm begins to whirl, one question is whether other informatics resources, such as the Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis, or CAMERA, can be brought to bear on the HMP.
CAMERA 1.0 was launched in early 2007 with $24.5 million in funding from the Gordon and Betty Moore Foundation with the aim of supporting metagenomics studies such as the JCVI’s Global Ocean Sampling expedition [BioInform 03-16-07].
Mark Ellisman, professor of neurosciences and bioengineering and director of the Center for Research in Biological Systems at the University of California, San Diego, and CAMERA’s chief technology officer, told BioInform this week that the CAMERA tools are primarily geared toward serving the community of researchers in the area of marine and soil ecology.
“Because of the investment in acquiring metagenomic data, because of the investment in obtaining complete genomes of interesting microbial species in the ocean or soil samples, we are able to create with the funds in CAMERA a family of metagenomic analysis tools that we think will be unrivaled,” said Ellisman.
Scientists can do real-time Blast runs from the CAMERA site, he said. “Instead of waiting in a queue, we needed to enable the people who came to the site to be able to get high-throughput quick turnaround results on larger Blast runs than are normally available to the community,” he said, adding that the site gives users on-demand supercomputer access.
One potential drawback for the CAMERA database is that it requires registration to access the data and tools — a barrier that many in the bioinformatics community who are used to open tools from NCBI and elsewhere may view as cumbersome. Ellisman said that the site is “completely open” but some data is governed by agreements JCVI made regarding samples from territorial waters. Users must register to use the tools in order to give the CAMERA group feedback about the tools, he explained.
“We are now moving, and we will next year release what you can think of as a complete community cyberinfrastructure where one can stitch in different tools in pipelines to do more complicated processing using the CAMERA environment,” said Ellisman.
Separately from CAMERA, Ellisman and his colleagues John Wooley and Paul Gilna are working on what he described as the “next-generation of resources for microbial metagenomics,” though he did not provide further details.