By Vivien Marx and Bernadette Toner
Last October, the National Institutes of Health awarded a $9.9 million, five-year grant to a team led by Owen White of the University of Maryland School of Medicine to establish a data-analysis and -coordination center for the Human Microbiome Project.
Now, just over a year later, the HMP DACC has many of the pieces in place to serve as the central repository for all HMP data, though some aspects of the complex project remain "in flux," according to Michelle Giglio, assistant professor at the Institute for Genome Sciences at the UM School of Medicine.
"There are really a bunch of pipelines and dataflows associated with this project, not just one," Giglio told BioInform via e-mail. Rather, she said, there are separate data-analysis pipelines associated with the three main activities of the HMP: reference genomes, 16S rRNA sequence analysis, and whole-metagenome analysis.
"Most of these pipelines are still in flux as we just start to get 16S and whole-metagenome data," Giglio said. "There are things in place for the reference genomes that were set up before we started on the project, but we are now in the process of developing a new pipeline for that as well."
Giglio said that there are around 20 staffers affiliated with the DACC, though some of those researchers only spend part of their time on the effort. In addition to the UM team, the DACC also includes several subcontract teams at Lawrence Berkeley National Lab, led by Victor Markowitz and Gary Anderson; the Department of Energy's Joint Genome Institute, led by Nikos Kyrpides; and the University of Colorado, Boulder, led by Rob Knight.
According to the grant abstract for the project, the first aim of the HMP DACC is to create a Human Microbiome Data Store that will contain all data collected from the HMP data-generation centers, controlled vocabularies, a Human Microbiome Project Catalog tracking system, and links to standard operating procedures.
Another goal is to develop a "comprehensive computational analysis pipeline" that will include "an initial core set of elements and will be expanded over time as new tools useful to metagenomic analysis are developed either at the DACC or elsewhere." The DACC also set out to develop a data-integration and -analysis system, or DAIS, "which will employ numerous data reduction and integration systems as well as numerous data exploration tools that will be based on similar existing resources," such as JGI's IMG (Integrated Microbial Genomes) and IMG/M (Integrated Microbial Genomes System for Metagenomes) databases.
White explained to BioInform via e-mail that there is not too much need for the DACC to develop new analytical or annotation tools. "For the most part, many annotation methods that identify genes, determine their function, and are used in the comparison of one sample to another are being repurposed from previous studies developed here and elsewhere," he said.
He added, however, that one "rapid" area of software development "is the creation of sensitive statistical methods to identify common and unique elements found in microbiome samples. There are many variables associated with microbiome samples, and the techniques required to find true correlations in this data is a very exciting area of research."
The DACC is furthest along in putting resources in place for the reference genomes, an effort to sequence a collection of microbial genomes that can serve as a benchmark for comparing novel sequence data. Giglio said that there are currently 1,227 strains in the HMP Project Catalog, and of these, "650 are somewhere in the sequencing process," while the others are targeted for sequencing.
Of the 650, "284 are 'complete' — that is, no more sequencing or finishing will be done for them," Giglio said. She added that most of these genomes that are considered complete are still drafts, however, and that only around 15 percent of the HMP reference genomes will be "taken to higher levels of finishing."
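The figures Giglio cited can be put in proportion with some simple arithmetic. The short Python sketch below is illustrative only: the variable names are hypothetical, and the derived numbers follow from the three counts quoted above rather than from any DACC report.

```python
# Counts quoted by Giglio in this article.
catalog_strains = 1227   # strains in the HMP Project Catalog
in_sequencing = 650      # strains "somewhere in the sequencing process"
complete = 284           # strains for which sequencing and finishing are done

# Derived values (our arithmetic, not figures supplied by the DACC).
targeted_not_started = catalog_strains - in_sequencing
share_in_sequencing = round(100 * in_sequencing / catalog_strains, 1)
share_complete = round(100 * complete / in_sequencing, 1)

print(f"Targeted but not yet in sequencing: {targeted_not_started}")
print(f"Share of catalog in sequencing: {share_in_sequencing}%")
print(f"Share of in-process strains that are complete: {share_complete}%")
```

By this arithmetic, 577 catalog strains had not yet entered sequencing, about 53 percent of the catalog was in the sequencing process, and roughly 44 percent of those in-process strains were considered complete.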
The HMP Project Catalog is a database for the reference genomes that is based on the GOLD (Genomes Online Database) data structure and managed by Nikos Kyrpides, GOLD's coordinator and data curator, along with LBL's Konstantinos Liolios.
Giglio explained that the database tracks 50 fields of information for each microbial strain, including information about the organism, project status, sequencing information, metadata about the organism, and links to additional resources with more information on the organism.
The complete genomes are also available in an HMP-specific section of JGI’s IMG database, where they can be compared with hundreds of other microbial genomes.
While "exactly who puts what data where is still being finalized for many of the data flows," Giglio said that "this is pretty well worked out for the reference genomes."
Under the current workflow, the DACC registers all HMP reference genomes with the National Center for Biotechnology Information, and then the four sequencing centers — the Baylor Human Genome Sequencing Center, the Broad Institute, the J. Craig Venter Institute, and Washington University — submit their sequence and annotation to NCBI when they are done with the genome.
"The DACC then pulls information from NCBI to display on our website and to include in the IMG analysis," Giglio said. While the centers only provide a small amount of annotation with their genomes, such as the gene product name, the DACC is responsible for adding more comprehensive annotations to each gene, including EC number and Gene Ontology terms, Giglio said.
Giglio said that the DACC team has performed quality-control analyses for the reference genomes for assembly metrics, and has performed pangenome analyses and identified unique genes. "This information is part of a manuscript that was just submitted for publication from all HMP people involved in the reference genomes work," she said.
16S rRNA, Metagenomics, and Beyond
Giglio said that the HMP participants "are still trying to finalize the flow of data and to establish what will be the responsibility of the DACC and what will be the responsibility of the centers."
While the effort is "quite advanced" when it comes to the reference genomes, Giglio said that the DACC is "still working this out for the 16S data, which is just now starting to be generated by the centers, and it will remain to be established for the whole-metagenomic data."
She added that the "heavy lifting" on the 16S and metagenomic-analysis pipeline development has been done by Gary Anderson and Todd DeSantis at LBL and Rob Knight at the University of Colorado.
The HMP is sequencing 16S ribosomal RNA, a component of the small subunit of the prokaryotic ribosome that is highly conserved across bacteria, in order to characterize the complexity of microbial communities at different sites of the human body and to determine whether there is a "core microbiome" at each site.
Variable regions of 16S rRNA are routinely used to classify organisms according to phylogeny and are particularly useful in metagenomics to help identify what taxonomic groups are present in a given sample and in what abundance.
HMP researchers will perform 16S rRNA sequencing on 250 healthy adult men and women between the ages of 18 and 40, recruited at Baylor College of Medicine and Washington University. Clinical specimens will be collected from five body sites: mouth, skin, nose, gastrointestinal tract, and vagina.
So far, Giglio said, the DACC has established a website for the centers to submit 16S metadata and traces for further analysis. In addition, she said that they have established file format standards for capturing the minimal metadata about nucleic acid preparations and library construction.
The workflow for metagenomic whole-genome shotgun sequencing, which will be performed on the same human subjects as the 16S rRNA sequencing, has yet to be determined, Giglio said, but she noted that it will "likely be similar to what is set up for 16S."
One key to the effort is capturing metadata about sampling processes, lab protocols, and human subjects. "We have established certain minimum data requirements that must be captured about certain lab practices," Giglio said.
"Equally important to all of this is establishing a defined system of identifiers so that when one specimen produces many samples, and each of those samples produces many libraries, all of these can be correctly linked to each other so that all data from a single specimen can be reliably retrieved and tracked."
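The linking Giglio describes can be pictured as a hierarchy of identifiers in which every library points back to a sample and every sample points back to a specimen. The Python sketch below is a minimal illustration under stated assumptions: the `IdRegistry` class, its methods, and the ID format (such as `SPEC1.S1.L2`) are hypothetical and are not drawn from the DACC's actual system.

```python
class IdRegistry:
    """Hypothetical registry linking specimens, samples, and libraries."""

    def __init__(self):
        self._parent = {}    # ID -> parent ID (None for a specimen)
        self._children = {}  # ID -> list of child IDs, in registration order

    def register(self, new_id, parent=None):
        """Record a new ID, optionally as a child of an existing ID."""
        if new_id in self._children:
            raise ValueError(f"duplicate identifier: {new_id}")
        if parent is not None and parent not in self._children:
            raise KeyError(f"unknown parent identifier: {parent}")
        self._children[new_id] = []
        self._parent[new_id] = parent
        if parent is not None:
            self._children[parent].append(new_id)

    def specimen_of(self, any_id):
        """Walk up the parent links to the originating specimen."""
        while self._parent[any_id] is not None:
            any_id = self._parent[any_id]
        return any_id

    def descendants(self, any_id):
        """Return every sample and library derived from an ID."""
        found = []
        stack = list(reversed(self._children[any_id]))
        while stack:
            current = stack.pop()
            found.append(current)
            stack.extend(reversed(self._children[current]))
        return found


# Example: one specimen yields two samples, which yield three libraries.
registry = IdRegistry()
registry.register("SPEC1")
registry.register("SPEC1.S1", parent="SPEC1")
registry.register("SPEC1.S2", parent="SPEC1")
registry.register("SPEC1.S1.L1", parent="SPEC1.S1")
registry.register("SPEC1.S1.L2", parent="SPEC1.S1")
registry.register("SPEC1.S2.L1", parent="SPEC1.S2")

print(registry.specimen_of("SPEC1.S1.L2"))
print(registry.descendants("SPEC1"))
```

Storing an explicit parent link for each identifier, rather than relying on the text of the ID alone, means that all data from one specimen can be retrieved even if naming conventions change between centers.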
Giglio noted that the DACC is "trying to put as much of this in place before we start getting data as possible," but acknowledged that this cannot always be done. In those cases, "we will retroactively apply standards to data whenever possible."
She added that the HMP DACC is working with the Genomic Standards Consortium "to try to employ their standards whenever possible and to contribute to the development of their standards so that they are applicable to HMP data."