The advent of next-generation sequencing technology helped members of the Human Microbiome Project realize the goal of developing a resource representing microbial communities associated with healthy humans, according to HMP consortium member Bruce Birren.
Both 454 and Illumina sequencing methods made it possible to "generate the largest metagenomic data set yet produced," explained Birren, who directs the Broad Institute's genomic sequencing center for infectious diseases and co-directs its genome sequencing and analysis program.
Birren was speaking during a telephone press briefing last week announcing the publication of a collection of HMP studies in Nature, Genome Biology, and Public Library of Science journals.
"To really do sequencing of communities of microorganisms — whether you're doing targeted sequencing of 16S ribosomal RNA or you're doing shotgun sequencing to get entire genomes or better sampling of entire genomes — it wasn't until next-generation sequencing came along that you could even consider that," added George Weinstock, associate director of Washington University's Genome Institute and leader of that center's HMP efforts.
"Once next-generation sequencing was available, suddenly the whole world of metagenomics opened up," he told In Sequence, "and that was the real tipping point for this whole field."
And because sequencing technology has advanced so quickly, researchers are now looking at deploying techniques such as single-cell sequencing to further study some of the most interesting HMP samples, and taking advantage of analytical methods developed during the HMP in ongoing studies on relationships between the microbiome and disease.
Hundreds of researchers from 80 different research centers have participated in the HMP, an effort launched in late 2007 to characterize microbial communities in and on the human body.
Researchers at HMP clinical centers based at Washington University and Baylor College of Medicine tested 300 healthy volunteers from St. Louis and Houston for the study, sampling 15 to 18 specific sites representing the oral and nasal cavity, urogenital tract, skin, and gut between one and three times per person.
At sequencing centers based at Washington University, the Broad Institute, the J. Craig Venter Institute, and the Baylor College of Medicine, HMP members relied on Roche 454 16S ribosomal RNA sequencing to look at bacterial diversity and abundance of microbial communities in each of 12,000 samples collected from the healthy participants. A subset of these samples was also tested by Illumina GAIIx metagenomic sequencing, which provides a peek at the complete gene content in a given microbial community.
In addition, the HMP consortium has thus far sequenced more than 800 new microbial reference genomes in its effort to plump up reference databases and help interpret its other sequence data.
With the HMP now in the last of its five years, researchers at HMP-affiliated sequencing centers are not only wrapping up sequencing and analysis on samples from the last few dozen healthy individuals, but are also turning to the Illumina HiSeq 2000 to generate metagenomic sequence data on a few hundred extra samples.
They are also continuing to ratchet up the microbial reference genome count and are on track to bring the number of reference genomes sequenced through the HMP to around 3,000 by late this year or early next.
In the meantime, HMP members have already started churning out a slew of studies, including 17 new papers out last week alone, which described microbiome patterns in the first 242 healthy individuals assessed by HMP researchers and provided details of the sequencing and data analysis strategies developed for the effort. A few of the new publications also offered a look at some of the HMP demonstration projects that are underway to find microbiome shifts linked to various health and disease states.
Both the quality-controlled datasets used for the current studies and the reference genome sequences generated so far have been released through the HMP's Data Analysis and Coordination Center, or DACC.
The HMP is supported by a $153 million investment from the NIH Common Fund. Other institutes within the NIH have kicked in another $20 million for the effort. Even with these funding commitments, though, those involved in the HMP say it would not have been possible without a dramatic drop in sequencing costs over the past few years.
"Because of the massive number of samples involved, an effort like the Human Microbiome could not have been considered if the cost of DNA sequencing had not plummeted in recent years," Eric Green, director of the National Human Genome Research Institute, told reporters during last week's telebriefing.
"When we started the [HMP] in 2007," he added, "we suspected that the new DNA sequencing machines that were just arriving in genomic laboratories would profoundly reduce the cost of DNA sequencing to the point of making large-scale microbiome research feasible."
Despite the importance of high-throughput sequencing during the main phase of the project, though, the earliest HMP reference genome sequencing and 16S pilot efforts relied on Sanger sequencing. It wasn't until shortly before the HMP secured its designation as a Roadmap program in 2007 that the team started moving toward a wholesale switch to Roche 454 for 16S rRNA sequencing efforts (IS 6/19/2007).
"When the HMP was being conceived in 2006, and even when the funding started in 2007, next-generation sequencing was not widely applied," Weinstock explained.
"Everybody believed that for 16S sequencing, you had to do full-length sequencing of 16S ribosomal RNA genes," he added, "because that's the only way you could get a really accurate definition of what a species was."
Long Sanger sequences representing all, or almost all, of the 1,500 bases found in the 16S rRNA gene sequence targeted for bacterial barcode sequencing were especially appealing because HMP members had aspirations of coming up with sequence that could serve as a reference for future microbiome efforts, added Broad Institute researcher Dirk Gevers, one of the HMP Data Analysis Working Group co-chairs.
On the other hand, the group wanted to be able to look at many, many samples — a feat that was difficult and expensive prior to the introduction of second-generation sequencers.
The decision became a bit easier with the advent of the Roche 454 Titanium, which offered both throughput and reads on the order of 400 to 500 base pairs, Gevers told IS. In the HMP's second year, researchers transitioned to that platform, which would eventually be used to generate 16S sequence data for all 12,000 healthy participant samples.
"We rapidly saw in the first years of the project that 454 sequencing became accepted as a way of doing 16S ribosomal RNA [sequencing]," Weinstock noted, "so all of the 16S sequencing that was done for the project was done using the 454 platform."
Even with the availability of 500 base pair Titanium reads, though, researchers had to spend a good deal of time figuring out which stretches of the 16S gene would be most informative and designing primers to produce the corresponding amplicons.
"When we were considering [454] Titanium, we wanted to generate amplicons that were maximally benefiting that sequence length," Gevers said. "So that's why we went looking for primers that would result in an amplicon of 600 bases or around that size."
The team soon realized that the need for standardization applied to more than just the primer pairs used to produce amplicons, Gevers noted, with early analyses hinting at the potential perils of trying to use data that had not been prepared, sequenced, and analyzed in a standardized way.
After a few months of hammering out a standardized 16S sequencing protocol during weekly telephone conferences, representatives from the four HMP sequencing centers began what Gevers called a "big crank" of sequencing in the spring of 2010, when much of the existing HMP 16S sequence data was produced.
Around the same time, the team also started to do metagenomic sequencing on a subset of the HMP samples, using the Illumina GAIIx to generate far more sequence data per sample than was the norm at the time. Whereas most metagenomic studies relied on a few gigabases of sequence per sample, Gevers explained, HMP researchers set their sights on around 10 gigabases — or two lanes of GAIIx sequence data — per sample.
So far that strategy has produced around eight terabases of metagenomic sequence data representing microbial communities from six body sites. By the end of the year, the HMP team plans to sequence several hundred more samples from the healthy cohort, now using the HiSeq 2000.
"We saw during the course of the project the tremendous increase in throughput per Illumina flow cell, as well as the decrease in cost per sample," Weinstock said. "The result of that was that we were able to do this very deep sequencing on, at this point, over 1,000 samples."
Advanced Analysis
Just as it has benefited from the advent of new sequencing technologies, the HMP has also bolstered the computational methods available for dealing with the various data types generated for the effort, according to those involved.
That was particularly true when the project mushroomed to include investigators with interest and/or expertise related to microbiome analysis from dozens of centers beyond the original HMP clinical and sequencing centers and the DACC, headquartered at the University of Maryland.
"There were really some remarkable things that had to be done in terms of managing this large dataset and doing the computes on it in a timely fashion," Weinstock said. "That was another big achievement that would not have happened without the needs of the HMP."
For the reference genomes, for example, researchers have had to come up with ways of handling the new sequence data types and to develop more automated genome annotation pipelines, while the switch from Sanger to 454 sequencing for 16S experiments necessitated changes to help deal with the distinct error models of the two approaches.
Several new methods had to be developed to deal with metagenomics-related computational problems, too — from removing human reads that contaminated the data to figuring out how to search terabases of shotgun sequence data against existing reference databases to coming up with ways to unravel the metabolic capabilities of a microbial community based on its gene content.
Despite the project's success so far, though, there are some secondary goals from the HMP study of healthy individuals that have not been realized, Weinstock noted, including transcriptome sequencing on samples from healthy individuals.
There has also been far less work done to unravel the viral and eukaryotic components of healthy human microbiomes than has been done to catalog bacterial members of these communities, he added, noting that viral and eukaryotic reference sequence studies have lagged behind somewhat as well.
Future Directions
Still other challenges remain as the project moves forward. Now that they have a much better idea of the organisms that are found in and on healthy individuals, for example, Ashlee Earl and her colleagues at the Broad Institute have homed in on a set of interesting but uncharacterized organisms for follow-up studies.
While these 'most wanted' organisms are identifiable from existing sequence data, Gevers explained, characterizing them in more detail will likely hinge on advances in microbe culturing or single-cell sequencing approaches.
Some HMP members have already started working on methods that may help, Weinstock noted, including JCVI researcher Roger Lasken, who has attempted to do sorting and single-cell sequencing on uncultured bacteria in HMP samples (IS 9/20/2011).
"I suspect, all in all, that we'll do 200 or 300 of those kinds of samples," Weinstock said.
"The problem so far has been that the quality of the draft sequence that you get from these is just not as high as you get from a cultured sample," he added. "Because this method relies a lot on whole-genome amplification of the single-cells … you get a lot of biases in which regions get amplified and don't."
While such biases can hinder genome assembly, the availability of sequence data on uncultured organisms is expected to help tie a given sequence to its microbial source in various metagenomic mixes.
Down the road, there is also a long way to go in understanding the interplay between human genes and the microbial communities found at sites across the body, another area that the HMP has been interested in since its outset.
Participants in the HMP have already had blood samples drawn and have consented to such analyses. And although researchers have not yet done any genetic studies using DNA from these blood samples, they do have genome sequence data for many of the healthy HMP participants, generated as a byproduct of their microbial metagenomic sequencing work.
In one of the Nature studies out last week, HMP consortium members noted that nearly half of the metagenomic reads generated for HMP were human contamination.
The proportion of human DNA sequence in the metagenome varied dramatically depending on the body site sampled, researchers reported, with stool samples containing almost exclusively microbial DNA and skin samples yielding mainly human reads, for example.
"The coverage that you get from these [samples] is different from the different body sites," Weinstock explained. "But certainly for an individual you can combine all of their human sequences together and have some coverage so that you can do some genotyping."
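The pooling Weinstock describes comes down to simple arithmetic, sketched below under stated assumptions. The function name, the rounded genome size, and the per-site yields are all hypothetical values chosen for illustration. They are not HMP figures and this is not the consortium's method.

```python
# Approximate haploid human genome size in bases (a rounded assumption for this sketch).
HUMAN_GENOME_BASES = 3_100_000_000


def pooled_human_coverage(human_bases_by_site):
    """Return approximate fold coverage after summing human-derived bases across body sites."""
    total_bases = sum(human_bases_by_site.values())
    return total_bases / HUMAN_GENOME_BASES


# Hypothetical human-derived sequence yields for one participant, in bases.
# Illustrative only: stool contributes little human DNA, skin contributes much more.
example_participant = {
    "stool": 50_000_000,
    "skin": 2_000_000_000,
    "oral": 4_150_000_000,
}

coverage = pooled_human_coverage(example_participant)
print(f"Pooled human coverage: {coverage:.1f}x")
```

No single site in this made-up example yields much coverage on its own, but the sites combined reach roughly twofold coverage. Low-depth pooling of this kind is what makes participant-level genotyping from metagenomic data conceivable.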
That realization prompted the development of new methods for filtering out human sequence reads prior to placing HMP metagenomic sequence data in publicly available databases, Amy McGuire, associate director of research at Baylor College of Medicine's Center for Medical Ethics and Health Policy, noted during last week's telebriefing.
To maintain participant privacy, McGuire explained, "project leaders decided to put any human DNA that was analyzed into a control database so that only approved researchers could access it."
Although it is absent from publicly accessible data repositories, researchers have started tapping into the human genetic information to begin looking at how human variants relate to microbiome features.
At the Biology of Genomes meeting last month, for example, Cornell University researcher Ran Blekhman reported that he and his colleagues have been able to cobble together genome sequence information for about 100 HMP participants, identifying SNPs that seem to coincide with microbiome features on the skin and in the gut (GWDN 5/10/2012).
Such efforts are expected to continue — as are demonstration projects focused on finding microbial community patterns that coincide with diseases such as Crohn's disease, bacterial vaginosis, and necrotizing enterocolitis.
While there are murmurs about the possibility of an HMP2 effort, the NIH has not announced whether it will fund a second phase once the initial five-year project wraps up.
Regardless, researchers say it seems likely that specific institutes within NIH will continue to fund microbiome studies to answer health- and disease-related questions that are relevant to the broader goals of each institute.
"At this point, many of the institutes have, or will soon have, microbiome projects," Weinstock explained. "If you didn't have an HMP2, you would still have a lot of organized HMP programs at NIH under these different institutes."
Gevers, too, said it seems likely that efforts related to HMP will continue given the growing interest in and awareness of microbiome research both in the US and internationally.
"One thing is sure: the research will continue and there are many more areas to go in than we already have tackled today," he said.