Gene Co-variance Approach Shows Promise for Identifying, Assembling Microbial Genomes from Metagenomic Sequences


NEW YORK (GenomeWeb) – An international team that includes members of the "Metagenomics of the Human Intestinal Tract," or MetaHIT, consortium has come up with a new method for putting together microbial genomes and identifying species from metagenomic sequence data.

The co-abundance-based approach begins by grouping sets of genes that tend to occur together across multiple microbiomes. These co-abundance gene groups (CAGs) can then be used not only to identify species present in microbiome samples, but also to guide the assembly of genomes from strains found in a given microbiome.

In a proof-of-principle study appearing online this weekend in Nature Biotechnology, the researchers applied this approach to metagenomic sequence data generated from hundreds of European fecal samples. Those gut microbiomes contained nearly 7,400 CAGs, they reported — gene clusters that helped in constructing hundreds of high-quality microbial genome assemblies.

The study's co-senior author Dusko Ehrlich, a researcher affiliated with the French National Institute for Agricultural Research (INRA) and King's College London, told In Sequence that the method is a "tremendous improvement" in the metagenomics field.

In particular, he argued that the approach should allow for advances in functional studies of microbial communities by providing a more complete look at the bacteria found in each microbiome and their interactions and inter-dependencies with other genetic factors.

The new metagenomic sequence analysis method was largely motivated by an effort to advance MetaHIT, a European effort aimed at finding gut microbiome features that are related to various health or disease-related traits.

For example, members of MetaHIT have considered gut microbial patterns associated with conditions such as inflammatory bowel disease and obesity — work that revealed a decline in microbial species richness and gene diversity in those prone to inflammatory bowel disease or obesity.

In an earlier study published in Nature, the MetaHIT team presented a reference catalog of 3.3 million microbial genes detected in the human gut, a set identified through metagenomic sequencing on samples from 124 European individuals.

Efforts to use that reference gene set as a resource for finding specific gut microbe genes associated with obesity, IBD, or other human traits have proven computationally and statistically tricky, Ehrlich explained.

"If we could assign these millions of genes to thousands of objects — the genomes that carry these genes — we would have a much more detailed view of our microbiome and be better able to assess its potential impact on our health and disease," he said.

By matching the genes swirling around in the microbial community mixture to specific organisms, the team also anticipated an opportunity to get a better sense of the viruses, plasmids, and other small genetic elements associated with these microbes.

For microbes with existing genome assemblies, it's a simple task to match metagenomic reads to the reference, Ehrlich noted. But just a fraction of the genes present in most microbiome samples can be mapped to existing reference genomes, due to the bacterial diversity found in nature and the fact that many microbes are difficult to cultivate and/or are relatively rare.

Researchers have previously explored a variety of strategies to get around this problem and put together genomes directly from metagenomic sequences. In a study published earlier this year, for example, a University of Washington team used Hi-C sequencing to assemble individual genomes from metagenomic sequence slurries.

Some have attempted to pluck sequences from a single bug out of a metagenomic sequence mixture and assemble them with the help of mate-pair sequences, coverage depth data, or tetranucleotide frequency in the metagenomic sequence itself. Still other teams are turning to single-cell microbial genome sequencing, alone or in combination, to produce full genome sequences for microbes from environmental mixtures.

For their part, Ehrlich and his colleagues pursued a gene binning approach based on the notion that genes found in the same microbial genome should turn up with the same abundance in metagenomic sequence data. Consequently, they reasoned that genes from the same bug should show the same co-variance patterns in samples from each individual tested.
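The binning idea described above can be illustrated with a toy sketch. It is not the team's published algorithm: the article does not say which correlation measure or cutoff was used, so the Pearson correlation, the 0.9 threshold, the greedy seeding strategy, and the gene names and abundance values below are all assumptions made for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length abundance profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-variance signal
    return cov / sqrt(var_x * var_y)

def group_by_co_abundance(profiles, threshold=0.9):
    """Greedy grouping (an assumed strategy): each unassigned gene seeds a
    group and recruits every remaining gene whose profile correlates with
    the seed at or above the threshold."""
    unassigned = list(profiles)
    groups = []
    while unassigned:
        seed = unassigned.pop(0)
        group = [seed]
        remaining = []
        for gene in unassigned:
            if pearson(profiles[seed], profiles[gene]) >= threshold:
                group.append(gene)
            else:
                remaining.append(gene)
        unassigned = remaining
        groups.append(group)
    return groups

# Invented profiles: gene -> abundance in each of five samples.
profiles = {
    "gene_a": [10, 0, 5, 20, 1],
    "gene_b": [11, 1, 4, 19, 0],
    "gene_c": [0, 8, 9, 1, 7],
    "gene_d": [1, 9, 8, 0, 6],
}
groups = group_by_co_abundance(profiles)
print(groups)
```

In this toy data, gene_a and gene_b rise and fall together across samples, as do gene_c and gene_d, so the sketch places them in two separate groups. A real analysis over millions of catalog genes would need a far more scalable procedure than this pairwise loop.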

In their current study, for example, the researchers did Illumina sequencing on DNA isolated from hundreds of stool samples — 396 in all — that had been collected from 177 participants from Denmark and 141 Spanish participants.

The Spanish cohort contained 13 individuals with Crohn's disease and 69 individuals with ulcerative colitis, another form of inflammatory bowel disease. It also included 77 individuals who were sampled twice over a span of around six months. Meanwhile, the Danish group represented individuals across a range of body mass indexes.

After generating around 40 to 50 million Illumina reads per sample, the researchers matched these sequences to a version of the MetaHIT gut microbial gene catalog that contained almost 4 million genes, keeping tabs on the abundance of each gene in participants' gut microbiomes.

That analysis uncovered 7,381 co-abundance gene groups, a set that included co-varying genes from individual bacterial genomes as well as co-occurring viruses, genetic elements, and CRISPR defense genes. Since most bacterial and archaeal genomes contain at least 700 genes, the CAGs reaching that cutoff were designated as likely metagenomic species.
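The size cutoff lends itself to a short illustration. The sketch below assumes that "reaching that cutoff" means 700 genes or more; the CAG identifiers and gene counts are hypothetical and do not come from the study.

```python
SPECIES_GENE_CUTOFF = 700  # threshold quoted in the article

def split_cags_by_size(cag_sizes, cutoff=SPECIES_GENE_CUTOFF):
    """Separate CAGs that reach the cutoff from smaller ones."""
    species_like = sorted(name for name, size in cag_sizes.items() if size >= cutoff)
    smaller = sorted(name for name, size in cag_sizes.items() if size < cutoff)
    return species_like, smaller

# Hypothetical CAG identifiers and gene counts, for illustration only.
cag_sizes = {"CAG_1": 2150, "CAG_2": 700, "CAG_3": 58, "CAG_4": 3}
species_like, smaller = split_cags_by_size(cag_sizes)
print("likely metagenomic species:", species_like)
print("smaller groups:", smaller)
```

Under this reading, the smaller groups are where elements such as phages or other small genetic elements would be expected to fall.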

From there, the researchers tried taking the approach a step further: assembling sets of the co-occurring genes into complete bacterial genomes. To do that, they captured contigs from metagenomic sequence data using the gene clusters defined in the first stage of the analysis. This produced chunks of sequence they called "scaftigs" that could be assembled into full genome sequences using the SOAPdenovo assembler.

The team decided against pooling reads from multiple individuals for the most part, in light of findings that suggested different individuals typically carry distinct strains of various bacteria in their guts, Ehrlich noted. Instead, the group identified participants with a slew of genes from a particular species and developed genome assemblies that represented each person's particular gut microbe strains.

Using that strategy, the study's authors put together 238 high-quality genome assemblies, on par with those produced by sequencing cultured bacterial isolates based on criteria established by the HMP. Of those, 181 genomes came from bugs not sequenced previously.

Amongst the microbes that had been characterized previously was a Bifidobacterium species included in some yogurts. Using sequences from one of the 19 study participants who'd consumed the Bifidobacterium species in their food, the researchers produced a genome assembly that was 99.9 percent identical across around 95 percent of the bug's reference genome sequence.

Additional genome assemblies produced somewhat lower quality scores based on HMP criteria, Ehrlich noted, while another 100 high-quality genomes were assembled from multi-sample sequences.

The method makes it possible to determine the genome sequence for a given bug without necessarily knowing how to grow it in a lab. Conversely, though, Ehrlich noted that information from newly assembled genomes may offer clues to cultivating at least some of the bacteria.

By developing genome assemblies that represent the diverse forms of each microbe that can occur across individuals, he and his colleagues expect to get a glimpse at the core genes that are shared across strains from gut microbial species as well as the more dispensable accessory genes that may or may not be present.

Such pan-genome data is further complemented by information on the smaller genetic elements and/or bacteria-infecting phage viruses associated with each microbe.

In the 396 gut microbiome samples that were considered in the current study, the researchers used their co-variance approach to narrow in on almost 850 apparent phages — bacteria-infecting viruses that can affect the growth and persistence of their host microbes.

"Although many of the CAGs and their dependency association[s] are not understood at present," they wrote, "our findings suggest that even small CAGs represent biologically meaningful entities, either in the form of phages or clonal differences of microbial species."

The team hopes to continue improving its descriptions of organisms and related genetic elements present in microbiomes in the human gut and to better define the contours and variability that exist across strains from the same species.

Though the work described in Nature Biotechnology was done using Illumina instruments, the same approaches should be compatible with other sequencing platforms as well. For example, Ehrlich noted that he and his colleagues are currently performing similar studies using SOLiD sequencing systems.

The team estimated that at least 18 samples are needed to perform the type of gene binning and co-variance analysis described in the study, though additional samples are expected to yield better-defined gene groups.

The same reference-free methods are expected to be applicable to any type of host-associated or environmental microbiome containing microbial genes described in existing databases.

"The predicate is to have this gene catalog, which includes genes from different samples," Ehrlich said. "If you do have a [gene] catalog it will work for any samples."

MetaHIT's own microbial gene catalog continues to grow. In another Nature Biotechnology study published this week, researchers involved in the effort described an updated set that contains nearly 10 million gut microbial gene sequences identified using metagenomic sequence data on samples from around 1,300 European individuals.