New Bioinformatic Approach Enables Strain-Level Microbial Profiling From Metagenomic Data

NEW YORK (GenomeWeb) – A team led by researchers at the University of Trento in Italy has devised a bioinformatic approach to glean microbial strain-level information from metagenomic data.

As they reported today in Nature Methods, the researchers developed a software tool, dubbed pangenome-based phylogenomic analysis, or PanPhlAn, that detects and characterizes strain-specific gene content from metagenomic data. In testing the method, the researchers found that it could recognize outbreak strains and, in combination with metatranscriptomic data, profile the transcriptional activity of strains within the larger community.

"Whether pursuing strains of key pathogens or microbial species that have been under-investigated because of cultivation challenges or low pathogenic potential, PanPhlAn can characterize strain-specific gene content and transcriptional profiles within biologically relevant communities," Trento's Nicola Segata and colleagues wrote in their paper.

While metagenomic analyses enable investigators to study microbiomes without growing the microbes in the lab, they can only occasionally resolve shotgun data below the species level, the researchers noted. Doing so, they added, requires computation-intensive assembly of pooled samples, an approach that can't be applied to studies with thousands of samples or to low-abundance organisms.

Their PanPhlAn approach, by contrast, relies on an input metagenome in FASTQ format and a species-specific pangenome database. The analysis of strain-specific gene sets, the researchers said, enables a glimpse into the microorganisms' functional and pathogenic potential, while the addition of reference genomes allows new and known strains to be uncovered.

To build the pangenome for a species of interest, the tool extracts all genes from available reference genomes and combines them into gene family clusters. Based on gene family co-abundance within the metagenomic sample, it then identifies strain-specific repertoires.
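The co-abundance idea can be illustrated with a small Python sketch. It is not PanPhlAn's actual implementation: the function name, the median-based coverage plateau, and the `low`, `high`, and `min_cov` thresholds are all assumptions chosen for illustration. The sketch treats gene families whose read depth sits near the sample's typical depth as belonging to the strain, and sets aside families that are absent or present at much higher depth (for example, multi-copy families or ones shared with other species).

```python
from statistics import median

def call_strain_gene_families(coverage, low=0.5, high=2.0, min_cov=1.0):
    """Return gene families whose depth is consistent with one strain.

    coverage: dict mapping a gene-family ID to its mean read depth
    in a single metagenomic sample. All thresholds are illustrative
    assumptions, not values taken from PanPhlAn.
    """
    # Families with essentially no reads are treated as undetected.
    detected = [c for c in coverage.values() if c >= min_cov]
    if not detected:
        return set()
    # Assume the median depth of detected families approximates
    # the strain's single-copy coverage plateau.
    plateau = median(detected)
    return {
        fam for fam, c in coverage.items()
        if low * plateau <= c <= high * plateau
    }

# Toy depths for six hypothetical gene families.
example = {
    "famA": 10.2, "famB": 9.5, "famC": 11.0,
    "famD": 0.0, "famE": 45.0, "famF": 8.9,
}
present = call_strain_gene_families(example)
print(sorted(present))
```

In this toy sample, `famD` is excluded because it has no coverage, and `famE` is excluded because its depth is far above the plateau shared by the other families.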

Testing their tool on synthetic and semi-synthetic datasets, Segata and his colleagues reported that PanPhlAn could detect strain-specific gene repertoires with an accuracy of 92 percent at 2X coverage and an accuracy of more than 98 percent at 10X coverage. This, they said, indicates that their tool can resolve strains better than other metagenomic assembly and strain-tracking tools, such as MetaPhlAn2 and ConStrains.

They also tested PanPhlAn on metagenomic data obtained from the 2011 Escherichia coli outbreak in Germany that was caused by an enteroaggregative strain, O104:H4, that had acquired a Shiga toxin-encoding prophage as well as virulence and antibiotic resistance factors.

After excluding pathogenic E. coli strains from the reference database, cluster analysis based on the PanPhlAn gene family profiles placed the outbreak strains in clusters distinct from the other E. coli subclades. Their subsequent study of genes from the outbreak-linked cluster uncovered the virulence and resistance factors that had been discovered by sequencing the outbreak isolates, including aggR, stx2, and tetA.
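As a rough illustration of how gene family profiles can separate strains, the following Python sketch compares presence/absence profiles using Jaccard distance. The distance measure, the sample names, and the toy profiles are assumptions made for this example (only the gene names aggR, stx2, and tetA come from the study as reported), and the article does not specify which clustering method the researchers used.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of gene-family IDs."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def nearest_profile(name, profiles):
    """Return the name of the profile closest to `name`."""
    others = (other for other in profiles if other != name)
    return min(
        others,
        key=lambda other: jaccard_distance(profiles[name], profiles[other]),
    )

# Toy presence/absence profiles; sample names, "core" and "fam"
# IDs are hypothetical placeholders.
profiles = {
    "outbreak_1": {"core1", "core2", "aggR", "stx2", "tetA"},
    "outbreak_2": {"core1", "core2", "aggR", "stx2"},
    "commensal_1": {"core1", "core2", "fam7"},
}

closest = nearest_profile("outbreak_1", profiles)
print(closest)
```

Here the two outbreak-like profiles share most of their accessory gene families, so they sit closer to each other than either does to the commensal-like profile. A full analysis would feed such pairwise distances into a clustering method.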

Analysis using PanPhlAn also uncovered a suite of gene families and pathways enriched in the outbreak strain, including ones involved in virulence and heavy metal tolerance.

When the researchers added the outbreak strain back into the reference database, PanPhlAn produced a high-quality reconstruction of the pathogenic strain. The researchers said this result outperformed current tools and suggested that the software could accurately identify outbreak strains from metagenomic data.

Segata and his colleagues further applied PanPhlAn to 1,316 publicly available human gut metagenomes to profile any E. coli strains present. These E. coli strains were highly diverse, they noted, falling into six functionally distinct clades and representing the four major E. coli phylogroups.

Samples from the German outbreak, they wrote, formed a subcluster, and network analysis uncovered a strain closely related to the German outbreak strain in a set of Chinese individuals. PanPhlAn analysis showed, though, that the Chinese strain lacked the Shiga toxin-encoding region found in the German strain.

The researchers also used PanPhlAn to analyze Eubacterium rectale and Akkermansia muciniphila, two common gut microbes, in 1,830 gut metagenomes from eight large-scale studies. It clustered E. rectale into three clades, and the two Chinese cohorts in the study data clustered away from the European and North American ones, suggesting a possible geographic adaptation.

For A. muciniphila, meanwhile, they uncovered six clades present in all populations.

Segata and his colleagues also used their tool to examine the metagenomes and metatranscriptomes of five stool samples obtained from five healthy infants. In particular, they focused on the in vivo transcriptional activity of E. coli. Through PanPhlAn, they found that transcription rates across the infants were largely similar, with ribosomal and stress response genes being the most transcribed.