NEW YORK – An international team led by investigators at the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) has assembled collections of human gut microbiome reference genome sequences and corresponding protein products.
"Given the large uncultured diversity still remaining in the human gut microbiome, having a high-quality catalog of all currently known species substantially enhances the resolution and accuracy of metagenome-based studies," senior and co-corresponding author Robert Finn, a microbiome informatics researcher at EMBL-EBI, and his colleagues wrote in a paper appearing in Nature Biotechnology on Monday.
For the Unified Human Gastrointestinal Genome (UHGG) and Unified Human Gastrointestinal Protein (UHGP) catalogs, respectively, the researchers brought together almost 205,000 genome sequences representing more than 4,600 gut microbes, along with the proteins produced by the more than 170.6 million genes in this collection. Together, the catalogs offered a look at the genetic diversity within and between gut microbial species, including differences in the bugs found in gut samples from populations in different parts of the world.
"Intra-species genomic variation analyses revealed a large reservoir of accessory genes and single-nucleotide variants, many of which are specific to individual human populations," the authors reported, noting that UHGG and UHGP "will enable studies linking genotypes to phenotypes in the human gut microbiome."
Using metagenome-assembled genome sequences and cultured gut microbes from several large public databases, the researchers put together a set of nearly 287,000 genomes spanning 204,938 non-redundant genome sequences for 4,644 inferred prokaryotic species in human gut samples from individuals in dozens of countries.
"Genomes were recovered in samples from a total of 31 countries across six continents (Africa, Asia, Europe, North America, South America, and Oceania)," the authors noted, "but the majority originated from samples collected in China, Denmark, Spain, and the United States."
Some 3,207 of those genomes were more than 90 percent complete, the team reported, while 573 genomes could be classified as high quality. In a series of follow-up analyses, the investigators considered each species taxonomically, defined core and accessory genomes, and explored the genetic heterogeneity within species and strains.
Across the full set of 286,997 genomes, the researchers predicted almost 625.3 million full-length protein sequences, including many of the proteins previously reported in the Integrated Gene Catalog (IGC), a collection that encompasses proteins found through the Human Microbiome Project or by members of the Metagenomics of the Human Intestinal Tract, or MetaHIT, consortium.
"With the establishment of this massive sequence catalog, it is evident that a large portion of the species and functional diversity within the human gut microbiome remains uncharacterized," the authors reported, noting that "knowledge of the intra-species diversity of many species is still limited owing to the presence of a small number of conspecific genomes."