Stanford Team Uses Imputation Approach to Uncover Thousands of New Microbiome Proteins


CHICAGO – A team led by researchers at Stanford University has discovered thousands of new proteins in the human microbiome by applying a series of computational techniques to impute the proteins despite the fact that most had no reference genome.

The work recently appeared in an article in Cell. Stanford hematologist and geneticist Ami Bhatt, who runs the laboratory that managed the work, also discussed the findings at the Cold Spring Harbor Laboratory conference on microbiomes and at the Intelligent Systems for Molecular Biology and European Conference on Computational Biology (ISMB/ECCB) conference, both in July.

At ISMB/ECCB in Basel, Switzerland, Bhatt said that the work represents an important step forward in what she called "precision clinical microbiology in the age of sequencing."

Starting with the US National Institutes of Health's Human Microbiome Project dataset, Bhatt's laboratory ran a comparative genomics study on 1,773 "human-associated" metagenomes from the gut, mouth, skin, and vagina. The researchers found 4,539 protein families, most of which they considered novel in that they are not in "traditional" reference genomes or don't include a known protein domain, according to the Cell paper.

By re-analyzing a previously published metatranscriptomic dataset, the Stanford-led team also showed that a subset of the small-protein-encoding genes they predicted are actually transcribed.

"By classifying the protein families according to their taxonomic distribution, their prevalence across human body sites and non-human metagenomes, their predicted cellular localization, their genomic neighborhood and more, we assign putative functions to a subset of the families," they said.

Bhatt told GenomeWeb that her team tried various methods to see if they could find "hints" about the functions of these proteins. These included examining the proteins' taxonomic distribution; determining which of the body-site microbiomes studied encoded the proteins; predicting whether the proteins might be intracellular, transmembrane, or secreted; and studying their genomic neighborhoods.

"This allowed us to create hypotheses regarding potential functions for a subset of these genes," Bhatt said. 

They found that "small proteins are highly abundant and those of the human microbiome, in particular, may perform diverse functions that have not been previously reported," according to the paper.

"We think that these different proteins may be involved in many aspects of biology," Bhatt said.

The researchers also noted that typical gene annotation leaves out small open reading frames (sORFs) and associated small proteins, which hinders the process of matching genes to phenotypes. Their technique leads to a more complete picture of proteins in the human microbiome, according to Bhatt.

Notably, they identified 39 antimicrobial peptide families that may be novel but still require external validation.

The goal of the research, according to Bhatt, was to mine volumes of sequencing data in an effort to better understand the genes in microorganisms in and on the human body.

"We're interested in identifying the dark matter of the human microbiome, be that new organisms and their genomes by applying approaches like the long-read approach, but also within those genomes trying to identify genes that were traditionally overlooked," said Bhatt, who also is a faculty fellow at the Stanford Chemistry, Engineering & Medicine for Human Health (ChEM-H) interdisciplinary institute.

The paper's first author, Hila Sberro Livnat, a postdoctoral fellow in microbial genomics and computational biology at Stanford, proposed using existing computational methods designed to detect signals of conservation in order to reduce false positives.

Bhatt further explained that "it's really easy to get start and stop codons really close to one another, and then you're erroneously calling a small gene when there really isn't one. So I think that concept of using conservation was a really clever one. After that, we basically selected and applied a lot of methods that helped us increase our confidence that these calls were correct, doing things like trying to see if we could identify other features of real genes like ribosomal binding sites."
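
As a rough illustration of the problem Bhatt describes, the short Python sketch below first calls every short stretch between a start codon and an in-frame stop codon, then keeps only those calls that recur in more than one sample. It is a toy under stated assumptions, not the pipeline used in the Cell paper: the function names, the codon-length thresholds, the forward-strand-only scan, and the use of exact sequence recurrence as a stand-in for conservation are all hypothetical choices made for this example.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_small_orfs(seq, min_codons=5, max_codons=50):
    """Naively call small ORFs on the forward strand.

    Returns (start, end, orf_sequence) tuples for every ATG that is followed
    by an in-frame stop codon, where the number of codons before the stop
    (start codon included) lies between min_codons and max_codons.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] != START:
            continue
        # Walk codon by codon from the start until the first in-frame stop.
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in STOPS:
                n_codons = (j - i) // 3
                if min_codons <= n_codons <= max_codons:
                    hits.append((i, j + 3, seq[i:j + 3]))
                break
    return hits


def conserved_orfs(samples, min_samples=2, **orf_kwargs):
    """Keep only naive ORF calls that recur in at least min_samples samples.

    `samples` maps a sample name to a nucleotide sequence. Exact recurrence
    is a crude proxy for conservation, used here only to show how such a
    filter prunes chance start/stop pairs that appear in a single sample.
    """
    seen = {}
    for name, seq in samples.items():
        for orf in {hit[2] for hit in find_small_orfs(seq, **orf_kwargs)}:
            seen.setdefault(orf, set()).add(name)
    return {orf: names for orf, names in seen.items() if len(names) >= min_samples}
```

Calling `find_small_orfs` on a single sequence returns every start/stop pair in the length window, including spurious ones. Passing several samples to `conserved_orfs` discards calls seen only once, which is the intuition behind using conservation to raise confidence in a call.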

Bhatt said that each piece wasn't terribly challenging from a computational perspective since her lab did not develop new algorithms. "Rather, we were taking a creative idea and implementing it kind of in force en masse. That is what I think the real contribution was here."

Anthony Finbow, CEO of Eagle Genomics, a UK-based, microbiome-focused bioinformatics firm, called this approach "original" and "pioneering" in that it imputes function to small proteins without reference data.  

"The identification of this number of newly discovered small proteins is a significant achievement. It provides the foundation for a vast new area of potential application and research," Finbow said by email. "With the proliferation of such microbiome discoveries, novel approaches for collaboration are increasingly necessary."

Bhatt noted that identification of small genes is critical to understanding human-associated microbiomes because such microbiomes are otherwise difficult to annotate.

"We have a health-oriented focus on my lab and what we wanted to do was identify genes that were present or enriched in human-associated microbes that weren't, for example, present in other environmental systems," Bhatt said.

She also wanted to find genes present in both human and environmental microbes. "If we found small genes that were present in human-associated microbes but also in soil microbes and ocean microbes, well, that suggests that these are genes that are involved in very basic biological functions that are shared by many of these organisms or are conserved across a diverse array of environments," Bhatt said.

On the other hand, finding genes specific to humans or even a specific part of the human body might suggest a role in physiology or disease.

Although one of the researchers represents a for-profit bioinformatics company, San Francisco-based One Codex, Bhatt said there are no plans to commercialize the technology.

According to Bhatt, One Codex CEO Nicholas Greenfield happened to have developed a pipeline for taxonomic classification that was useful for this study, so he joined the research project.

The US Department of Energy's Joint Genome Institute at Lawrence Berkeley National Laboratory helped Stanford access much of the environmental metagenomic data they studied. "They also helped carry out some of the computational analyses, especially on those environmental data because the Joint Genome Institute focuses on nonhuman microbiome work," Bhatt noted.

The Joint Genome Institute helped fund the work, as did NIH and the drug industry's PhRMA Foundation.

While Bhatt's lab identified potential new genes, they left it to future researchers to make associations with health conditions.

"We did provide some suggestions of what these genes might do and infer functions by what genes are nearby," Bhatt said, for example. "But in terms of truly de-orphaning these genes at a functional level, that will require a slow, careful, and iterative process. Our hope is that this will form the basis of that."

Bhatt said that her lab is interested in developing the informatics to improve prediction of genes and gene functions. "We're also very interested in potentially generating these proteins [with] predicted antimicrobial activity [for example], so we're really interested in continuing to invest in this area," she said.

"With ever more precise measurements of which organisms are there and what these organisms encode in terms of their gene, we can develop an increasingly precise understanding of the organisms that coexist with us, and with that, we can better understand how they may relate to our health," Bhatt said.