This story was originally posted April 15.
A recent computational study has determined that gene-prediction algorithms for prokaryotic genomes fail to detect small genes belonging to nearly 400 gene families that are currently not represented in any annotation databases.
The study, conducted by researchers at the Virginia Bioinformatics Institute and Virginia Tech and published recently in BMC Bioinformatics, used mpiBLAST to perform an all-against-all sequence search of 780 microbial genomes to identify regions of similarity that indicate likely coding regions. After filtering for alignments that represent different taxonomic families, the team estimated that at least 1,153 candidate genes from 380 gene families are missing from current prokaryotic genome annotations. In addition, the researchers identified nearly 39,000 intergenic open reading frames that are similar to currently annotated genes, which they termed "missing annotations."
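As a rough illustration of the filtering step described above, the sketch below keeps only alignment hits whose two sequences come from different taxonomic families. The `Hit` record, the family labels, and the E-value cutoff are hypothetical choices made for this example; the article does not describe the team's actual data format or thresholds.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One pairwise alignment hit (hypothetical record layout)."""
    query_orf: str
    query_family: str    # taxonomic family of the query genome
    subject_orf: str
    subject_family: str  # taxonomic family of the subject genome
    evalue: float


def cross_family_hits(hits, max_evalue=1e-5):
    """Keep significant hits whose two ORFs come from different families.

    max_evalue is an assumed cutoff, not a value taken from the study.
    """
    return [
        h for h in hits
        if h.evalue <= max_evalue and h.query_family != h.subject_family
    ]


# Made-up example data: one same-family hit, one cross-family hit,
# and one cross-family hit that is too weak to keep.
hits = [
    Hit("orfA", "Enterobacteriaceae", "orfB", "Enterobacteriaceae", 1e-30),
    Hit("orfA", "Enterobacteriaceae", "orfC", "Bacillaceae", 1e-12),
    Hit("orfD", "Bacillaceae", "orfE", "Vibrionaceae", 0.5),
]
kept = cross_family_hits(hits)
print([h.subject_orf for h in kept])
```

Requiring similarity across families is a conservative choice: a short ORF that is conserved only among close relatives could simply reflect recent shared ancestry rather than selection on a coding sequence.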
The "vast majority" of the missing genes were shorter than 100 amino acids, which makes sense given how gene-finding programs are designed, according to Joao Carlos Setubal, a VBI researcher and an author on the paper.
Gene-prediction algorithms are "all statistical in nature, and they're all trying to pick up the statistical signals of protein-coding genes," Setubal said. "It is just a statistical fact that the longer the gene is, the stronger the statistical signal will be — so much so that it's a rule of thumb in this business that any ORF in a prokaryotic genome that has more than a thousand base pairs is very, very likely to be a protein-coding gene."
Small genes — on the order of tens of amino acids — "may not have enough statistical signal for these gene finders to reliably pick them up," he said.
To confirm this, the VBI team analyzed several subsets of the data with the gene-finding programs Glimmer, EasyGene, and GeneMark and found that in most cases at least one of the software tools failed to find the potentially missing genes.
Mark Borodovsky, director of Georgia Tech’s Center for Bioinformatics and Computational Genomics and the developer of GeneMark, agreed that existing tools have trouble when it comes to identifying small genes, but said that's by design.
"Standard protocols of gene finding have to put a threshold on the minimal length of computationally predicted genes in order to avoid a large number of false positive predictions. Inevitably, some real short genes are missed," he said. As a result, the VBI team's findings "could be expected."
He added that the study uncovered a "surprisingly" small number of missed genes, noting that the average turned out to be only two missed genes and 50 missing annotations per genome.
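The effect of the length threshold Borodovsky describes can be seen in a toy example. The sketch below is a naive forward-strand ORF scan with a minimum-length cutoff; it is not how GeneMark, Glimmer, or EasyGene work, and the function name, the cutoff values, and the synthetic sequence are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons):
    """Return (start, end, n_codons) for forward-strand ORFs.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon. n_codons counts the start codon but not the stop.
    ORFs shorter than min_codons are discarded.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, n_codons))
                start = None
    return orfs


# Synthetic sequence: a 21-codon ORF followed by a 121-codon ORF.
short_gene = "ATG" + "GCT" * 20 + "TAA"
long_gene = "ATG" + "GCT" * 120 + "TAA"
genome = short_gene + long_gene

strict = find_orfs(genome, min_codons=100)   # drops the short ORF
lenient = find_orfs(genome, min_codons=10)   # keeps both
print(len(strict), len(lenient))
```

With the stricter cutoff only the long ORF survives. Lowering the cutoff recovers the short one, but on a real genome it would also admit many spurious ORFs, which is the false-positive trade-off described above.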
Setubal said that he and his colleagues didn't have any expectations going into the study about the extent of missing genes, but stressed that they applied very stringent criteria in deciding whether a given set of sequences represented a coding gene.
"We wanted to be very conservative in terms of which sequences we would list as gene candidates, and therefore the numbers I think are relatively small," he said. "But I believe that based on our findings there should be more out there, and I think the paper sort of points the way on how these other genes can be detected so that you can create a more complete picture of the gene coding information in prokaryotic genomes."
Setubal acknowledged that it won't come as a shock to anyone that some genes are being overlooked, and conceded that several previous studies have shown that gene finders have missed small genes in specific genomes. Nevertheless, he said that the VBI study "was the first to do a very large-scale survey to try to discover whether this phenomenon was pervasive among genomes, and the answer was basically yes."
Indeed, no one has performed such a comprehensive assessment of all the prokaryotic genomes in GenBank before because it has been computationally challenging. For this study, the VBI team relied on an "ephemeral supercomputer" that aggregated more than 12,000 processor cores from seven different supercomputers across the US. According to the researchers, the project would have taken 90 years on a single PC, but required only 12 hours on the distributed system.
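For a sense of scale, the figures quoted above imply the following speedup. This is back-of-the-envelope arithmetic on the rounded numbers in the article, with the core count taken as exactly 12,000 and leap years ignored; it is not a measurement.

```python
HOURS_PER_YEAR = 365 * 24                # ignoring leap years

single_pc_hours = 90 * HOURS_PER_YEAR    # "90 years on a single PC"
distributed_hours = 12                   # "only 12 hours"
cores = 12_000                           # "more than 12,000 processor cores"

speedup = single_pc_hours / distributed_hours
per_core_ratio = speedup / cores

print(f"implied speedup: {speedup:,.0f}x")
print(f"speedup per core: {per_core_ratio:.2f}")
```

The implied speedup is larger than the number of cores. That would be consistent with a single-PC estimate based on slower hardware than the supercomputer nodes, though the article does not say how the researchers arrived at the 90-year figure.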
Setubal stressed that the computational study is just a first pass at potentially overlooked genes. "All we did was uncover gene candidates. Whether they are real genes or not depends on experimentation," he said.
He noted that experimental verification of some of the missing genes in model organisms would go a long way toward improving gene prediction for newly sequenced prokaryotic genomes, since once they are included in an annotation database, they would serve as a reference for gene-finding programs.
The findings could benefit metagenomics projects, for example, because the study uncovered gene families that are currently absent from any annotation databases. "If members of these families are validated experimentally and are placed in the databases, then any other genome project or metagenomics project that relies on the databases to identify their genes will immediately benefit from that inclusion," Setubal said.
For many newly sequenced genomes, the issue of detecting small genes isn't that much of a concern, he said. "For any new genome, you're likely to get between 20 percent and 35 percent of genes for which you don't know what the function is. And these are, compared to the genes in our list, fairly large. We're talking genes with 150 amino acids, 200 amino acids, and so on.
"So that problem, I think, is seen as more important than one in which you want to make sure you detect every little gene that may be coded by this organism," Setubal said.
On the other hand, he said that researchers shouldn't overlook these genes just because they're small. "I think there's an instinctive correlation here between importance and size. We have to be careful with that," he said.
Citing the massive amount of progress that has been made in the field of small RNA research, Setubal said, "we shouldn't make the mistake of thinking that small proteins are second-class citizens or third-class citizens." It's possible, he added, "that you may have small proteins in cells doing very important work."
The study's findings could also help developers improve the sensitivity of gene-prediction programs, he said. "Our paper provides a set of candidates that people who are working on these gene finders can now use and check their algorithms and determine whether the treatment of the statistical properties of these genes can be somehow tweaked so that the threshold can be lowered."
Borodovsky said that the study's findings are not "a big concern" for current gene-finding methods but that they are helpful. "It is good to be conservative and avoid false positives and then perform a clean up like this once in a while to account for missed real genes."
Furthermore, "as the number of sequenced genomes grows, missed short genes can be detected by collecting conserved-in-evolution short ORFs — exactly what the project was about."
Setubal said that the study's findings only apply to prokaryotic gene prediction because eukaryotic genomes have introns and other features that make gene finding "a completely different ballgame."
In addition, he said, "in prokaryotes you have a much, much bigger molecular diversity than you do with eukaryotes. So if you're comparing mouse to human to fish, you will see a lot of conservation … whereas when you're working with bacteria or archaea, you have much, much bigger diversity in terms of genes, in terms of proteins, and so on."
As a result, he said, "the approach we took to uncover these gene families at heart is the same approach that people are already using for eukaryotic genomes. Whenever a new mammalian genome or vertebrate genome is sequenced, people immediately compare it to the other vertebrate genomes to see which segments are conserved, which are not."
Nevertheless, he said, even though the work has no direct implications for eukaryotic gene finders, it underscores the fact that "you should pay attention to small things that you may be missing" in any genome.