Headlines across the world trumpeted the high degree of similarity between mice and men following the publication of the mouse genome sequence in Nature last week, although the data itself offered few surprises for the genomics community — after all, it’s been publicly available since May. But buried within the densely packed 45-page paper, a new family of bioinformatics tools quietly made its debut. Three new programs for dual-genome de novo gene prediction played a role in tallying up the final gene count for mouse and, in the process, verified that despite initial skepticism, the Human Genome Project’s estimate of 30,000-40,000 protein-coding human genes was fairly accurate.
That’s not to say that all uncertainty has been wiped out of the ongoing gene count debate. The mouse paper’s authors pointedly note several times that the computational process “remains imperfect and the predictions are tentative.” The international Mouse Genome Sequencing Consortium assessed a wide range of available gene prediction tools for its analysis, and settled on a combination of approaches that met the modest-sounding goal of providing “an improved catalogue of mammalian protein-coding genes.”
The degree of improvement is still a bit fuzzy, however. Ewan Birney, who oversaw much of the gene prediction process, was quick to note that ambiguity remains an inherent part of the process. “It’s not just scientific nicety, saying that we don’t know who’s right or who’s wrong. We honestly don’t know who’s right or who’s wrong,” he explained. But the human genes and mouse genes that are discussed in the paper “are sets that we end up having more confidence in because of the way they’re built.”
How much more confidence? Nobody’s really sure. “It’s just much better,” Birney offered as reassurance.
New Data, New Tools
Roderic Guigó, who developed the dual-genome gene prediction program SGP2, said that such gene finders have been a hot area of development in the last two years, and that around eight similar software tools are available now.
These tools have sprung up since the initial human paper in February 2001, as mammalian EST and cDNA collections have expanded and coverage of the mouse and human genome sequence has increased.
Initially, additional transcript and sequence data gave evidence-based gene prediction methods, such as EBI/Sanger’s Ensembl pipeline and the Genie pipeline from Affymetrix, fresh material to revise their initial predictions. For example, Ensembl predicted 31,778 human genes in February 2001, with an average of 4.2 exons per transcript, but predicted 22,808 genes upon revisiting the count in September 2002, with an average of 8.7 exons per transcript. “So basically, when there is evidence for a gene, we’re basically getting the whole shebang now,” said Birney.
The mouse gene count began with an initial catalogue built with the Ensembl and Genie pipelines. However, these methods can only detect exons that are supported by known transcripts or homology to cDNAs or ESTs in other organisms, raising doubts about how many genes might be missing because experimental evidence doesn’t yet exist for them. GenScan and other de novo gene prediction methods rely on statistical properties of gene features to predict genes directly from sequence data, but are prone to false positives.
But with the mouse genome in hand, the analysis team was able to take advantage of comparative genomics to enhance its predictions. Three new programs for dual-genome de novo gene prediction — Twinscan from Washington University, SGP2 from Universitat Pompeu Fabra, and SLAM from the University of California, Berkeley — were used to identify around 1,000 additional new mammalian genes with a fairly high degree of confidence.
These approaches lower the false positive rate of single-genome prediction programs by discarding any gene prediction that is not conserved in both species.
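The conservation filter at the heart of the dual-genome approach can be sketched in miniature. The interval representation, alignment map, and overlap test below are illustrative assumptions, not the actual internals of Twinscan, SGP2, or SLAM:

```python
# Toy sketch of dual-genome filtering: keep only gene predictions whose
# aligned region in the second genome also contains a prediction.
# Data structures here are assumptions for illustration only.

def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def filter_conserved(human_preds, mouse_preds, human_to_mouse):
    """Keep human predictions whose aligned mouse region also holds a prediction.

    human_preds, mouse_preds: lists of (start, end) intervals.
    human_to_mouse: dict mapping a human interval to its aligned mouse interval.
    """
    kept = []
    for pred in human_preds:
        mouse_region = human_to_mouse.get(pred)
        if mouse_region and any(overlaps(mouse_region, m) for m in mouse_preds):
            kept.append(pred)
    return kept

human_preds = [(100, 500), (9000, 9400)]
mouse_preds = [(210, 640)]
human_to_mouse = {(100, 500): (200, 620)}  # only the first gene aligns to mouse

print(filter_conserved(human_preds, mouse_preds, human_to_mouse))
# [(100, 500)] -- the prediction at 9000-9400 has no mouse support and is dropped
```

The real programs work at the level of aligned exons and splice signals rather than whole-gene intervals, but the principle is the same: a prediction with no counterpart in the other genome is treated as a likely false positive.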
Of these new programs, SGP2 and Twinscan offered several advantages over other cross-genome predictors, which Guigó said are still "mostly prototypes."
Guigó contributed an additional level of specificity to the dual-genome prediction process by writing a filtering program to highlight those predicted genes that also exhibited strong conservation of exonic structure between the two species.
These genes were also verified experimentally with reverse transcription PCR. “For those genes in which we found common predictions in both genomes that had exonic structure conserved, they were much more likely to correspond to real genes than gene predictions that only appeared in one genome or that were similar but had no conservation of exonic structure,” Guigó said.
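One simplified way to picture Guigó's filter: two orthologous predictions share a conserved exonic structure if they have the same number of exons with similar lengths. The comparison and the tolerance value below are assumptions for illustration; the published protocol is more involved:

```python
# Simplified exonic-structure conservation test: two predictions "match"
# if they have the same exon count and each corresponding exon pair has
# a similar length. The 20% tolerance is an illustrative assumption.

def structure_conserved(exons_a, exons_b, tolerance=0.2):
    """exons_a, exons_b: lists of exon lengths (bp) for one gene in each genome."""
    if len(exons_a) != len(exons_b):
        return False
    for la, lb in zip(exons_a, exons_b):
        if abs(la - lb) > tolerance * max(la, lb):
            return False
    return True

# A human/mouse pair with a similar three-exon structure passes...
print(structure_conserved([120, 88, 210], [117, 90, 204]))  # True
# ...while a pair with different exon counts fails.
print(structure_conserved([120, 88, 210], [120, 300]))      # False
```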
After filtering, an additional 1,000 or so predicted genes were added to the evidence-based set, a number Guigó considers surprisingly small. "When we started this project, we thought that the comparison with the human genome sequence was going to lead to the discovery of a substantial fraction of yet-to-be-known genes. But the truth is that by this analysis, we have support for only about 1,000 potential more novel genes."
Birney noted that the dual-genome gene predictors add a new layer of confidence to the initial mammalian gene estimate, which now appears to be closer to 30,000 than 40,000. The results mean “there’s just fewer places for genes to hide,” he said, and that the “dark matter” of the 20,000 or so genes that many assumed to be unaccounted for seems far less mysterious.
Now the Bad News
Despite their renewed confidence in the catalogue of mammalian protein-coding genes, both Birney and Guigó stressed that many challenges remain in computational gene prediction. Single-exon genes, for example, remain largely unaccounted for. And a family or families of human-specific genes could escape the dual-genome approach entirely, because their sequences would not be conserved between the two species.
In addition, pseudogenes — “fossils” of dead genes that still resemble live genes structurally — are a huge thorn in the side of genome analysis. Birney estimated that around 3,000-4,000 genes in the final count may be pseudogenes, and “it’s actually very hard to definitively say whether something’s a pseudogene or not.”
RNA-coding genes present an entirely different challenge, one the analysis team barely touched upon in the mouse paper.
Guigó pointed out an additional problem that he plans to address in future research: Current gene prediction programs can't detect genes coding for proteins that contain the so-called "21st amino acid," selenocysteine, because selenocysteine is encoded by TGA, which ordinarily acts as a stop codon. "And of course the next problem," he added, "is alternative splicing."
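The selenocysteine problem can be seen with a toy translation routine. The sequence and the minimal codon table below are invented for illustration, and the sketch ignores the SECIS element that real selenoprotein mRNAs use to recode TGA:

```python
# Under the standard genetic code, TGA is read as a stop codon, so an ORF
# with an internal TGA meant to encode selenocysteine (Sec, U) is truncated
# prematurely by any standard-code gene predictor or translator.

CODON_TABLE = {
    "ATG": "M", "TTT": "F", "GGA": "G", "AAA": "K",
    "TGA": "*", "TAA": "*",  # '*' marks stop codons
}

def translate(dna):
    """Translate codon by codon, halting at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# The internal TGA halts translation, losing the downstream GGA and AAA codons:
print(translate("ATGTTTTGAGGAAAATAA"))  # "MF"
```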
“We can’t think we have found all the human genes yet,” Guigó remarked. “After the mouse genome, we start having a good picture of the set of human genes, but this is still incomplete.”
Guigó has just begun a collaborative project with Michael Brent from Washington University and Stylianos Antonarakis from the University of Geneva "to get RT-PCR evidence for almost all human genes." In addition, a paper describing SGP2 will appear in Genome Research in January, and a paper describing the computational/experimental/filtering protocol that he developed is in press at PNAS. Papers describing Twinscan and SLAM are also in press at Genome Research.
Additionally, the full results of the SLAM analysis are available at http://bio.math.berkeley.edu/slam/mouse/.