Researchers at Cornell University recently discovered roughly 300 previously unidentified human genes and several hundred extensions of already known genes. The group, led by Adam Siepel, assistant professor of computational biology and biological statistics at Cornell, used three specially scripted algorithms to compare alignments among human, rat, chicken, and mouse genomes in varying configurations in order to identify conserved genes.
The research project demonstrated the value of a computational gene-finding approach over traditional sequencing methods using mRNAs or cDNA libraries, which can sometimes miss genes expressed at lower levels or only in certain tissues or particular stages of development. “With our comparative computational approach, you do not rely at all on the presence of mRNAs that you sequence by these random methods,” says Siepel. “Instead, you look for statistical signatures through comparative sequence analysis to find regions that are evolving in gene-like ways by comparing the human genome and these other mammalian genomes.”
According to Siepel, the biggest challenge the researchers faced during the three-year study was that the set of known genes is always a moving target. “Every day you go back to the database and there are new genes in there, and so we had to work out a fairly complicated way of assessing novelty by comparing what we had done to the database of known, publicly available genes,” he says. Some of the genes they found are involved in motor activity, cell adhesion, and central nervous system development.
Siepel says that this computational approach has broad applications. He is now applying it to single exon genes. “If there are missing human genes, there is a good bet that a lot of them will be these single exon genes because of these challenges in identifying them,” he says. “We are using these comparative methods to identify single exon genes systematically across the genome with comparative sequence data we already have.”
The group is now focused on identifying functional sequences that are not protein-coding genes, as well as genes that have been gained or lost in different species.