A new gene prediction program from Michael Zhang and his colleagues at Cold Spring Harbor Lab could be an effective tool as biologists shift their focus from coding regions of the genome to regulatory regions, according to Zhang.
The program, called FirstEF, is able to detect non-coding first exons, a class of gene segments that has been very difficult to detect by conventional methods that rely on protein coding patterns.
Instead, Zhang said, the program relies on a set of discriminant functions that can recognize structural and compositional features such as CpG islands, promoter regions, and first donor sites.
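To give a sense of what a compositional feature like a CpG island looks like in practice, here is a minimal Python sketch. It is not FirstEF's discriminant function, which the article does not describe in detail. It uses the widely cited Gardiner-Garden and Frommer rule of thumb (at least 200 bp, GC content above 50 percent, observed/expected CpG ratio above 0.6), and the function names and default thresholds are illustrative assumptions.

```python
from typing import Tuple


def cpg_stats(seq: str) -> Tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # number of CpG dinucleotides
    gc_fraction = (c + g) / n
    # Expected CpG count if C and G were placed independently.
    expected = (c * g) / n
    obs_exp = cpg / expected if expected > 0 else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5,
                          min_obs_exp: float = 0.6) -> bool:
    """Apply the rule-of-thumb thresholds (assumed here, not FirstEF's)."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp
```

A real predictor would combine a score like this with promoter and donor-site signals rather than apply a hard cutoff on a single feature.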
In an initial analysis of human chromosomes 21 and 22, the program correctly located 86 percent of known first exons on those chromosomes with a 17 percent false positive rate, twice the sensitivity of the commercially available PromoterInspector program from Genomatix at the same specificity level, Zhang said.
Most current gene-finding programs, such as Genscan, Fgenes, and MZEF, are able to easily identify the internal coding regions of a gene and are ultimately designed to predict whole genes. But Zhang said that by focusing on exon prediction, FirstEF avoids errors that may arise from the lack of knowledge about alternative splicing. "There are a lot of things about alternative splicing that we don't understand right now, so there are going to be a lot of mistakes if you try to force those splice exons together to form a gene model," said Zhang.
By classifying all the possible exons into exclusive classes, such as internal coding exons, untranslated exons, first or last exons, and the like, Zhang said he was able to develop detection tools specific to each type of exon.
"When people talk about gene finding, almost everybody is talking about finding the coding regions," Zhang said. "So all the current gene finding programs start with ATG, the translation start site, and end with the translation stop site."
While this has served the needs of biological research until now, Zhang said that only recently has enough genomic sequence become available to understand the non-coding regions of the genome that control the regulatory elements. "These regions have functional elements that determine when, where, and how much the gene should be expressed," he said.
With the recent boom in microarray experiments to study gene expression, Zhang said interest in accurately predicting these regulatory elements should grow as well.
To train their model, Zhang and his colleagues created a database of first exons and promoters of 2,139 known genes by mapping full-length 5' untranslated regions to their genomic sequences. Based on this data, the team estimated that approximately 40 percent of human genes have completely non-coding first exons.
The team ran FirstEF on the assembled sequence of the human genome from the UCSC working draft and was able to predict 68,645 first exon clusters. While acknowledging that the number of exons is not indicative of the number of genes, Zhang didn't hesitate to throw his hat in the human gene tally ring: His guess for the total number of genes is between 50,000 and 60,000.
A paper on the program will appear in the December issue of Nature Genetics, at which time it will be available for download from www.cshl.org/mzhanglab. Zhang said the program would be free to academic users, while commercial users will be able to license it from the CSHL licensing office.
Reaction to the new program has been positive, although few have had access to it so far. Jim Kent of UCSC said he hopes to add a FirstEF track to the UCSC Human Genome Browser. Mark Borodovsky of the Georgia Institute of Technology, who wrote one of the first gene-finding programs, GeneMark, in the mid-1980s, said he was unfamiliar with FirstEF, but remarked, "Finding the first exon is very important and has been a difficult problem in the field. Any significant progress in this area is good news."
Others raised the question of whether the prediction of 68,645 first exons in the human genome may be due to a high false positive rate. Zhang did note in his paper that he expects approximately 1,586 CpG-related and 11,601 non-CpG-related false predictions out of this total, but added, "This number of false predictions might be small enough for experimentalists to test all of the false positives for expression."
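For readers who want to put those figures in proportion, the arithmetic can be checked in a few lines of Python. The sketch uses only the numbers quoted above; the variable names are illustrative, and it makes no claim about how Zhang arrived at his estimates.

```python
# Figures quoted in the article.
total_predictions = 68_645   # predicted first exon clusters
false_cpg = 1_586            # expected CpG-related false predictions
false_non_cpg = 11_601       # expected non-CpG-related false predictions

# Combined expected false predictions and their share of the total.
expected_false = false_cpg + false_non_cpg
false_fraction = expected_false / total_predictions

# Predictions left once the expected false ones are set aside.
remaining = total_predictions - expected_false

print(f"Expected false predictions: {expected_false:,}")
print(f"Share of all predictions:   {false_fraction:.1%}")
print(f"Remaining predictions:      {remaining:,}")
```

By this count, roughly one prediction in five would be expected to be false.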