NEW YORK--With the sequencing portion of human genome analysis nearly complete, estimating how many genes the genome contains has become a popular pastime for the genomics set.
Following heated debate at this year’s Cold Spring Harbor Genome Conference, a Gene Sweepstakes web page was established (http://www.ensembl.org/genesweep.html) to allow scientists to place bets on the number of genes that will ultimately be found in the human genome.
Illustrating the wide range of guesses, three papers that are set to be published in Nature Genetics this month explain different methods for arriving at the total gene count and offer estimates ranging from 28,000 to 120,000 genes.
Although many scientists who have placed their bets admitted that to an extent their exact guesses are random, they said that the methods they used to arrive at their ballpark figures are sound. There are two major differences among researchers’ methods: one is in the way candidate gene sequences were selected; the other is in the software that was used to analyze them.
Offering an estimate at the high end of the range was John Quackenbush of the Institute for Genomic Research in Rockville, Md. Quackenbush reported in his Nature Genetics paper that when all is said and done, the human genome would contain about 120,000 genes. His team started with nearly 1.6 million expressed sequence tags from the US National Center for Biotechnology Information’s dbEST, a GenBank database that contains sequence data and other information on single-pass cDNA sequences. “We cleaned them, grouped them into clusters based on sequence similarity, and assembled them to produce consensus sequences,” Quackenbush said. In the end, Quackenbush’s TIGR team was left with 1.1 million ESTs in 73,655 assemblies, which he said was a good starting point for the analysis.
TIGR arrived at two independent estimates for the number of genes. Both were about 120,000. For the first, the group compared EST data with the number of annotated genes in GenBank. “What was surprising to us,” Quackenbush said, “was that approximately 45 percent of the annotated genes do not have EST hits.” He extrapolated from the 73,655 sequences to arrive at about 134,000 genes. Accounting for possible redundancy, the estimate falls to 110,000 and the average of the two is 122,000.
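The arithmetic behind the first estimate can be checked with a short sketch. The article does not spell out the extrapolation, so the scaling step here is an assumption: if about 45 percent of annotated genes have no EST hit, the 73,655 assemblies are taken to represent the remaining 55 percent. The function name is illustrative, and the 110,000 redundancy-adjusted figure is used as quoted because the adjustment itself is not described.

```python
# Figures quoted in the article.
ASSEMBLIES = 73_655            # EST assemblies TIGR was left with
FRAC_WITHOUT_EST_HIT = 0.45    # share of annotated genes with no EST hit
REDUNDANCY_ADJUSTED = 110_000  # quoted as-is; the adjustment is not described


def extrapolate_gene_count(assemblies: int, frac_without_hit: float) -> float:
    """Scale the assembly count up to a whole-genome estimate.

    Assumption (not stated in the article): the assemblies cover only
    the fraction of genes that do have EST hits.
    """
    return assemblies / (1.0 - frac_without_hit)


raw_estimate = extrapolate_gene_count(ASSEMBLIES, FRAC_WITHOUT_EST_HIT)
rounded_estimate = round(raw_estimate, -3)  # nearest thousand
average = (rounded_estimate + REDUNDANCY_ADJUSTED) / 2

print(f"Raw extrapolation:    {raw_estimate:,.0f}")
print(f"Rounded:              {rounded_estimate:,.0f}")
print(f"Average with 110,000: {average:,.0f}")
```

Under that assumption the rounded extrapolation and the average match the 134,000 and 122,000 figures Quackenbush gave.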
In the second method, TIGR scientists searched chromosome 22, the sequence of which was published last December, and found a large number of genes that do not appear in the published annotation. “By extrapolating, taking into account the chromosome size, the relative gene richness based on EST mapping data, and potential redundancies in our dataset,” Quackenbush said, his group arrived at an estimate of 118,000 for the entire genome. He said the team used CAP-3 assembly software, which is made by Pasadena, Calif.-based Paracel.
Phil Green of the University of Washington, Seattle, generated the second estimate that will be published in Nature Genetics. Green used his own software to reach an estimate of only 35,000 genes. Green’s software suite includes the programs Phrap, which assembles shotgun DNA sequences, and Phred, which reads DNA sequencer trace files, calls bases, assigns a quality value to each base, and writes the base calls and quality values to output files. The software is licensed and sold by CodonCode of Dedham, Mass.
Green said other groups have higher estimates because they are based simply on counting EST contigs, which gives “misleadingly high results because of the many artifacts in EST datasets and because there can be several EST contigs” for a single gene. After eliminating low-quality sequences, Green began his analysis with 992,353 ESTs.
The third method, published by Jean Weissenbach and colleagues at Genoscope in France, compared the human genome sequence with that of the pufferfish Tetraodon and extrapolated based on evolutionary conservation. The team sequenced about one-third of the Tetraodon genome and used a homology-search method called Exofish to reach estimates of about 28,000 to 34,000 human genes.
One reason the range of estimates may be so wide is that the definition of the gene is changing. Temple Smith, a professor of biomedical engineering at Boston University and an expert in applying computer and math models to gene sequencing, told BioInform: “It is important to note the old gene definition of one protein product to one DNA region or gene just does not work. Even the more general definition of one contiguous DNA region that encodes one or more proteins sharing at least one common coding exon wouldn’t quite do.” He said that because of alternative splicing and initiation points along DNA, there can even be genes within genes.
As for those who have chosen to stake their money on winning the GeneSweep, the rules are simple: It costs $1 to make a bet this year, $5 in 2001, and $20 in 2002. The method used to determine what a gene is will be voted on at the 2002 Cold Spring Harbor meeting, and the number of genes will be assessed in 2003. The winner gets the pot, plus a copy of “The Double Helix,” autographed by author and Cold Spring Harbor Laboratory president James Watson.