NEW YORK, Aug. 24 – Researchers at two California-based labs have contested the widely cited gene count for the human genome, suggesting in a letter to a scientific journal on Friday that the actual number of human genes may be significantly greater than the roughly 30,000 estimated by Celera Genomics and the Human Genome Project.
In a letter to the editor of Cell, a group of scientists led by Michael Cooke and John Bogenesch at the Genomics Institute of the Novartis Research Foundation, together with researchers at the Scripps Research Institute, said a comparison of the two published versions of the human genome showed for the first time that they have only about 16,000 genes in common. Thus, if the two teams of researchers have accurately predicted their additional 26,000 genes, the total number of genes should equal at least 42,000.
The discrepancy highlights the uncertainty inherent in the computer models used to find the locations of genes in the human genome, as well as the limits of scientists’ understanding of the human genome itself.
Before the Human Genome Project released its draft of the human genome, many scientists assumed that the complexity of human life could be explained only by the existence of at least 100,000 human genes. Even today, many scientists insist the total number must be greater than 30,000.
The Novartis team did not hazard its own guess as to the number of genes in the human genome.
In addition to comparing the two sets of gene transcript data taken from Celera and the Human Genome Project, the Novartis-led group also compared the two data sets with a set of reference transcripts, called Refseq, curated by the National Center for Biotechnology Information. The comparison showed that 9,300 genes, or over half of the genes that both the public and private sequencing teams predicted, were found in the Refseq library.
“The 9,300 [gene transcripts] that match Refseq are very accurate and that’s the end of the game right there,” Bogenesch told GenomeWeb. “For the 6,552 additional ones predicted by both groups, we have high confidence in them as well.” Bogenesch added that he has “less confidence” in the rest of the genes the two groups predicted, and that other methods, such as cloning RNA from cDNA libraries or studying gene expression with DNA microarrays, would be required to validate their existence.
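The figures quoted in the letter can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are hypothetical, and the 26,000 figure is the article's rounded count of genes predicted by one team but not the other, so the totals are approximate.

```python
# Figures as quoted in the article; variable names are illustrative.
in_refseq = 9_300          # predicted by both teams and matched in Refseq
both_not_refseq = 6_552    # predicted by both teams, not in Refseq
single_team_only = 26_000  # rounded; predicted by only one of the two teams

# Genes the two predicted sets have in common (the "about 16,000").
shared = in_refseq + both_not_refseq

# Fraction of the shared genes confirmed by Refseq (the "over half").
refseq_share = in_refseq / shared

# Total if every prediction from both teams proves accurate (the "about 42,000").
total_if_all_accurate = shared + single_team_only

print(f"shared by both teams: {shared:,}")
print(f"share of shared genes in Refseq: {refseq_share:.1%}")
print(f"total if all predictions hold: {total_if_all_accurate:,}")
```

The shared count works out to 15,852, which the article rounds to 16,000, and adding the roughly 26,000 single-team predictions gives the figure of about 42,000.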
However, in a preliminary study described in the letter, the Novartis team tested how many of the genes not found in Refseq but predicted by Celera and the Human Genome Project could be identified in a bank of 13 different human tissues. Using RNA expression profiling, Bogenesch and his colleagues detected about 80 percent of these predicted gene transcripts in the tissues.
On the basis of this result, the Novartis team contends that many of the gene predictions are accurate, but that, taken individually, both the public and private sets of predicted genes are incomplete.
“Our initial feeling is that the genome groups have been leaning on the conservative side and we may have an underestimate,” Bogenesch said. “But we really have no idea right now until we look at all of them what the final number is going to be. And because these two groups predicted different genes, who’s to say that a third group won’t [develop] an algorithm and predict yet additional genes in the genome? In the end it’s probably going to take several years for us to have a reasonable understanding of how many genes there are.”
Before submitting their data for publication in a full research article, the Novartis team plans to perform additional RNA expression experiments using 50 to 100 types of human tissues in an effort to validate the gene predictions.