While the hotly anticipated sequencing papers of Celera and the International Human Genome Project offered few surprises for many in the industry, their consensus that there are fewer human genes than anticipated did leave some scratching their heads.
The finding that there are only 26,000-40,000 genes raised a number of questions regarding the accuracy of the gene prediction algorithms used by each side, as well as the validity of commercial gene databases that claim to offer significantly more genes. Incyte Genomics, for example, has claimed it offers 120,000 human genes in its database, while Human Genome Sciences says it has identified 100,000 human genes.
The actual gene count may fall somewhere in between, according to researchers from Rosetta Inpharmatics and Ohio State University, who separately tallied the genes in chromosome 22 using new methods that they said overcome the limits of the bioinformatics approaches used by both Celera and the HGP.
Rosetta, located in Kirkland, Wash., published a paper in Nature last week that demonstrates its new gene discovery technology, Genecipher. The technique, a combination of ink-jet microarray technology and proprietary bioinformatics, confirmed the existence of 567 “expression-validated genes” on chromosome 22, the company said.
The public project’s initial report of the sequence of chromosome 22 in 1999 identified 545 known and predicted genes. Since then, the HGP has confirmed the existence of only 17 additional genes in the chromosome, and has merged some previously identified gene predictions.
Celera’s gene prediction algorithms estimated chromosome 22 to contain between 147 and 835 genes.
“Craig Venter and the Human Genome Project are saying we have 30,000 genes, but they don’t really know that. It’s an estimate from computational analysis,” said Ian McConnell, a Rosetta spokesman. “Whereas this is the technology that is really going to identify what are genes and what aren’t, what is expressed and what isn’t.”
The Rosetta technique first fabricates “exon arrays” of oligonucleotide probes derived from predicted exons. Expression data from the exon arrays is used to identify genes through co-regulated expression patterns. Rowan Chapman, director of business development at Rosetta, said that this is the first technology to use expression data to determine gene structure.
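Rosetta has not published implementation details, but the core idea — that exons belonging to the same gene should rise and fall together across experimental conditions — can be sketched roughly. In the toy example below, the expression profiles, the greedy adjacent-exon merging, and the 0.9 correlation threshold are all hypothetical simplifications, not Rosetta's proprietary method:

```python
# Sketch: group predicted exons into putative genes by correlated
# expression across conditions. All data and the 0.9 threshold are
# invented for illustration; Genecipher's actual algorithm is proprietary.

def correlation(a, b):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def group_exons(profiles, threshold=0.9):
    """Greedily merge adjacent exons whose profiles correlate strongly.

    profiles: list of expression vectors, one per exon, in genomic order.
    Returns a list of exon-index groups, each a putative gene.
    """
    groups = [[0]]
    for i in range(1, len(profiles)):
        # Compare each exon with its upstream neighbor along the chromosome.
        if correlation(profiles[i - 1], profiles[i]) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

# Toy data: exons 0 and 1 are co-regulated; exon 2 behaves independently.
exon_profiles = [
    [1.0, 2.0, 4.0, 8.0],
    [1.1, 2.1, 3.9, 8.2],
    [5.0, 1.0, 6.0, 2.0],
]
print(group_exons(exon_profiles))  # → [[0, 1], [2]]: two putative genes
```

A real pipeline would of course have to handle alternative splicing and genes whose exons span other genes, which is why only expression data across many conditions — not adjacency alone — can resolve the structure.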
Gene prediction by computational analysis alone is inadequate, according to Chapman. While she admitted that Celera’s and HGP’s total number may be in the ballpark, she said that the predictions are highly inaccurate and contain an “extremely high” level of false positives and false negatives that essentially cancel each other out.
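Chapman's point about errors canceling can be illustrated with simple arithmetic (all of the figures below are invented for illustration): a predictor that misses many real genes while inventing a similar number of spurious ones can report a nearly correct total even though a large fraction of its individual calls are wrong.

```python
# Hypothetical illustration of false positives and false negatives
# canceling out in an aggregate gene count. All numbers are invented.
true_genes = 700        # actual genes on a hypothetical chromosome
false_negatives = 200   # real genes the predictor misses
false_positives = 180   # spurious predictions it adds

predicted_total = true_genes - false_negatives + false_positives
correct_calls = true_genes - false_negatives

print(predicted_total)                       # → 680, close to the true 700
print(round(correct_calls / predicted_total, 2))  # → 0.74: only ~3 in 4 calls are real
```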
Chapman said that Rosetta is “highly confident” in Genecipher’s ability to accurately identify genes as well as gene variants. She said that Rosetta expects to arrive at a definitive gene count for the entire human genome by the end of the year and that commercialization plans for Genecipher would be announced in the next couple of weeks.
An alternative approach to gene determination appeared in a paper published last week in Genome Biology by Bo Yuan and his colleagues at Ohio State University.
Yuan also criticized the wholly computational approach taken by Celera and the HGP to predict genes. “I think the public consortium should have used more information to confirm genes,” he said. “Their information is not complete.”
The OSU team integrated publicly available transcript data, protein data, and mapping information, and used three different gene prediction programs, to identify 854 genes on chromosome 22 and to estimate a total of 65,000-75,000 genes in the entire genome. Like Rosetta, Yuan said, the team focused on transcript and exon data to establish the existence of the genes.
Using the public data, Yuan said the OSU researchers determined that the total base pair length of exonic sequence comprises 4 percent of the genome, rather than the 1-1.5 percent that Celera and the HGP found.
“We used the entirety of the transcript information, compared to dbEST, which was the only database used in the public consortium,” Yuan said. “The difference is that even though dbEST might contain most if not all of the information for transcript representation in the genome, the shorter ESTs are not as efficiently placed in the genome.”
Yuan said that current gene prediction programs have a 30 percent false negative rate and a “very high” false positive rate. “The best approach,” Yuan said, “is a pre-assembly of the transcript information, which is then placed in the genome to use genomic landmarks to derive full-length cDNAs. You can’t just use a program.”
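The evidence-integration strategy Yuan describes — accepting a computational prediction only when independent transcript or protein data supports it — can be sketched in miniature. The coordinates, the single-evidence-source setup, and the simple interval-overlap test below are hypothetical simplifications of the OSU pipeline, which combined three predictors with transcript, protein, and mapping data:

```python
# Sketch: retain predicted genes only when supported by independent
# evidence, such as an aligned transcript overlapping a predicted exon.
# All coordinates and data are invented for illustration.

def overlaps(a, b):
    """True if two (start, end) genomic intervals overlap."""
    return a[0] < b[1] and b[0] < a[1]

def confirm_predictions(predicted_exons, evidence_intervals):
    """Keep predicted genes with at least one evidence-supported exon.

    predicted_exons: {gene_id: [(start, end), ...]} from a gene predictor.
    evidence_intervals: [(start, end), ...] from aligned transcripts/proteins.
    """
    confirmed = []
    for gene, exons in predicted_exons.items():
        if any(overlaps(ex, ev) for ex in exons for ev in evidence_intervals):
            confirmed.append(gene)
    return confirmed

predictions = {
    "geneA": [(100, 200), (300, 400)],
    "geneB": [(900, 1000)],   # no supporting evidence: dropped
}
transcripts = [(150, 350)]    # one aligned transcript
print(confirm_predictions(predictions, transcripts))  # → ['geneA']
```

Yuan's deeper point is the step this sketch skips: pre-assembling overlapping transcripts into longer contigs before aligning them, so that short ESTs that place poorly on their own still contribute evidence.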
The claims of Rosetta and the OSU team won’t sway the HGP, however. “We’re very strongly standing by our figure,” said Tim Hubbard of the Sanger Center. “It’s very easy to be fooled by fake ESTs and matches that just aren’t real. There are also quite a lot of pseudogenes.”
“Chromosome 22 has been studied very carefully experimentally,” Hubbard said, “and of course there have been quite a lot of different groups coming along saying they’ve got more evidence that there’s some more genes, and in the cases where we’ve looked at that data, none of it’s stood up at all.”
Hubbard noted that OSU’s paper had yet to be peer-reviewed.
But rather than dwell on the potential limits of current bioinformatics tools to accurately predict the number of genes, others in the industry speculated about the best way for bioinformatics to exploit a relatively gene-poor genome.
Jean-Michel Claverie, of the CNRS Structural and Genetic Information Laboratory in Marseille, France, wrote in a paper that appeared in Science that the low gene count could signal the “beginning of the end” of genomics, since the lower number of potential drug targets could significantly shorten the life span of the industry.
However, the low tally was good news for those already developing post-genomic bioinformatics tools. “It seems almost too good to be true,” said Leigh Anderson, president of Large Scale Biology’s proteomics subsidiary in Rockville, Md.
Proteomics is one area positioned to blossom as it becomes clear that genomics alone will not paint the full portrait of human biological complexity. Celera has already announced plans to accelerate its proteomics efforts, along with a number of other companies, and the International Human Proteome Organization was launched, in timely fashion, the week before the sequence papers were published.
“I’m doubtful that [the actual number of genes] will turn out to be such a low number,” said Anderson, “but I’ll be very happy if it is.”
Fewer genes means that the number of resulting proteins is less daunting, Anderson said. In addition, he said, it supports the view that diseases and drug effects can only be attacked directly at the protein level.
The shift toward proteomics should stimulate the development of more quantitative analysis software, as opposed to the sequence and structure analysis tools that have comprised the bulk of bioinformatics until now, Anderson said. “That side of bioinformatics is exploding,” he said, “because we’ve really got to understand how all of the fine control is achieved in these biological systems and that’s going to be found out by making millions of quantitative measurements.”
Large Scale Proteomics has already started implementing new methods to search its mass spectrometry-based protein characterization data against the whole genome, Anderson said.
Others don’t see proteomics as the next logical post-sequence step, however. Said Rosetta’s Chapman, “Proteomics tools just aren’t ready yet.” Instead, she said Rosetta would first focus on obtaining an accurate and complete picture of the expression pathways that they discover using Genecipher. Rosetta’s technology “allows you to identify more than one transcript coming from one gene,” said McConnell, “so that could be the next level of complexity and not the protein.”
Members of the Biopathways Consortium are gearing up for an anticipated jump in focus toward the complex interactions between gene products. “Complexity is not measured linear to the number of genes or proteins,” said Eric Neumann, a BPC founder. “30,000 genes are still a lot if you don’t know really how they are each regulated and how they ‘work together.’”
Vincent Schachter, another BPC founder, added that Claverie’s “beginning of the end” scenario “assumes, among other things, that leads will be developed ‘for’ candidate genes — instead of, say, proteins, or even specific domains of specific proteins. If one doesn’t make that assumption, then it is not clear to me why the number of genes should significantly impact target and lead discovery.”
Neumann said the consortium is willing to support pathway and interaction informatics, “but it will depend on what researchers in academia and industry decide is the next main set of tasks for them.”
But while some are busy determining exactly what those tasks will be, OSU’s Yuan isn’t ready to abandon the genome sequence yet. He said the next step for bioinformatics should be the creation of a comprehensive gene index using a global approach that synthesizes all the currently available computational and experimental gene prediction methods.
Yuan envisions the draft of the human genome sequence as “the single arbiter” that will “be the single consensus for everything else” — whether it contains 30,000 or 100,000 genes.