Computational gene prediction is known to be an uncertain process, but deciding when, and if, a predicted gene has actually been verified turns out to be a bit of a black art itself.
This June, Burlington, Mass.-based AnVil entered a collaboration with Applied Biosystems to experimentally validate genes in the Celera Discovery System that were predicted by computational approaches. John McCarthy, director of discovery informatics at AnVil, said that the company has so far worked through around 8,000 of the predicted genes, about one-third of the total set.
How’s it going? Well, it turns out that answering that question is trickier than it may seem.
AnVil is using microarray and TaqMan expression experiments to verify the predicted genes, with probes and PCR primers supplied by ABI. But the success rate so far is difficult to assess because “verification is questionable,” said Pat Hoffman, an AnVil scientist working on the project. “Probes can be verified, but you don’t know if the whole gene is verified. If you have three to four probes per gene and 50 percent are verified, it doesn’t mean the whole gene is present.”
In addition, he said, it’s “not an apples-to-apples” comparison between the microarray-based and TaqMan-based methods, which don’t check the same regions of the gene. The correlation between the two techniques is around 70 percent, Hoffman said.
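Hoffman’s point about probes can be made concrete with a small sketch. The Python below is purely illustrative: the gene IDs, the probe results, and the three calling rules ("any", "majority", "all") are hypothetical and are not AnVil’s or ABI’s actual method. It only shows how the same probe-level data can yield very different gene-level validation rates depending on how strictly a "verified gene" is defined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneResult:
    gene_id: str            # hypothetical identifier, for illustration only
    probe_hits: List[bool]  # True if that probe detected expression


def probe_fraction(gene: GeneResult) -> float:
    """Fraction of a gene's probes that were verified."""
    if not gene.probe_hits:
        raise ValueError(f"{gene.gene_id} has no probes")
    return sum(gene.probe_hits) / len(gene.probe_hits)


def gene_call(gene: GeneResult, rule: str) -> bool:
    """Gene-level call under an assumed rule (not the project's real criterion)."""
    frac = probe_fraction(gene)
    if rule == "any":        # at least one probe verified
        return frac > 0
    if rule == "majority":   # strictly more than half of the probes verified
        return frac > 0.5
    if rule == "all":        # every probe verified
        return frac == 1.0
    raise ValueError(f"unknown rule: {rule}")


def validation_rate(genes: List[GeneResult], rule: str) -> float:
    """Share of genes called verified under the given rule."""
    return sum(gene_call(g, rule) for g in genes) / len(genes)


# Made-up probe results for four made-up genes.
EXAMPLE = [
    GeneResult("G1", [True, True, True, True]),
    GeneResult("G2", [True, True, False, False]),  # 50 percent of probes verified
    GeneResult("G3", [False, False, False]),
    GeneResult("G4", [True, False, False]),
]

if __name__ == "__main__":
    for rule in ("any", "majority", "all"):
        print(rule, validation_rate(EXAMPLE, rule))
```

In this toy data, the gene with half its probes verified counts as validated under the loosest rule but not under the other two, which is the ambiguity Hoffman describes.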
As a ballpark estimate, Hoffman guessed that AnVil has seen a validation rate of around 30 percent for the genes studied so far.
The best way to validate the computationally predicted genes would be a gene-by-gene approach, but that would be infeasible for a set this large. The AnVil/ABI approach, as uncertain as it may seem, is the best high-throughput method available for confirming the existence of genes with no biological evidence, McCarthy said.
ABI estimated that unconfirmed genes with almost no supporting evidence make up between 10 percent and 20 percent of the human genome in CDS.
“A lot of work still needs to be done in trying to predict genes from sequence. It’s not a solved problem,” said Hoffman.