I was shocked when I heard the news. "Long-Held Beliefs Are Challenged By New Human Genome Analysis," screamed the New York Times that fateful day in February. "Celera and the [International Human Genome Sequencing Consortium] each say they can find only 30,000 or so human genes."
If only 30,000 genes were found, then 70,000 were missing! Had the murderous Moriarty or dastardly Dr. Evil stolen these gems? Or was it all just a big misunderstanding?
This was a case worthy of the great Sherlock Holmes or Austin Powers. But as neither of these expert crime fighters was available, I had no choice but to take the case myself.
Join me as I explore this mystery and learn once again that truth is stranger than fiction. What's really shocking is not the number of genes, but the astonishing discrepancies between EST and genomic datasets.
Scientists long suspected that the human genome contained 100,000 or more genes. The main evidence for this comes from EST clusters. UniGene, the gold-standard public database of clustered ESTs, contains 95,569 clusters (as of build #133). Two big commercial EST houses, Incyte and Human Genome Sciences, claim to have evidence for even more genes: 120,000 and 140,000, respectively.
The challenge is to reconcile the large number of EST clusters with the much smaller number of genes found in the genome sequence.
Clearly, some ESTs are counterfeit. The biggest concern is contamination from genomic DNA or unspliced mRNA that somehow leaked into the EST sequencing process. Though no one can give me a solid estimate for how often it occurs, most agree it's common.
Contamination presumably occurs at random, and the conventional remedy is to distrust EST clusters that contain just a single EST (called singletons). Of the 95,000+ clusters in UniGene, 35,197 are singletons, leaving 60,000 or so with multiple ESTs. That's still twice the number of genes found by the genome sequencers.
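For readers who like to check the sums, here is the singleton arithmetic as a few lines of Python. The cluster counts are the ones quoted above; the figure of 30,000 genes is the round number from the news reports, so the ratio is only approximate.

```python
# Figures quoted in the text for UniGene build #133.
total_clusters = 95_569
singleton_clusters = 35_197
genome_gene_estimate = 30_000  # "30,000 or so": a round number, not an exact count

# Clusters that survive the "distrust singletons" rule.
multi_est_clusters = total_clusters - singleton_clusters
ratio = multi_est_clusters / genome_gene_estimate

print(f"clusters with multiple ESTs: {multi_est_clusters:,}")
print(f"ratio to genome gene count: {ratio:.1f}x")
```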
Armed with the genome sequence, we can further check whether an EST is spliced or unspliced. As most ESTs are longer than typical exons, most should come from spliced transcripts and should match the genome in multiple pieces. ESTs that only match the genome in a single chunk are more likely to be bogus.
We can get more clues by looking for ESTs that match widely separated regions of the genome, such as different chromosomes. This may reflect legitimate biology, such as conserved gene families. Or it may finger an EST as containing a repetitive element.
Marty Gollery of TimeLogic gave me a super clue about checking for phonies. NCBI's FTP site maintains a file of representative UniGene sequences, which contains the longest high-quality sequence from each cluster. Marty BLASTed these sequences against the genome using TimeLogic's hardware-accelerated Tera-BLAST program.
I analyzed Marty's results using a simple R program and found that 30 percent of sequences hit multiple chromosomes. Of the sequences that hit only one chromosome, about 45 percent seem to be unspliced, because they match just one spot on the chromosome. This leaves about 30,000 sequences that look like genuine, spliced transcripts based on this very simple analysis. This number is suspiciously close to the number of genes found by the genome guys, but I'm sure it's just a coincidence.
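The sorting rule is simple enough to sketch in code. What follows is an illustration in Python, not the R program mentioned above. The input format (tuples of query ID, chromosome, start, end) and the function name classify_hits are made up for the example, and it assumes that overlapping hits on one chromosome count as a single "spot".

```python
from collections import defaultdict

def classify_hits(hits):
    """Label each query as 'multi_chromosome', 'unspliced', or 'spliced'.

    hits: iterable of (query_id, chromosome, start, end) tuples.
    This tuple layout is a hypothetical format chosen for the sketch.
    """
    by_query = defaultdict(list)
    for query_id, chrom, start, end in hits:
        by_query[query_id].append((chrom, start, end))

    labels = {}
    for query_id, blocks in by_query.items():
        chroms = {chrom for chrom, _, _ in blocks}
        if len(chroms) > 1:
            labels[query_id] = "multi_chromosome"
            continue
        # Merge overlapping blocks so repeated hits to one spot count once
        # (an assumption about what "one spot" means).
        spots = []
        for _, start, end in sorted(blocks, key=lambda b: b[1]):
            if spots and start <= spots[-1][1]:
                spots[-1][1] = max(spots[-1][1], end)
            else:
                spots.append([start, end])
        labels[query_id] = "unspliced" if len(spots) == 1 else "spliced"
    return labels

# Toy data, invented for illustration only.
example_hits = [
    ("seqA", "chr1", 100, 250),
    ("seqA", "chr1", 900, 1050),  # two separate blocks on one chromosome
    ("seqB", "chr2", 500, 800),   # a single block
    ("seqC", "chr3", 10, 200),
    ("seqC", "chr7", 40, 230),    # hits on two chromosomes
]
labels = classify_hits(example_hits)
print(labels)
```

A real analysis would also want score or identity cutoffs on the hits before counting them; the sketch leaves those out.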
Another factor that affects the count is under-clustering, in which two or more EST clusters really come from the same gene but the clustering process missed the overlap. This might happen because the sequences come from different parts of the gene, or because the overlap is too small to be caught by the clustering software. Subtle biological effects, such as alternative splicing and alternative transcriptional termination, make this problem worse.
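One crude way to hunt for under-clustering is to flag EST clusters that sit close together on the genome. The Python sketch below does that under loud assumptions: the input tuples (cluster ID, chromosome, start, end), the function name group_nearby_clusters, and the 10,000-base gap are all invented for illustration, and closeness alone proves nothing, since distinct genes can be neighbors.

```python
def group_nearby_clusters(clusters, max_gap=10_000):
    """Return groups of cluster IDs whose genomic footprints lie close together.

    clusters: iterable of (cluster_id, chromosome, start, end) tuples
    (a hypothetical format). max_gap is an arbitrary illustrative threshold.
    """
    groups = []
    last_chrom, last_end = None, None
    # Walk the clusters in genomic order, chaining neighbors within max_gap.
    for cid, chrom, start, end in sorted(clusters, key=lambda c: (c[1], c[2])):
        if chrom == last_chrom and start - last_end <= max_gap:
            groups[-1].append(cid)
            last_end = max(last_end, end)
        else:
            groups.append([cid])
            last_chrom, last_end = chrom, end
    # Only groups with two or more clusters are under-clustering candidates.
    return [g for g in groups if len(g) > 1]

# Toy coordinates, invented for illustration only.
clusters = [
    ("c1", "chr4", 1_000, 3_000),
    ("c2", "chr4", 5_000, 9_000),
    ("c3", "chr4", 200_000, 203_000),
    ("c4", "chr17", 1_000, 2_000),
]
candidates = group_nearby_clusters(clusters)
print(candidates)
```

Anything flagged this way is a lead to follow up, not a verdict.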
There's a nice discussion of under-clustering and other gene counting problems in a June 2000 Nature paper from TIGR, which, by the way, estimates 120,000 genes based on the data then available.
There are also plenty of reasons to question the number and quality of the genes found in the genome. Gaps and errors in the sequence assembly are obvious sources of trouble.
Genome annotation is still a young field, and the methods are imperfect. Two recent papers make this point rather forcefully. Terry Gaasterland and colleagues writing in Nature Genetics found more than 1,000 genes that had been missed by the much-heralded jamboree that annotated the Drosophila genome. Sam Karlin and associates in Nature found that many genes annotated by the jamboree (45 percent of those checked) differed by more than one percent from their previously published sequences.
Casual surfing of the genome reveals many cases in which predictions made by Ensembl, Affymetrix (actually the Neomorphic guys using Genie), and Softberry all differ.
Scouting the crime scene
To get to the bottom of the case, I decided to inspect the evidence myself.
I used Jim Kent's Genome Browser to examine the public data. I found Kent's browser to be more effective than the Ensembl or NCBI browsers for displaying long lists of aligned mRNAs and ESTs in a region.
I also used Compugen's Gencarta, which the company kindly provided for this purpose. Raveh Gill-More at Compugen assisted me in the use of their product.
The Gencarta people provided me a list of 100 EST clusters chosen at random. I studied a few that looked interesting.
AA000972. Gencarta showed eight clusters across a 10kb region. It is implausible that these could all represent unique genes, suggesting considerable under-clustering. Raveh pondered these for me and observed that the eight clusters contained numerous ESTs on both strands, suggesting that the region might contain two overlapping genes on opposite strands. An alternative explanation is that the orientations of half the ESTs are wrong.
Kent's browser showed two abutting, but not overlapping, genes on opposite strands: TC10 and PIGF. The EST in question and many others fell inside an intron of TC10, and were sprinkled like footprints across the whole length of the intron. The browser also showed several spliced ESTs spanning the region on both strands.
Sadly, this confused picture is typical of what I found throughout my investigation.
AA058952. Gencarta showed five clusters in a 10kb region, suggesting under-clustering again. Kent's browser showed a known gene, APG5L, ending about 15kb upstream, and numerous ESTs associated with that gene. Interestingly, many of those ESTs were reported on opposite strands.
AW881268. This one started out looking legit, but ended up as another mess. Gencarta showed this as a spliced EST in a cluster with a second EST that was also spliced. The two ESTs represented alternative splice forms.
UniGene placed the EST in the same cluster as hypothetical protein FLJ10979, whose transcript was sequenced recently by the NEDO cDNA sequencing project in Japan. On Kent's browser, it didn't look like the exons of the EST overlapped the exons of this gene, so I wondered whether the EST had been clustered correctly. I downloaded the sequences from the UniGene cluster, and used Webb Miller's sim4 alignment program to find the sequences that linked AW881268 to FLJ10979. Lo and behold, the linking sequences were present in UniGene but missing from Kent's browser.
Kent's browser also showed a pile of ESTs near one end of our EST that UniGene reported as belonging to another gene, RPS4X, which maps to a different chromosome.
W04948. Gencarta showed this as a spliced EST with several exons in a cluster containing three full-length mRNAs and 25 ESTs. These sequences yielded three different splice forms. There were also 12 other clusters in the general vicinity, indicative of under-clustering.
Kent's browser showed it as an unspliced EST. There were two non-overlapping genes nearby with a trail of unspliced ESTs in the intergenic region between them. There were also lots of spliced ESTs nearby. It looked like the EST overlapped a full-length mRNA that UniGene said belonged to SMARCE1, but the EST itself was in a different UniGene cluster. This might be another example of under-clustering.
BF380844. Gencarta showed this EST overlapping with a cluster named T05370. T05370 turned out to be a huge cluster containing three full-length mRNAs and 461 ESTs with 100 different splice forms.
Kent's browser did not contain the original EST, but it had plenty of ESTs from the overlapping cluster. UniGene reported this cluster to be a known gene, SMARCA4. The data for this cluster sparkled with spliced ESTs arrayed along the length of the gene, defining alternative splice forms.
But even this great example was flawed. Nestled inside an intron was an EST, C01338, which UniGene clustered with a different gene, FPR1, on a different chromosome.
More dirty data
I did similar analyses on 10 sequences chosen at random from NCBI's list of representative UniGene sequences.
Each sequence had some serious problem, but I wasn't able to group the errors into neat categories. A lot of ESTs fell into introns of other genes. Are these bogus unspliced mRNAs or legitimate alternate splice forms? A lot of ESTs were close to each other on the genome, but didn't overlap. This smells like under-clustering.
Several of the UniGene ESTs were not present in Kent's browser, though they could be found by searching the genome sequence using Kent's BLAT tool. Conversely, many of the ESTs visible in Kent's browser were not present in UniGene.
In one case, AA748182, there were four EST clusters in a 30kb region, two of which belonged to different known genes, ENAM and SLU7, while the other two were just hanging out. If the data for ENAM and SLU7 are correct, this shows that genes can really live very close to each other. Hmm. Maybe we shouldn't be so quick to blame under-clustering when we find EST clusters close to each other.
It's a tough case, all right. The EST and genome datasets are both a bit shady. It will take skilled detective work to sift the real diamonds from the paste.
If I were using these data for real, I wouldn't trust any of the databases. I'd start from scratch: reassemble the genome in my region of interest, validate every EST and mRNA sequence that falls in that region, and re-run all the gene predictions. It's a lot of work, but at least I'd know what evidence to believe.
EST Sleuthing Tools
Jim Kent's Genome Browser and BLAT search tool
NCBI's file of representative UniGene sequences