To divine the meaning of the human genome sequence, Nat Goodman proposes new wizardry in the form of a Human Transcriptome Project.
Even with the Book of Life sitting open in front of our eyes, words of wisdom are hard to find.
It’s time to start the Human Transcriptome Project — to pour the same time and energy into demystifying the Book as was put into wresting it from Mother Nature’s grip.
This task will take multiple years and many dollars. And no doubt it will be plagued by the same acrimonious debates we heard throughout the Human Genome Project — big science vs. small, private vs. public.
But the alternative — to leave the Book open without a means for reading it — is too cruel to bear.
I propose the following goals for a Human Transcriptome Project:
(1) Produce a comprehensive list of genes. More precisely: Divide the genome into regions, each of which contains all exons of a single gene.
(2) Identify full-length sequences of all transcripts.
(3) Determine where, when, and under what conditions each transcript is expressed.
(4) Measure expression levels of all transcripts under a variety of conditions.
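The four goals above imply a simple data model. Here is a minimal sketch of it; all class names, fields, and values are invented for illustration:

```python
# A minimal sketch of the data model the four project goals imply.
# All class names, fields, and values here are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Transcript:
    transcript_id: str
    sequence: str                                   # goal 2: full-length sequence
    exon_spans: list                                # exons as (start, end) genomic coordinates
    expression: dict = field(default_factory=dict)  # goals 3 & 4: condition -> level

@dataclass
class Gene:
    gene_id: str
    region: tuple                                    # goal 1: (chromosome, start, end)
    transcripts: list = field(default_factory=list)  # the gene's splice variants

gene = Gene("GENE0001", ("chr1", 1000, 9000))
gene.transcripts.append(
    Transcript("GENE0001-a", "ATG...", [(1000, 1200), (8000, 9000)], {"liver": 7.5})
)
print(len(gene.transcripts))  # 1
```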
Like I said, this is going to be a pricey and painful process.
Gene is the magic word in modern biology. It’s a highly nuanced noun like “truth.” Ten years ago, it commonly meant “genetic locus” — a region of the genome linked to a disease or other phenotype. Over time, biologists became more comfortable thinking of a gene as a transcribed region of the genome that results in a functional molecular product.
In its published human genome paper, Celera defines a gene as “a locus of cotranscribed exons” in order to emphasize the importance of alternative splicing. Ensembl’s Gene Sweepstake Web page takes the definition to new depths: “A gene is a set of connected transcripts. … Two transcripts are connected if they share at least part of one exon in the genomic coordinates.”
Implicit in the new definitions of a gene is a belief that the genome can be partitioned into regions such that all exons in a given region belong to a single gene. These regions are the “loci” of Celera’s definition. A theoretically possible alternative is that the genome might contain long chains of overlapping transcripts in which the first transcript overlaps the second, which overlaps the third, but the first and third don’t overlap. I’m not aware of any such examples, but if they exist, then all bets are off.
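Under Ensembl’s definition, the partition can be computed mechanically as connected components of the exon-overlap relation. A minimal sketch with invented exon coordinates, which also shows why chains are worrisome — a chain of overlaps glues everything in it into one “gene”:

```python
# Sketch: group transcripts into "genes" as connected components of the
# exon-overlap relation (Ensembl's definition). Each transcript is a list
# of (start, end) exon intervals; all coordinates here are made up.

def exons_overlap(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]

def transcripts_connected(t1, t2):
    return any(exons_overlap(e1, e2) for e1 in t1 for e2 in t2)

def cluster_genes(transcripts):
    """Union-find over transcripts; each resulting cluster is one 'gene'."""
    parent = list(range(len(transcripts)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i in range(len(transcripts)):
        for j in range(i + 1, len(transcripts)):
            if transcripts_connected(transcripts[i], transcripts[j]):
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(transcripts)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# A chain: t0 overlaps t1, t1 overlaps t2, but t0 and t2 don't --
# the definition still merges all three into a single "gene".
ts = [[(0, 100)], [(50, 150)], [(140, 200)]]
print(cluster_genes(ts))  # [[0, 1, 2]]
```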
Eenie, genie, minie, moe
Like many, I am struck by the small number of genes reported by the two sequencing efforts. Their 30,000-gene estimates are surprisingly low, because the EST databases suggest a much higher number. UniGene has about 60,000 human EST clusters, excluding clusters that contain just single ESTs; Incyte and Human Genome Sciences claim even more.
I’m not convinced by the arguments in the genome papers. Neither team seems to have analyzed enough ESTs to decisively explain how so many EST clusters boil down to so few genes. Someone ought to do this analysis. A contrary result would be a real headline grabber, while a result confirming the low estimates would help characterize the pitfalls in EST sequencing. Of course, this is easier said than done, which is why the two sequencing teams didn’t get around to it.
The flip side of the low gene number is the greater-than-expected importance of alternative splicing.
The public project analyzed two well-studied regions and reports 2.6 to 3.2 splice variants per gene, with 59 percent of genes having more than one. It also suggests that these numbers are probably underestimates.
Celera emphasizes the significance of this phenomenon in its gene definition: “A single gene may give rise to multiple transcripts … by means of alternative splicing and alternative transcription initiation and termination sites.”
Conventional wisdom used to be that alternative splicing was a second-order effect. Scientists felt that it was okay to worry about genes first and think about splice variants later, if at all. People happily talked about the sequence of a gene as if it were a singular thing. Even more telling, folks cheerfully ran zillions of microarray experiments, without worrying about the splice variants of the genes printed on their chips. If most genes really have multiple splice variants, this attitude has to change fast.
The next mind bender is to discover how many of the theoretically possible transcripts actually occur in nature. Consider a gene with 10 exons. If we assume simplistically that a given exon must be completely present or completely absent from a transcript, one can construct 2^10 (or 1,024) transcripts from these 10 exons. It won’t be so hard to cope if only a few of these theoretically possible transcripts actually arise. If the number turns out to be a lot bigger, say half of the possible transcripts, then we’ll need some real magic to unravel the mess.
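The exon arithmetic above is easy to check in a few lines (the exon labels are arbitrary):

```python
# The combinatorics from the text: with 10 exons, and each exon either
# included in or skipped from a transcript, there are 2**10 possible
# subset "transcripts" (counting the empty subset).
from itertools import combinations

n_exons = 10
print(2 ** n_exons)  # 1024

# Enumerating the subsets explicitly gives the same count:
exons = list(range(1, n_exons + 1))
variants = [c for k in range(n_exons + 1) for c in combinations(exons, k)]
print(len(variants))  # 1024
```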
Getting full-length transcripts is itself a hard job. The vast majority of full-length transcripts in the sequence databases today were produced by individual scientists working in traditional small laboratories. You know what this means: Grad students and post-docs busting their butts! Large-scale, systematic sequencing efforts have had little effect to date.
Japan’s RIKEN Genomic Sciences Center is trying to change this through a pioneering project to sequence and annotate a large number of full-length mouse transcripts. A snapshot of the center’s results, published in Nature February 8, 2001, highlights the promise and difficulty of the approach. RIKEN sequenced 21,076 clones using methods that were carefully crafted to minimize redundancy and maximize the number of full-length results. Despite these precautions, only about 60 percent of clones were found to be unique, and only about 60 percent were full-length.
The Japanese team annotated its sequences using state-of-the-art methods (which, by the way, are nicely described in the paper) and was able to successfully annotate about 55 percent of the sequences.
I randomly sampled 10 entries from RIKEN’s FANTOM database to get a sense of what their data look like. Only two of the 10 sequences were clearly full length, but four others were “long” and had strong matches to other GenBank sequences. These six sequences were all successfully annotated, while none of the four short ones was. Of the six successful annotations, three identified matches to genes having some degree of biological characterization; of these, one was a match to a known mouse gene while two represented new mouse genes that were homologs of human genes.
RIKEN’s experience shows how hard it is even for an elite team to get full-length transcripts. On the bright side, though, the results also suggest that if one can get full-length or long sequences, they can be successfully annotated.
Sorcerers and Sages
The hope going forward is that we can use the genome sequence as a crutch to find full-length transcripts more easily. As a quick test, I used the genome to look for new splice variants of a known gene, working with human caspase-1 (CASP1), for which five splice variants are reported in the literature.
The UniGene cluster for CASP1 contains 46 ESTs. I aligned these against the known CASP1 transcripts and found two that didn’t fit. I followed up one of these, AV713637, using Jim Kent’s Genome Browser to compare the alignments of the known CASP1 transcripts and this EST against the genome.
Lo and behold, the EST contains an intron in a place where the existing ones don’t. Of course, I can’t tell whether it’s a real intron or just a deletion in the EST, and sadly for the new intron hypothesis, the sequences at the putative splice junction don’t match the most common consensus. Still, it’s clear that having the genome is a big help.
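The consensus check in this anecdote is mechanical: the vast majority of introns begin with GT and end with AG (the “GT-AG rule”). A sketch, with an invented genomic sequence and coordinates:

```python
# Sketch: test whether a putative intron obeys the canonical GT-AG rule.
# The genomic sequence and coordinates here are invented for illustration.

def is_canonical_intron(genomic_seq, intron_start, intron_end):
    """True if the intron starts with GT and ends with AG.
    Coordinates are 0-based, end-exclusive, on the genomic sequence."""
    intron = genomic_seq[intron_start:intron_end].upper()
    return intron.startswith("GT") and intron.endswith("AG")

seq = "CCAAGGTAAGTTTCTGCTTCTAGGAACC"
#           ^ putative intron spans positions 5 to 23 (exclusive)
print(is_canonical_intron(seq, 5, 23))  # True
```

A real pipeline would also check the minor GC-AG and AT-AC splice classes, but the principle is the same.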
Rosetta published a spectacular way to use the genome as a crutch (Nature, February 12, 2001). The company constructed custom microarrays containing oligos covering large numbers of predicted exons and assessed whether cells contained transcripts corresponding to these oligos. While this method cannot directly determine which full-length transcripts exist, Rosetta reports that it can compute this by clustering exons on the basis of genomic location and expression pattern.
The method successfully detected 85 percent of known genes on chromosome 22q, and detected at least one exon in 57 percent of genes that were predicted by Genscan without any prior experimental evidence. Genome wide, the method detected 58 percent of exons that Ensembl lists as confirmed by experimental evidence, and 34 percent of exons listed as predicted without such evidence. These numbers are impressive and suggest that the combination of gene prediction and microarray confirmation could be a practical way to explore the transcriptome.
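Rosetta’s clustering step can be sketched as follows: exons that sit close together on the genome and whose expression profiles rise and fall together across conditions are grouped into one candidate transcript. All positions, profiles, and thresholds below are invented:

```python
# Sketch of Rosetta-style exon clustering: exons that are nearby on the
# genome AND show correlated expression across conditions are grouped into
# one candidate transcript. All values below are fabricated.

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def cluster_exons(exons, max_gap=50_000, min_corr=0.9):
    """Greedy single pass over exons sorted by genomic position: join an
    exon to the previous cluster if it is close by and co-expressed."""
    clusters = []
    for pos, profile in sorted(exons):
        if clusters:
            last_pos, last_profile = clusters[-1][-1]
            if pos - last_pos <= max_gap and pearson(profile, last_profile) >= min_corr:
                clusters[-1].append((pos, profile))
                continue
        clusters.append([(pos, profile)])
    return clusters

exons = [
    (1000, (1.0, 5.0, 2.0)),    # candidate gene A
    (3000, (1.1, 5.2, 1.9)),    # nearby and correlated: joins gene A
    (90_000, (4.0, 1.0, 6.0)),  # far away, different profile: gene B
]
print(len(cluster_exons(exons)))  # 2
```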
Ultra-high-throughput sample sequencing, such as SAGE (serial analysis of gene expression), is yet another strategy for identifying transcripts. The basic idea is to sequence huge numbers of small snippets and use computation to glue them back into a picture of the transcriptome. This approach hasn’t caught on, but it may become more practical now that we can use the genome sequence to interpret the data.
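The snippet idea can be made concrete. In classic SAGE, each tag is the 10 bases immediately downstream of the 3'-most anchoring-enzyme site (NlaIII cuts at CATG), and tallying tags profiles the transcriptome. A sketch with invented transcript sequences:

```python
# Sketch of SAGE tag extraction and counting: take the 10 bases following
# the 3'-most CATG (NlaIII) site of each transcript, then tally the tags.
# The transcript sequences below are invented.
from collections import Counter

def sage_tag(transcript, anchor="CATG", tag_len=10):
    """Return the tag after the 3'-most anchor site, or None if the
    transcript has no site or too little sequence after it."""
    pos = transcript.rfind(anchor)
    if pos == -1:
        return None
    start = pos + len(anchor)
    tag = transcript[start:start + tag_len]
    return tag if len(tag) == tag_len else None

transcripts = [
    "AAACATGGTTACCGGTTAAAAA",  # tag GTTACCGGTT
    "CCCCATGGTTACCGGTTCCCCC",  # same tag: same gene, seen twice
    "GGGCATGTTTTTTTTTTGGGGG",  # tag TTTTTTTTTT
]
counts = Counter(t for t in map(sage_tag, transcripts) if t)
print(counts.most_common())  # [('GTTACCGGTT', 2), ('TTTTTTTTTT', 1)]
```

Mapping each observed tag back to its position in the genome sequence is what could turn these counts into gene discoveries.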
There’s a lot of work here, but if we are to become fluent in the Book of Life, it must be done.
What’s more, the Human Transcriptome Project has to be ignited in the public sector, just as the Human Genome Project was. Once underway, investors will open their Book of Checks to companies seeking to create the technologies required by the project, as well as companies striving to wring commercial value from the effort as quickly as possible.
And as the public effort nears the finish line, it will draw late entrants who will wave better wands and fly past the pack. Though a bitter potion for the hard-working sages of the public effort, it’s just what the wizard ordered for the billions of people who are clamoring for our success.
Websites of the Wise
Ensembl Gene Sweepstake
RIKEN’s FANTOM homepage
Jim Kent’s Genome Browser