The human genome is one year old. That is, if you believe life begins at publication. I thought it would be nice to drop in to see how the tyke is doing.
I’m pleased to report evidence of some healthy growth in year one. Human genome databases are over the hump on the basic technical issues. Assemblies are in decent shape, and annotations of known genes are technically solid. Now the challenges are scientific: to go beyond technically correct annotations to ones that are scientifically valid.
The Proud Parents: Celera, Sanger, EBI, and NCBI
There are two distinct assemblies of the public human genome and three complete annotations available to the public. In addition, proprietary assemblies and annotations of the public data are available from Compugen and DoubleTwist. And of course there’s Celera’s proprietary data, assembly, and annotation.
David Haussler’s group at the University of California at Santa Cruz produced the official assembly discussed in Nature, while Ensembl, a joint project between the European Bioinformatics Institute and the Sanger Centre led by Ewan Birney, generated the official annotation. A separate annotation done by Jim Kent, a graduate student in Haussler’s laboratory, has evolved into a major public resource.
Greg Schuler and colleagues at the US National Center for Biotechnology Information carried out a second assembly independently and also performed their own annotation. Their data are now well integrated with NCBI’s LocusLink and RefSeq databases, allowing easy movement between genes and genome.
I examined several example genes and genomic regions using all three public annotations. I also took a quick look at Celera’s product, and queried the database a few times through a human intermediary, Tony Kerlavage, the company’s senior director of product development, who was able to give me annotations but no actual sequences.
The examples I used include caspase-1, -4, and -5 (CASP1, CASP4, CASP5), well known, small genes that lie within 100 Kb of each other on a finished part of chromosome 11. I also looked at neurexin-1, -2, and -3 (NRXN1, NRXN2, NRXN3) — large genes that lie on disparate parts of the genome. My third example is the 1 Mb region between STSs RH43118 and RH9954, which is a largely unfinished region of chromosome 4.
Baby Genome’s First Checkup
The three public annotations as well as Celera list the three caspases in the order CASP4, CASP5, CASP1. This differs from the order of these genes in GeneMap ’99, the best pre-sequence map available. It seems fair to conclude that this is a case where the sequence beats the map.
I compared the two public assemblies of this region using Webb Miller’s excellent PipMaker program and found them to be identical over about 90 Kb. Celera’s assembly was 400 bases smaller.
I next looked at the assembly between STSs RH43118 and RH9954. The NCBI assembly contained 1,025,188 bases, while the Santa Cruz assembly contained about 40 Kb more. Since assemblers insert long runs of Ns to indicate gaps in the sequence, I squeezed these out before doing further analysis. This shrank the NCBI assembly to 1,024,549 bases, but shrank Santa Cruz a lot more, to 957,066 bases, which is about seven percent smaller than NCBI. Celera’s assembly was 1,688 bases smaller still.
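For readers who want to repeat this kind of check, here is a minimal Python sketch of the squeezing step and the size comparison. It is not the procedure actually used for the column. The function names and the min_run parameter are hypothetical, and the choice to treat both upper- and lower-case Ns as gap filler is an assumption. The toy sequence is invented for illustration; only the two squeezed assembly sizes come from the text above.

```python
import re


def squeeze_gaps(sequence: str, min_run: int = 1) -> str:
    """Remove runs of N (gap filler) that are at least min_run long.

    min_run is a hypothetical knob: the column does not say how long a
    run had to be before it was treated as a gap.
    """
    gap_pattern = re.compile(rf"[Nn]{{{min_run},}}")
    return gap_pattern.sub("", sequence)


def percent_smaller(smaller: int, larger: int) -> float:
    """How much smaller one size is than another, as a percentage of the larger."""
    return 100.0 * (larger - smaller) / larger


# Toy sequence (invented): two gap runs of 10 and 5 Ns between real bases.
toy = "ACGT" + "N" * 10 + "GGCC" + "n" * 5 + "TT"
squeezed = squeeze_gaps(toy)
print(len(toy), len(squeezed))  # 25 10

# Squeezed sizes reported in the text for the RH43118-RH9954 region.
ncbi_squeezed = 1_024_549
santa_cruz_squeezed = 957_066
print(round(percent_smaller(santa_cruz_squeezed, ncbi_squeezed), 1))  # 6.6
```

The 6.6 percent figure is the "about seven percent" quoted above. Raising min_run would leave isolated Ns in place, which may be preferable if single Ns mark ambiguous bases rather than gaps.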
Despite the difference in size, PipMaker revealed that the overall assemblies are quite similar. PipMaker’s dot plot showed a clear diagonal line of near identity, broken by some gaps, and two spots where the assemblies are inverted. The text output of the program revealed two long (>100 Kb) stretches of identity — these probably correspond to two finished BACs in the region — and many medium-sized (1-10 Kb) identical stretches.
Based on these examples, it seems that the finished regions of the genome are in fine shape. The unfinished regions still have problems, as one would expect, but the two assemblies are reasonably consistent.
Child-rearing Strategies
NCBI and Santa Cruz have adopted the sensible strategy of relying on NCBI’s RefSeq database as their source of “known genes,” including alternative splice forms.
This worked fine in most of my examples. RefSeq contains five splice forms for CASP1, three for CASP4, one for CASP5, and one for each of the neurexins. Almost all of these were annotated as expected. The only exceptions were CASP1, where NCBI inexplicably left out one of the five splice forms, and two cases (NRXN1, NRXN2) in which Santa Cruz missed entries that were newly deposited into RefSeq.
Ensembl does not rely on RefSeq but rather annotates genes based on its own predictions. The results on my examples were spotty. For CASP4, Ensembl listed only one splice form. For CASP1, it had four splice forms but did not indicate the correspondence between its predicted splice forms and the known ones. I did a little sequence hacking and concluded that Ensembl’s splice forms 1, 2, and 3 correspond to the known alpha, beta, and gamma forms, but all are missing about 25 bases at the beginning. Ensembl’s #4 is closest to the known epsilon form, but doesn’t match very well and is truncated on both ends.
Ensembl’s annotations of NRXN1 and NRXN3 differed rather dramatically from RefSeq, showing them to be much shorter in terms of transcript size, genomic size, and number of exons. It also reported two splice variants for NRXN2 and NRXN3, whereas RefSeq had only one.
Celera, which also does not use RefSeq for known genes, annotated four splice forms for CASP1, three for CASP4, and one for each of the neurexins. The annotations indicated that CASP1’s sequence was a shortened form of RefSeq’s beta splice form, CASP4 was identical in the coding region to RefSeq’s alpha form, and CASP5 was identical in the coding region to RefSeq’s version of this gene. NRXN1 and NRXN3 were substantially larger than in RefSeq; NRXN2 was about the same size, but with fewer exons.
Baby Books of Life
The strategy of relying on RefSeq doesn’t solve the problem of identifying genes but simply shifts it onto the shoulders of the RefSeq curators.
The identification of genes is fundamentally a biological problem. The sequence of a gene is a scientific conclusion based on evidence and reasoning, not merely data. Computation can help by locating potential coding regions, by coalescing information from ESTs and other transcribed sequences, and by comparisons to other organisms. But in the end, it takes biological expertise to meld all this into compelling biological “truth.”
A database of gene sequences is a form of scientific communication similar to a journal in many important ways. Like a journal, the database should not just report conclusions (i.e., the sequences of genes), but should also include the evidence and reasoning that lead to each conclusion. The database should give the name of the scientist stating the conclusion, both to give credit for a job well done and to help readers judge the credibility of the work. And there must be a peer review mechanism.
RefSeq has not yet stepped up to this level, and it showed in the results. For several of the CASP1 splice forms, RefSeq contained additional untranslated exons that are not present in the GenBank entries from which these were constructed. For CASP4, RefSeq omitted a fourth splice form that was present in GenBank. I’m willing to believe the RefSeq curators got these right, but without an explanation of what they did or why, this requires a leap of faith.
The neurexin situation was worse. The neurexins are fascinating genes with many exons, two known transcription start sites, and multiple independent splice sites. Experts believe that these genes have more than a thousand splice forms, although many fewer than this have yet been observed. None of this complexity was in RefSeq.
I spoke with Lee Rowen of the Institute for Systems Biology, whose team sequenced the genomic region containing NRXN3, and who is collaborating with Brent Graveley of the University of Connecticut and others to fully annotate these genes. She tells me that NRXN1 and NRXN3 are huge genes, spanning about 1.1 and 1.7 Mb, respectively, while NRXN2 weighs in at a “mere” 100 Kb. Each gene has about two dozen exons, with transcripts ranging from about 3,600 to almost 7,500 bases. Rowen also reported that RefSeq’s entries for NRXN1 and NRXN3 were missing the entire first half of each gene.
Celera’s annotations of NRXN1 and NRXN3 were much closer to Rowen’s truth, but still considerably different. The lesson is clear: general purpose annotators cannot possibly be expert in all genes. To do the job right, each gene family must be annotated by the scientists who know the family best.
Genome Grows Up
The young genome has grown nicely this year. The genome web sites at NCBI and Santa Cruz present a coherent picture, at least when working with known genes in finished regions. Unfinished regions are rougher, of course, but the two sites are reasonably consistent. Celera’s version looks the same too.
Growth spurts will come as more scientific expertise gets added to the process of gene identification, so that what’s annotated will reflect the best knowledge about every gene.
Baby genome’s second year promises to be as exciting as the first. The mouse and two pufferfish genomes are on the way, and the human genome will continue to get better. It should be a lot of fun watching these kids grow up.
Mouse and Puffer Baby Genomes on the Way
There’s more good news on the horizon. Two new young ’un genomes are on the way: mouse and pufferfish.
Celera has already declared victory on the mouse and has moved on to the wide world of proteomics and drug development. The public effort, which is actually a public-private partnership called the Mouse Genome Sequencing Consortium, is making good progress and has reached the rough draft stage.
Separately, two groups have taken on the pufferfish, whose compact genome (a mere 350 Mb) makes it an attractive target for comparative studies. One group, led by Sydney Brenner and including the US Department of Energy’s Joint Genome Institute, Singapore’s Institute of Molecular and Cell Biology, and Celera, netted Fugu rubripes, the famous Japanese delicacy.
The other group, led by Jean Weissenbach of Genoscope and Eric Lander of Whitehead, speared Fugu’s freshwater relative, Tetraodon. The Fugu team announced a complete draft sequence in October 2001. At year’s end, the Tetraodon folks were only slightly behind.
— NG
Baby Genome’s Caretakers
Celera www.celera.com
Compugen www.cgen.com
DoubleTwist www.doubletwist.com
Ensembl www.ensembl.org
GeneMap ’99 www.ncbi.nlm.nih.gov/genemap
Genoscope Tetraodon site www.genoscope.cns.fr/externe/tetraodon
Institute of Molecular and Cell Biology Fugu site www.fugu-sg.org
Joint Genome Institute Fugu site www.jgi.doe.gov/fugu
Mouse Genome Sequencing Consortium mouse.ensembl.org
NCBI www.ncbi.nlm.nih.gov
PipMaker bio.cse.psu.edu/pipmaker
Santa Cruz genome site genome.ucsc.edu
Whitehead Tetraodon site www-genome.wi.mit.edu/annotation/tetraodon
Nat Goodman, PhD, helped found the Whitehead/MIT Center for Genome Research, directed a bioinformatics group at the Jackson Laboratory, led a bioinformatics marketing team for Compaq Computer, and has been consulting ever since. He is currently a free agent in Seattle.