It’s the second birthday of the baby human genome (soon to be reborn, I understand, as an operationally complete little tyke). This year’s birthday present is very special: a new sibling. Just what every two-year-old wants. Even better, it’s a mouse.
The grownups are excited by the new arrival, because we can learn so much by comparing the two kids. I spent a few days doing this, trying to unravel a biological curiosity in one small region of the genome. I didn’t sate my curiosity, but I learned about some pitfalls in comparing genomes to answer fine-grained biological questions — problems that never seem to come up in the big papers.
A New Arrival
When the human genome arrived two years ago, I rushed to peer at my favorite gene using the new genome browsers of the day. I still remember the excitement of seeing that gene against the background of so many other features aligned to the genome. What was most striking was the large number of ESTs splattered across the region, and even more so, the large number of these ESTs that hit my gene’s introns — totally unexpected.
Ever since, I’ve wondered what those intronic ESTs might mean. Are they evidence of unsuspected alternative splice forms of my gene? Are they novel noncoding RNA genes, perhaps a new form of microRNA? Or are they just experimental junk?
When the mouse genome joined the family, I finally had a way to answer these questions. I reasoned that if the ESTs were real, they’d light up in a comparison with mouse. It seemed a straightforward plan: align the human and mouse genomic regions, plunk the ESTs onto the alignment, and look for ones that hit conserved regions. Of course, it didn’t work out that way. When it comes to families, nothing is ever as simple as it sounds.
Getting to Know You
The gene in question is the Huntington’s disease gene. It’s a large gene, spanning about 169 kb on human chromosome 4. The gene has 67 exons that splice to form a transcript of 13,672 bases. The coding region is 9,435 bases, encoding a big protein of 3,144 residues.
The database entries for the mouse ortholog, Hdh, are less definitive, as one might expect from such a young babe. I based my work on genome assembly MGSCv3 (the one used in the mouse genome paper) and annotation Build 29 from the US National Center for Biotechnology Information.
The NCBI genome data shows Hdh to be about 13 percent smaller than the human gene, spanning about 147 kb on mouse chromosome 5. The size is not surprising, since the entire mouse genome is about 14 percent smaller than human. The transcript is 10,032 bases, the coding region 9,363 bases, and the protein 3,120 residues. The human and mouse proteins are about 90 percent similar.
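These figures can be cross-checked with a little arithmetic. The Python sketch below uses only the numbers quoted above. It assumes that each coding-region length includes a three-base stop codon, which is the usual convention but is not stated in the database entries themselves.

```python
# Figures quoted in the text for the human HD gene and its mouse ortholog, Hdh.
HUMAN = {"span_kb": 169, "cds_bases": 9435, "protein_residues": 3144}
MOUSE = {"span_kb": 147, "cds_bases": 9363, "protein_residues": 3120}


def residues_from_cds(cds_bases: int) -> int:
    """Residues encoded by a coding region.

    Assumption: the length includes one terminal stop codon, which
    does not contribute a residue.
    """
    if cds_bases % 3 != 0:
        raise ValueError("coding region length is not a multiple of three")
    return cds_bases // 3 - 1


def percent_smaller(smaller: float, larger: float) -> float:
    """How much smaller one value is than another, as a percentage."""
    return 100.0 * (larger - smaller) / larger


if __name__ == "__main__":
    for name, gene in (("human", HUMAN), ("mouse", MOUSE)):
        expected = residues_from_cds(gene["cds_bases"])
        print(name, "residues from coding region:", expected,
              "quoted:", gene["protein_residues"])
    shrink = percent_smaller(MOUSE["span_kb"], HUMAN["span_kb"])
    print("mouse gene is about", round(shrink), "percent smaller")
```

Under that stop-codon assumption, both protein lengths agree with their coding regions, and the 147 kb versus 169 kb spans give the 13 percent difference quoted above.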
One complication in these numbers is that the HD gene contains a CAG/CAA repeat (which translates into a polyglutamine repeat) whose length varies among individuals and mouse strains. The human numbers are for an individual with 18 repeats, while the mouse data is from a strain with seven repeats. Note that the human reference sequence for the gene has 23 repeats. (Expansion of this repeat causes the disease. Humans almost always have a pure CAG repeat which is susceptible to expansion, while mice usually have a mixed repeat which is resistant.)
Baby’s First Steps
I downloaded the human and mouse genomic sequences for the gene, including 10 kb extra on either side. To get the exons, I grabbed the NCBI reference contigs for these regions (in GenBank format) and parsed them with BioPerl. To learn which ESTs hit the region, I went to the University of California at Santa Cruz genome website, downloaded table chrN_est for the region, and parsed it with an ad hoc Perl script.
A simple Perl script later, I determined that 422 ESTs hit the human HD region and that 191 of these (45 percent) were intronic. I defined an EST to be intronic if at least one of its matches to the region fell 50 percent or more in an intron or outside the boundaries of the gene. The term non-exonic would be technically more accurate, but a bit of a mouthful.
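For readers who want to try the same classification, here is a minimal Python sketch of the 50 percent rule. It is not my original Perl script. It assumes each EST match and each exon is a half-open interval in region coordinates and that the exons do not overlap one another; the function names and example coordinates are made up for illustration.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in region coordinates


def exonic_bases(match: Interval, exons: Iterable[Interval]) -> int:
    """Count the bases of a match that fall inside exons.

    Assumption: exons do not overlap, so per-exon overlaps can be summed.
    """
    start, end = match
    return sum(
        max(0, min(end, exon_end) - max(start, exon_start))
        for exon_start, exon_end in exons
    )


def is_intronic(matches: List[Interval], exons: List[Interval],
                threshold: float = 0.5) -> bool:
    """True if at least one match is `threshold` or more non-exonic.

    Bases in introns and bases outside the gene both count as non-exonic.
    """
    for start, end in matches:
        length = end - start
        if length <= 0:
            continue  # skip empty or malformed intervals
        non_exonic = length - exonic_bases((start, end), exons)
        if non_exonic / length >= threshold:
            return True
    return False


if __name__ == "__main__":
    exons = [(100, 200), (500, 600)]                   # hypothetical exons
    print(is_intronic([(100, 200), (500, 600)], exons))  # fully exonic EST
    print(is_intronic([(150, 250)], exons))              # half exon, half intron
    print(is_intronic([(700, 800)], exons))              # beyond the gene
```

Note that a single qualifying match is enough to label the whole EST intronic, which is why ESTs that overlap an exon but run into the neighboring intron are counted.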
The next step was to see which of the 191 intronic ESTs were well conserved with mouse. I did this by aligning the human and mouse sequence with Webb Miller’s PipMaker web server and visualizing the results with his Laj alignment viewer.
The PipMaker/Laj combination works pretty well, but it takes some work to install all the supporting programs and to generate the input files it needs in the required format. I found this to be the case with all the genome comparison tools I looked at, which is a serious impediment to trying out new software.
PipMaker compares a test sequence to a base sequence. It operates on four input files that describe features of the base sequence, in addition to the sequences themselves: exons; repeats; annotations you want displayed in the output; and “underlays” that define positions you want specially colored in the output. The PipMaker suite includes a collection of programs for preparing some of these files, but I still found it necessary to craft a good number of Perl scripts to set up the input and analyze the output.
Laj is a Java program for visualizing long genomic alignments. Laj takes the same files as PipMaker, plus the alignment generated by PipMaker.
The program presents information in five horizontal panes. The horizontal axis represents the base sequence in all panes. The top pane is a dot plot comparing the two sequences. Since the human and mouse HD regions are reasonably similar, the dot plot shows the usual near-45-degree line of similarity with sporadic breaks.
A few panes down is the Pip (percent identity plot) which shows for each short region of the base sequence the percent identity of its best match to the test sequence; to reduce the visual clutter, the program only shows matches above a threshold (50 percent by default). What you see, especially when you zoom in, are long-ish horizontal lines corresponding to exons, nestled among stretches of multiple short lines indicating regions that are weakly conserved, and empty spaces for regions that are not conserved at all. The Pip is colored in accordance with the underlays file; I used red coloring for exons so they’d stand out clearly. Other panes show the annotations specified in the input (ESTs in my case), repeats, and the aligned sequences.
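To make the idea of a percent identity plot concrete, here is a minimal Python sketch. It is not PipMaker’s actual algorithm. It assumes the input is a single pairwise alignment given as two equal-length strings with `-` for gaps, splits it into gap-free segments, and reports each segment’s position on the base sequence along with its percent identity, dropping segments below the display threshold.

```python
from typing import List, Tuple

Segment = Tuple[int, int, float]  # (base_start, base_end, percent_identity)


def pip_segments(base_aln: str, test_aln: str,
                 min_identity: float = 50.0) -> List[Segment]:
    """Percent identity of each gap-free segment of a pairwise alignment.

    Coordinates are 0-based, half-open positions on the ungapped base
    sequence. Segments below `min_identity` are omitted.
    """
    if len(base_aln) != len(test_aln):
        raise ValueError("aligned strings must have equal length")

    segments: List[Segment] = []
    base_pos = 0     # next position in the ungapped base sequence
    seg_start = 0    # base coordinate where the current segment began
    seg_len = 0      # columns in the current gap-free segment
    seg_matches = 0  # identical columns in the current segment

    # A sentinel gap column at the end closes the final segment.
    for b, t in zip(base_aln + "-", test_aln + "-"):
        if b == "-" or t == "-":
            if seg_len > 0:
                identity = 100.0 * seg_matches / seg_len
                if identity >= min_identity:
                    segments.append((seg_start, seg_start + seg_len, identity))
            seg_len = 0
            seg_matches = 0
            if b != "-":
                base_pos += 1  # base residue aligned to a gap in the test
            seg_start = base_pos
        else:
            if seg_len == 0:
                seg_start = base_pos
            seg_len += 1
            seg_matches += int(b.upper() == t.upper())
            base_pos += 1
    return segments


if __name__ == "__main__":
    for segment in pip_segments("ACGTACGT--ACGT", "ACGTTCGTAAAC-T"):
        print(segment)
```

Plotting each segment as a horizontal line at its identity value, spanning its base coordinates, gives the kind of picture described above: long lines for exons and scattered short ones for weakly conserved stretches.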
My initial foray with Laj wasn’t very informative, because it was hard to pick out the anomalous ESTs from the sea of normal ones. It took another Perl script or two to filter the list of ESTs to ones I cared about, namely intronic ESTs that were highly conserved in mouse, defined as hitting a region of the PipMaker alignment with 75 percent or more sequence identity. Only 49 ESTs passed this filter.
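The conservation filter can be sketched in a few lines of Python. This is an illustration rather than my original script. It assumes the alignment has already been reduced to segments of the form (start, end, percent identity) on the human sequence, and it treats any overlap between an EST match and a qualifying segment as a hit, since how much overlap to require is a judgment call. The EST names in the example are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]        # half-open [start, end) on the base sequence
Segment = Tuple[int, int, float]  # (start, end, percent_identity)


def hits_conserved_region(matches: List[Interval], segments: List[Segment],
                          min_identity: float = 75.0) -> bool:
    """True if any match overlaps a segment at or above `min_identity`."""
    return any(
        seg_identity >= min_identity and start < seg_end and seg_start < end
        for start, end in matches
        for seg_start, seg_end, seg_identity in segments
    )


def conserved_ests(est_matches: Dict[str, List[Interval]],
                   segments: List[Segment],
                   min_identity: float = 75.0) -> List[str]:
    """Names of ESTs with at least one match in a well-conserved segment."""
    return sorted(
        name for name, matches in est_matches.items()
        if hits_conserved_region(matches, segments, min_identity)
    )


if __name__ == "__main__":
    segments = [(0, 100, 60.0), (200, 300, 80.0)]  # made-up alignment segments
    est_matches = {
        "EST_A": [(250, 400)],  # overlaps the 80 percent segment
        "EST_B": [(20, 90)],    # only hits the 60 percent segment
        "EST_C": [(300, 350)],  # touches no segment (intervals are half-open)
    }
    print(conserved_ests(est_matches, segments))
```

In practice the input would be the list of intronic ESTs rather than all ESTs, so that only ESTs passing both tests reach the viewer.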
When I looked at the reduced dataset in Laj, several observations jumped out. Most of the ESTs were hitting regions of spotty similarity, with only short stretches of strong identity. Many landed in repeats. Many overlapped with exons, but extended a long way into the neighboring introns. In other words, not what I expected: more questions than answers.
The only really strong match I found was an unspliced EST, BE883961, that lies about 4 kb beyond the 3’ end of the gene. Blasting this EST against GenBank showed a strong but partial match to a RIKEN full-length mouse cDNA. Blasting the RIKEN sequence showed a strong match to a predicted human gene that lies in exactly the same region as my EST, but which contains only a small fragment of my EST. Huh? This shouldn’t happen.
Something real must be going on, but what?
Two kids are more work than one. To learn anything useful at a detailed level, you have to keep your eyes on both at the same time, and watch them very closely. There’s twice as much data to keep track of, and many more analyses to be done.
Good software and databases are essential, of course. A big issue is that comparative genomics software is complex, and it takes more than the usual amount of work to get any package up and running. I used just one package for this article — Webb Miller’s PipMaker suite — but several other packages are available and deserve a look.
The genome family is only going to get bigger. That’s a good thing, but we’d better get our software house in order before too many kids start crying.
IT GUY SAYS: Bring Back Open Data Release
A debate is raging about the data release policy that governs what people can do with the sequence data produced by the Human Genome Project. The outcome of this debate will affect large-scale biological projects for years to come.
The public Human Genome Project has been a leader in the open release of biological data. The HGP’s longstanding policy has been to release data promptly and without restrictions. This policy was the standard practice of many genome centers since the inception of the HGP, formalized at the famous Bermuda meeting in 1996, and reiterated in July 1999 in a policy statement by the US National Human Genome Research Institute. It was also the centerpiece of a much-heralded joint statement by President Clinton and Prime Minister Blair in March 2000 celebrating the completion of the human genome draft.
Then, in December 2001, the policy changed. The new rules prohibit people from publishing large-scale analyses of the public sequence data until the sequencing centers publish their own results. Lee Rowen et al., in an essay published in Science, go a step further and argue that this prohibition should apply to all computational analyses of unpublished sequence data, or more precisely, to all analyses “that the sequence producers could reasonably have planned.” Richard Heyman, in a letter to the editor of Science, piles on and argues that violation of this policy should constitute scientific misconduct on the level of plagiarism.
This seems a far cry from the unfettered access envisioned in the pioneer days of the HGP.
I firmly support the original open-data policy of the HGP and hope the current restrictions are dropped. This is consistent with my support of open-source licensing for academic software.
Genome sequencing and other large-scale data production projects are oligopolies. The funding agencies assemble a small number of teams, divide the work among them, and refuse to let anyone else participate. This is a sensible approach given the high cost and utilitarian nature of such projects. But it is important to draw the boundaries of the oligopoly as narrowly as possible so that competition is not unduly restrained.
The natural boundaries of the sequencing oligopoly are rather clear: the huge cost lies in generating the data. Everything outside this boundary should be fair game for competition.
The counterargument is that the big sequencing groups will get scooped and be denied their fair credit if competitors are allowed to analyze the data first.
Let’s use the mouse genome paper to understand the implications of this argument. The publication is a mega-paper with 222 authors that reads a lot like an anthology. There are sections on sequencing strategy, assembly, synteny, repeats, genes, proteins, evolution, and consequences for mouse genetics. Each section could stand as a separate paper, and I suspect that each was primarily authored by a different group of people.
The net effect of this bulk authorship is to hide the contributions of each author and deny proper credit to most of the people who worked on the paper. Insiders can guess who led the work on each section, but it’s never spelled out.
This mega-paper approach flies in the face of the growing sentiment in the broad scientific community for increased responsibility and transparency of authorship. This sentiment is driven by concerns over scientific fraud and excessive honorary authorship, as in the case of the physicist Jan Hendrik Schön. These are not issues here. Still, the central principles apply: scientists should get credit for the work they do, not for the work they’re associated with; and authors must be willing to accept responsibility for anything they take credit for.
The people who generated the sequence data were not, for the most part, involved in its analysis and have no right to claim credit for the analysis. Conversely, the data analysts have no right to claim credit for the data production. Each part can and should stand on its own. Once this is done, the argument for banning competition in the analysis arena dissolves, and we can return to the sensible policy of open data release. — NG
What to Expect
|Santa Cruz genome site||http://genome.ucsc.edu/|
Toys for the Nursery
|PipMaker, Laj||Bioinformatics Group, Pennsylvania State University|
|Vista, GenomeVista, Avid||Genome Sciences Department, Lawrence Berkeley National Laboratory|