The Wicked Truth About Annotated Data


It’s been six months since the leaders of the free world declared the human genome sequence finished, yet there is still no public database where you can find reliable assemblies and annotations of the genome. This is an incredible shame, because the human genome sequence in its raw form is not terribly useful to most biologists.

Two top-flight road crews are working feverishly to pave usable paths to the genome. Both trails — the Yellow Brick Road and the Golden Path, as they are known — are still under construction. Both are bone-jarringly bumpy in form and content.

The crews are not at fault for the poor conditions. To the contrary, these folks are heroes who were pressed into service at the last minute and given too little money to get the job done on time.

For now, if you want to use the genome in your research, you might be better off taking a detour to the private sector. Here’s why.

Because, because, because, because, because…

Before Celera came barreling along in the rearview, the public human genome effort was cruising on down the road at a comfortable speed. The public effort took several shortcuts to keep its position, which allowed it to reach the finish line in a dead heat with Celera, but left a big mess for the assembly and annotation people to clean up. Six months later, work crews are still digging out.

Pre-Celera sequencing strategy was slow but sure: the genome would be divided into a large number of smallish pieces called bacterial artificial chromosomes (BACs); those would be mapped to positions in the genome; the map would be used to select a minimal set of BACs to be sequenced to completion; those sequences would be assembled to yield the complete genome, a process seen as straightforward with a map to guide the way.
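The step of choosing a minimal set of mapped BACs can be pictured as an interval-cover problem. What follows is only a sketch under stated assumptions, not the procedure the sequencing centers used: clone names and coordinates are invented, positions are in kilobases on a hypothetical map, and the map is taken to be error-free, which the rest of this column makes clear it was not.

```python
from typing import List, Tuple

# A mapped clone as (name, start, end) in map coordinates (kb). All values are hypothetical.
Clone = Tuple[str, int, int]


def minimal_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[Clone]:
    """Greedy interval cover: at each step, among clones that start at or before
    the position covered so far, pick the one that reaches farthest right."""
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[Clone] = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Consume every clone that overlaps or abuts the covered stretch.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in map at position {covered}")
        path.append(best)
        covered = best[2]
    return path


# Hypothetical map of five overlapping clones across a 400 kb region.
clones = [
    ("BAC_A", 0, 150),
    ("BAC_B", 50, 180),
    ("BAC_C", 140, 300),
    ("BAC_D", 170, 320),
    ("BAC_E", 290, 400),
]
path = minimal_tiling_path(clones, 0, 400)
print([name for name, _, _ in path])
```

In this toy map, three of the five clones are enough to span the region. The sketch also shows why a bad map hurts: a clone placed at the wrong coordinates either gets picked when it should not be or leaves a gap that the selection step cannot see.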

Annotation would be the project’s final step. In addition, because mapping is an imprecise art at best, the plan included error-detection procedures that exploited the growing body of sequence data to correct map errors and keep sequencers on track.

It’s a twister

What really happened was a lot more chaotic. To save time, the big sequencing push began before the map was complete. To save even more time, many BACs were only partially sequenced — an abomination that was euphemistically dubbed “draft sequencing.”

Such shortcuts greatly reduced the effectiveness of error detection, inevitably resulting in mistakes. Many BACs were mapped to incorrect locations (often the wrong chromosome). There were also surprising numbers of human errors. For instance, it wasn’t unheard of for someone intending to sequence a certain BAC to pull the wrong plate from the freezer and sequence a completely different one.

Ultimately, the map could only serve as a rough guide for assembly, and teams doing the work had to invent new algorithms on the fly. Not surprisingly, this turned out to be a hard problem.

The purpose of genome annotation is to identify biologically interesting features in the sequence. Some annotation, such as repeat-finding, is simple and merely requires running the sequence through standard algorithms. A subtler method is to overlay known sequences such as genes, STSs, and SNPs onto the genome.

Beyond these simpler tasks lies a series of hard, open research problems such as predicting genes and identifying regulatory sites. Draft sequences, which are short stretches of a few thousand bases rather than long continuous stretches spanning megabases of the genome, make these tough tasks even tougher.

Annotation wizards of East and West

Who knows why the funding agencies never supported genome annotation research with much fervor? Perhaps they thought annotation would be easy. Or perhaps they were practicing just-in-time research funding and planned to start funding this research closer to the original target date of 2005. Whatever the reason, the genome came as a rude awakening when it appeared in mid-2000. No one was ready to annotate it.

Two major public websites — one operated by the US National Center for Biotechnology Information and the other by the European Bioinformatics Institute — provide assembled and annotated human genome sequence. Greg Schuler leads the team producing NCBI’s resource. EBI data is produced by a joint US/European collaboration in which a group led by David Haussler at the University of California at Santa Cruz builds the assemblies and a team led by Ewan Birney at the EBI annotates.

EBI’s software resource, which Birney’s team has made available as open source, is called Ensembl. Assemblies produced by the Santa Cruz team and imported into Ensembl are known as the Golden Path.

The NCBI resource is simply called the NCBI Human Genome Resource, but the Yellow Brick Road is the unofficial name for its assemblies.

A third website operated by Jim Kent, a graduate student at Santa Cruz, is also worth visiting. This may be the best of the bunch right now, though I have to wonder how long a grad student project can lead the pack.

Behind the screen

For the Yellow Brick Road, go to NCBI’s homepage and click on “human genome resources” in the right-hand panel. There you’ll find a search bar near the top of the page; pull down the “maps” option, and type a gene or other symbol in the search field. (Use gene symbols or accession numbers, not gene names.) A rather useless page locates your symbol on an ideogram of the entire genome and lists other maps that contain your feature. If the symbol you entered has been annotated on the NCBI resource, the list of other maps will include “Genes seq.” Clicking on this will take you to a display that gives a high-level view of your feature on the genome sequence. NCBI refers to this display as a “sequence map.”

At Ensembl, which has its own homepage or can be reached from EBI’s homepage, you’ll find a Universal ID Search field. As with the NCBI resource, you can enter symbols or accession numbers, but not gene names. If you enter a gene symbol that has been annotated by Ensembl, the search takes you to a page containing a table with lots of detailed information. To see a display of your gene on the genome sequence, click on the entry in the table’s “genome location” row. If you enter a symbol for an STS or other non-gene feature, you’ll be taken to a map view instead of the detailed table view. From here, you can get to a sequence map view through a menu that is revealed when you run the cursor over the mapped symbols.

I’ll show you, my pretty

I tried several examples on both the NCBI and Ensembl sites and got generally poor results.

First, I examined the genomic region surrounding caspase-1 (CASP1). Two other family members, caspase-4 and -5 (CASP4, CASP5), are known to map near CASP1. NCBI placed the three genes close together as expected, but changed their order relative to that published in Gene Map ’99. NCBI has the correct sequence for the three genes, but only reports one of the five known splice variants of CASP1.

Ensembl, at first blush, showed only CASP1 and CASP4. Looking more closely, I realized that CASP5 was present but had been merged into a single gene with CASP1 and listed as a synonym for CASP1. The error is surprising because CASP1 and CASP5 are much less similar (about 50 percent identity) than are CASP4 and CASP5 (about 75 percent identity). Ensembl reported several predicted sequences for CASP4 and for the CASP1/CASP5 hybrids, but did not include the correct, known sequences of these genes.
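For readers who want to check similarity figures like these themselves, here is a minimal sketch of one common way to compute percent identity from a pairwise alignment. The definition is an assumption (identical residues divided by columns where neither sequence has a gap); other definitions divide by alignment length or by the shorter sequence and will give different numbers. The function name and the toy sequences are invented for illustration and are not caspase data.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


# Toy alignment, not real sequence: 4 identical residues out of 5 compared columns.
print(percent_identity("ACGT-A", "ACCTTA"))
```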

The assemblies presented by the two sites were also quite different. NCBI’s assembly spanning the three caspase genes contained 88,812 bases; the corresponding Ensembl assembly contained 220,323 bases. Most of the unaccounted-for DNA was in one large 120kb chunk; there were also several small (roughly 10kb) missing/extra pieces elsewhere in the region. I have no clue as to which assembly, if either, is correct. Even more troubling: I have no idea how to figure out which one is right without going through all the work of reassembling the sequence.
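The size gap is easy to put into numbers. The short calculation below uses only the two base counts quoted above; the one assumption is that the "120kb chunk" is treated as exactly 120,000 bases, since only a rounded figure is given.

```python
# Assembly sizes reported for the region spanning CASP1, CASP4 and CASP5.
ncbi_bases = 88_812
ensembl_bases = 220_323

difference = ensembl_bases - ncbi_bases
ratio = ensembl_bases / ncbi_bases

# Assumption: the large chunk is exactly 120,000 bases (the text gives only "120kb").
large_chunk = 120_000
remainder = difference - large_chunk

print(f"difference: {difference:,} bases")
print(f"Ensembl / NCBI size ratio: {ratio:.2f}")
print(f"left after the large chunk: {remainder:,} bases")
```

The Ensembl assembly is roughly two and a half times the size of NCBI's, and the large chunk accounts for all but about 11.5kb of the difference. That residue is a net figure, so it is at least consistent with the small pieces partly cancelling out, some missing and some extra. The arithmetic cannot say which assembly is right.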

Click your heels three times

Next, I looked at a gene called neurexin-3 (NRXN3) — a long gene with a lot of small exons. I was surprised to find that, although it is in RefSeq, it is not present in the NCBI annotation; I imagined that NCBI would be fastidious about annotating everything in its premier reference database. NRXN3 is present in Ensembl, but with the wrong sequence. Ensembl lists two predicted sequences for this gene: one adds 140 amino acids to the front of the published sequence, and the other adds 160 amino acids. Both end in the middle of the published sequence.

I also tried one EST example, arbitrarily picking a cluster from UniGene, Hs30732, which did not have a full-length sequence but was mapped to chromosome 9, band q22. I selected one constituent of the cluster, an STS with accession D20387, and tried it on both sites. The symbol D20387 did not work on NCBI, but STS-D20387 did; the opposite held at Ensembl. Another member of the cluster, EST-N94457, failed at both sites.

While examining the region around the STS at NCBI, I noticed that the cytogenetic position of an “aligned” map bounced from 9q22 all the way to 9q34, which makes me leery about trusting other information from aligned maps.

At Ensembl, I saw the gene symbol GRG1 near the STS. Clicking the GRG1 link took me to a page that listed TLE1 as a synonym for GRG1. Clicking the TLE1 link, I learned that TLE1 is a known gene that maps to chromosome 19, not chromosome 9.

The public websites are not yet reliable routes to the human genome. There are so many errors that you really should confirm everything you see — that’s a lot of work. The public crews may never get their roads built, because they have to cut through the jungle of draft sequences. If the public sector really wants to beat Celera, they’d better hunker down and finish the sequence.

An alternative is to turn to private purveyors of annotated data, namely Celera and DoubleTwist. I haven’t looked closely at these products and can only hope that they’re better than the public resources. With Celera’s agreement to make its data available for free (albeit with restrictions), it seems destined to become the first superhighway to the genome. Assuming, of course, its data is really good enough to carry the traffic.


NCBI home page

NCBI Human Genome Resources

EBI home page

Ensembl home page

Jim Kent’s annotation site
