Ewan Birney is a team leader at the European Bioinformatics Institute in Cambridge, UK, and the director of the Ensembl database.
Nat Goodman, Genome Technology’s IT Guy columnist, recently reviewed the human genome one year on. To do this he took two gene families he knows well and compared them on three websites — NCBI, UCSC Genome Browser, and Ensembl. In the process he discovered that Ensembl messed up in both cases (misclassifying the CASPase genes and simply missing the majority of the Neurexin-3 gene). Neither NCBI nor UCSC made as bad an error as Ensembl in Nat’s view.
In both of these cases he was right. We certainly misclassified genes here and missed most of Neurexin-3. Impressively, Nat has discovered nasty corners of the classification process and the gene-building process, both of which we were then, and still are now, trying to fix (we’re better this release than last, but still not good enough). This happens due to tradeoffs between different types of accuracy. It is good for people to understand what these tradeoffs are in order to understand how to use the different tools.
First, to reassure Ensembl users: one of the new features in the latest Ensembl is that you can switch on additional tracks without the tradeoffs. In particular, you can switch on the RefSeq track to see just RefSeq genes spliced onto the genome, and the NCBI track (found under the DAS menu) to see the gene models from NCBI. This release also has a new track, Ensembl EST transcripts, that lets you see a gene build solely from the perspective of ESTs.
As to tradeoffs in the gene-build process: Nat was coming at the site from the perspective of a user who knows a lot about certain genes. In this case you want to see the previous understanding of known genes spliced as perfectly as possible onto the genome. Other people come to the genome from a different perspective. They might be building a microarray or starting a specific cDNA resequencing effort in a population. A protein-families expert has yet another view of the genome. In all cases, users ideally want the closest thing to the truth. And different users respond to errors in different ways.
For a gene hunter who wants to find all possible genes in a particular region, attempting to predict a gene structure from any EST is very useful, even if some of the ESTs are just plain wrong (due to genomic contamination) or seemingly predict a gene in the reverse direction to a well-known gene (due to a library cloning error). The gene hunter wants to see these genes in order to design primers, but the same predictions would be disastrous for the microarray builder.
In the case of Neurexin-3, we had it bang on, 100 percent spliced onto the genome at the start of our gene-build process. However, there is a class of gene-building artifact that is triggered by misassemblies. These give long, spindly genes stretching across many megabases because a number of their exons are tied down to a misassembled region. If we promoted such a case as one of our confirmed genes, it would cause havoc for people who are, for example, trying to build a sensible SNP set for each gene. Suddenly there are many 2 MB “artificial” genes occurring in the database. So we remove these cases with a “spindly gene” catcher. Neurexin-3 is just so shaped that it is one of the good guys getting thrown away with a lot of trash.
In this case we already have a solution in testing: for good genes, with good cDNA support across the whole gene, keep them even if they look odd. But in each case, refining the tradeoff to make fewer errors and find more genes takes time.
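To make the tradeoff concrete, here is a minimal sketch of the kind of heuristic described above: flag a gene as “spindly” when it spans megabases while its exons cover only a sliver of that span, but keep it anyway if full-length cDNA support vouches for it. The function names, thresholds, and data structures are illustrative assumptions, not Ensembl’s actual pipeline code.

```python
# Illustrative sketch only: thresholds and names are assumptions,
# not taken from the real Ensembl gene-build pipeline.

def genomic_span(exons):
    """Distance from the first exon start to the last exon end."""
    starts = [start for start, end in exons]
    ends = [end for start, end in exons]
    return max(ends) - min(starts)


def exon_length(exons):
    """Summed length of the exons themselves."""
    return sum(end - start for start, end in exons)


def is_spindly(exons, max_span=2_000_000, min_exon_fraction=0.01):
    """A 'spindly' gene stretches over megabases while its exons
    cover almost none of that span -- the signature of exons
    pinned to a misassembled region."""
    span = genomic_span(exons)
    if span == 0:
        return False
    return span > max_span and exon_length(exons) / span < min_exon_fraction


def keep_gene(exons, cdna_coverage):
    """Refined rule: keep a spindly-looking gene if a cDNA aligns
    across nearly all of it (coverage fraction close to 1.0)."""
    if cdna_coverage >= 0.95:
        return True
    return not is_spindly(exons)


# A misassembly artifact: two exons 5 Mb apart with weak cDNA support
# is dropped; the same shape with near-complete cDNA support survives.
artifact = [(0, 2_000), (4_999_000, 5_000_000)]
print(keep_gene(artifact, cdna_coverage=0.20))  # False: filtered out
print(keep_gene(artifact, cdna_coverage=0.98))  # True: rescued by cDNA
```

A compact gene (say, three exons within 50 kb) never trips the filter at all, which is why the catcher only bites genuinely long, sparse gene structures like Neurexin-3.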
Both UCSC and NCBI aim to be useful to people who know a lot about known genes and who want, in the case of UCSC, to see all the evidence without filtering, or, in the case of NCBI, to see the integration with the existing resources in LocusLink.
Both let in many, many more errors. From our perspective that is sensible, because most people lazily assess a site on what they know, not on how many errors it makes.
Ensembl’s gene set is targeted at being useful to many different audiences, but in particular to people who want to build on top of the genome information, whether through future wet experiments or computationally. We have always been conservative, but we aim for a complete, high-quality dataset that other people can build on. For example, we are still the only public site that has consistently built a protein dataset that we believe covers all genes that can be found in the human genome with a low error rate — an immensely valuable resource for downstream work.
In contrast to Nat’s assessment of a very small sample (two gene families, six genes in total), in a spot test that I did of a random set of 10 genes, all three sites essentially agreed on all 10. Ensembl missed two UTRs, whereas NCBI had one reasonably clear-cut misprediction overlapping a real gene. It is hard to rate UCSC’s results, as they present all the evidence (including Ensembl’s gene predictions) and leave the user to interpret it. No site missed or misclassified any of these genes, which is what I would expect.
In any case, Ensembl will maintain its quality over this year for human, and go on to other genomes, such as mouse and zebrafish. Our system is also being used by other groups on genomes as diverse as fugu and rice — a tribute to the openness of our software. Quality genome annotation for all users and all uses is our goal.
Opposite Strand is a forum for readers to express opinions and ideas about trends and issues in genomics. Submissions should be kept to 550 words and may be submitted to [email protected]