Christos Ouzounis, head of the computational genomics group at the European Bioinformatics Institute, recently marshaled a global team of researchers from EBI, the Whitehead Institute, IBM, and six other research organizations to answer one of bioinformatics’ big questions: How accurate are automated genome annotation methods compared to manual approaches? After countless e-mails, conference calls, and a face-to-face meeting to re-annotate Chlamydia trachomatis, the team published the surprising results of its study in the April issue of Bioinformatics [2003 Apr 12;19(6):717-726]. Perhaps not so shocking was the finding that annotation errors are extremely common — for the original annotation, for example, the team found that domain errors, overpredictions, and false positives made up 13 percent of the total number of predicted genes. However, it turns out that automated systems performed just as well as human experts when it came to overall accuracy. BioInform spoke to Ouzounis last week about the implications of these findings.
Is this the first such benchmarking study to be undertaken in the area of genome annotation?
It’s the first of its kind on this scale. People have done this piecemeal for a number of cases, but this is the first genome-wide re-evaluation of annotations. So it differs in quantity, not in quality.
How did you team up with the international group of collaborators for this project?
It was funded by the European Commission. This was a benchmark for [an automated annotation] system we’ve been using that was developed by Chris Sander and colleagues called GeneQuiz. We’d been using it for quite a while, but the validation step had not been done properly or exhaustively, so we had to team up with a number of groups to share the load.
Could you walk me through the steps of the process you used to re-evaluate the original annotation?
The original publication was in Science a few years back, I think in 1998 or so, and the annotation was done very well, yet there were some discrepancies. We obtained the results with GeneQuiz some time in 1999, and then we spent a couple of years on and off through this network of labs. We shared the load, we did the annotations three times for each group, and we exchanged all of our annotations and finally compiled the final list. I acted as an editor rather than as an annotator. We encoded this in a structured format just to make things easier, and then we compared what we believed to be the gold standard — our final annotation — against the original published manual [set] and the automatically derived set. So there were two comparisons made across the three sets.
Why was it so surprising that the automatic annotation performed well compared to the manual annotation?
It was very surprising because every time you talk to people who are real experts in genome annotation, they will tell you that the system is just a first-pass thing and then human experts have to put in the last touches. But what’s happening is that we’re spreading [inaccurate] things. We’ve been quite harsh with our criteria, both for the manual and the automatic sets, so if we found any discrepancy we would flag it. If you do this as objectively as you can, then the automatic annotations are surprisingly good, which means that the human experts are not doing much more than an automated system — they run BLAST, they look at the results, and they pick up the best hits and so forth, so it’s actually remarkably consistent, I would say.
Is that because the automated systems work better than expected or because the human annotators are worse than expected?
It’s probably because we humans don’t trust robots or machines and we always think we’re doing something more valuable than a stupid program. But apparently, we’re not doing much more. Of course, we discussed the differences [in the paper]. For example, the [automated] system is designed so that it errs on the side of caution — it doesn’t pick up as much as a human expert does, so these are false negatives. But on the other hand, it doesn’t [introduce] typos or semantic errors.
What is the potential impact of those errors?
That’s another contentious issue because I think everybody knows about it but not everyone wants to talk about it. But we got some really amazing examples. For instance, the next genome sequenced, one very close to the one we analyzed, picked up a number of typographical errors from the original annotation, and that creates a problem when you are querying a database by keywords. For example, you want to find all the methyltransferases, but if there’s a typo you won’t pick them up by keyword. So it looks very innocent, but if a missing character in a protein name implies that the protein is a kinase instead of a transferase, it could be quite serious. We know this is happening quite frequently in the genome databases because people are copying each other’s annotations.
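The keyword-query failure Ouzounis describes is easy to see in miniature. In this sketch the gene IDs, product strings, and the typo itself are all invented for illustration — they are not from the study:

```python
# Hypothetical illustration: gene IDs and product descriptions are invented.
# A single propagated typo makes an entry invisible to keyword search.
annotations = {
    "CT001": "DNA methyltransferase",   # correct spelling
    "CT002": "DNA methyltrasferase",    # propagated typo (missing "n")
}

def find_by_keyword(annotations, keyword):
    """Return gene IDs whose product description contains the keyword."""
    return [gene for gene, desc in annotations.items()
            if keyword.lower() in desc.lower()]

# The query for all methyltransferases silently misses CT002.
hits = find_by_keyword(annotations, "methyltransferase")
```

The error costs nothing at submission time but corrupts every downstream keyword query, which is why copied annotations amplify the damage.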
So what should researchers keep in mind about genome annotation?
We provide a number of guidelines that warn people to be more careful: Use a strategy to be more conservative in terms of assigning functions; clearly indicate the source and whether it’s based on experimental [evidence] or by a similarity-type of prediction; avoid context-free terms such as putative, or possible, or hypothetical, which don’t convey meaning but [are] propagated unnecessarily; use dictionaries such as the Gene Ontology when you’re unsure about exact assignments. Our sequence analysis tools are pretty sophisticated in the sense that they perform an exact test — that of finding similar sequences in a database — but this doesn’t mean that these similar sequences have similar functions. And then this conceptual leap that people make — ‘Oh, I picked up a homolog, therefore I copy the function’ — has to be done far more systematically and rigorously than we’re currently doing.
Did you learn anything in particular that would lead to improvements in GeneQuiz or other automatic annotation systems?
Yes. It would be fantastic if we had a protein database that contained only functions characterized by experiment, as opposed to lots of annotations copied from other entries through similarity. I believe a number of databases including SwissProt are heading towards that by flagging entries that have experimentally determined functions supported by papers, so a user can distinguish which ones have been annotated by similarity and which ones were analyzed in the lab.
It seems surprising that that information wouldn’t already be available.
It’s amazing, actually, because initially SwissProt did contain only proteins characterized by experiment. But over the years, following the wave of genomics, it incorporated annotations that are by similarity. ... It’s actually frustrating to see a Science paper contain a number of essentially “false” statements. It’s not the same as if you were to produce a false experiment — the experiment is correct, it’s just that parts of the interpretation are incorrect. In the old days, people would retract a paper, but you can’t really retract a genome paper. The Human Genome Project can’t retract the draft sequence because it’s partially incomplete!
Can you tell me a little bit about the system you created to validate the results of the re-annotation?
It’s an attempt to encode protein function in a more precise way, especially reaction information in terms of EC numbers from the Enzyme Commission. First of all, these are extremely useful for downstream processing, for example building metabolic maps based on enzyme definitions. Second, it would allow users to include information, for example, for protein families in the encoded scheme. We don’t propose it as a standard, it’s just how we internally annotate genomes, and hopefully some good ideas will be picked up. Don’t forget, for example, that the GenBank format, although very sophisticated, was never meant to be used for complete genomes. For example, you cannot encode paralogous genes — genes that are similar within a genome — in GenBank. You can encode start positions and end positions, but this was meant for partially sequenced contigs.
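The paper doesn’t reproduce the group’s internal scheme here, but the idea of a structured, EC-bearing annotation record can be sketched. Field names, the locus tag, and the evidence labels below are my own invention; the EC number is only an illustrative placeholder:

```python
# Sketch of a structured annotation record carrying an EC number.
# Field names and values are invented; this is not the group's actual format.
annotation = {
    "gene_id": "CT123",           # hypothetical locus tag
    "product": "site-specific DNA methyltransferase",
    "ec_number": "2.1.1.72",      # illustrative EC number
    "evidence": "similarity",     # experimental vs. similarity-based
    "source": "GeneQuiz",         # where the assignment came from
}

# EC numbers are hierarchical, so downstream processing (e.g. building
# metabolic maps) can group enzymes by class from the first digit.
ec_class = annotation["ec_number"].split(".")[0]  # "2" = transferases
```

Because the EC hierarchy is machine-readable, an explicit `ec_number` field supports exactly the downstream uses Ouzounis mentions, which a free-text product name cannot.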
So the GenBank format hasn’t kept up with the data?
That’s correct, and it’s more than the formats — it’s also the semantics of these schemas. The semantics in GenBank are sometimes not used properly by people who submit sequences. For example, there’s a particular field for the EC number in GenBank for the gene product, but people will tend to include EC numbers in a comment line instead of the field. So the format allows enough precision, but users don’t comply with the semantics of the format.
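The compliance problem Ouzounis describes forces consumers to scrape free text for data the format already has a field for (GenBank does define a dedicated EC-number qualifier for gene products). The feature dictionaries below are invented examples, not real database entries:

```python
import re

# Two invented CDS-style qualifier dicts: the first uses the dedicated
# EC_number qualifier; the second buries the number in a free-text note,
# as submitters often do.
features = [
    {"product": "thymidylate synthase", "EC_number": "2.1.1.45"},
    {"product": "adenylate kinase", "note": "adenylate kinase; EC 2.7.4.3"},
]

EC_PATTERN = re.compile(r"\b(\d+\.\d+\.\d+\.\d+)\b")

def extract_ec(feature):
    """Prefer the dedicated qualifier; fall back to scraping free text."""
    if "EC_number" in feature:
        return feature["EC_number"]
    match = EC_PATTERN.search(feature.get("note", ""))
    return match.group(1) if match else None

# Both EC numbers are recoverable, but only the first one reliably:
# the second depends on a regex guessing the submitter's comment style.
ecs = [extract_ec(f) for f in features]
```

Every consumer ends up writing its own fallback heuristics, which is precisely the semantic non-compliance he is pointing at: the precision exists in the format, but not in the data.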
Did you learn anything else from your study?
We tried to collaborate over the internet on this, and we sort of failed in the sense that we would do it over e-mail and web sites and stuff like that, but ultimately we had to invite people locally and do it in one office. What we discovered was that the technology exists for collaborative environments in business and other areas, like the military and finance, and there are virtual collaborative environments for people to exchange notes and comments and images over the internet, but in genomics or bioinformatics we don’t have a custom solution. It would be wonderful, in the next 10-15 years, whenever such a solution is viable from the software engineering point of view, to have virtual meetings to annotate genomes instead of putting all the people in one room.