The Gene Ontology project has always been a collaboration of researchers from various model organism and other databases — the GO began to develop ontologies to describe gene products for a variety of organisms more than a decade ago. But for annotation, Swiss Institute of Bioinformatics' Pascale Gaudet says that each group was using its own methods. "We decided that it would be a lot more productive if we got together, we were mature enough to work together and try to benefit from each other's experience and knowledge of biology to do the annotations more or less together in a more coordinated way," she says. And thus began the Reference Genome Project.
The goal of the GO Reference Genome Project is to develop consistent annotations for about a dozen model organisms in the GO database. The project includes members from the various model organism databases, including those for Arabidopsis thaliana, Caenorhabditis elegans, and Mus musculus. The group aims to bring together the knowledge of these diverse model organism groups to inform each other's work. "The knowledge is very complementary and so the first intent was to pair the knowledge between the different groups, which means that what we can learn from yeast genes may be transferred to predict the function of human or mouse or fly genes ... and vice versa," Gaudet adds.
The knowledge may be complementary, but the groups' approaches aren't always, which makes it tricky to set priorities for which genes the entire project should focus on; the project first tried one approach before switching to a new one. At the same time, the project members have been developing and trying out a new annotation tool that is just about ready to go into wide use.
"It's an exciting project, it's a fun project, it's really on the cusp, the frontier, the forefront where I think annotation is going in the future," says Mike Livstone from Princeton University, who was part of the project for about two years.
The Reference Genome Project has a lot to tackle, and each group that is part of the project has its own interests, so setting priorities doesn't come easily. At the beginning of the project, Gaudet says the group would pick genes from newly published articles to annotate, and aimed to annotate about 20 genes a month. "So every group would say, 'Oh there's a new mouse study that is very interesting,' or 'There's a really cool fly paper that studies one gene or newly characterized genes.' And so we tried to do that for a while," she says. "That was nice; it gave everybody a chance to bring up the value of the system they are studying."
The problem was, though, that they were trying to cover too many disparate areas at once. "So, 20 genes of different areas of biology can be very difficult to annotate, because you start working on one subject about something and then you have to switch gears," Gaudet adds.
An additional challenge is that each group also has its own culture peculiar to that lab or its own way of doing the work. "One group gets used to working on understanding a certain aspect of biology or interprets some experiments a certain way, and it's possible to use different species to understand other aspects of a group of proteins, but when you try to bring all that together, it can be challenging," Gaudet says. "And there's lots of fun arguments about all that." Members of the Reference Genome Project get together about twice a year, and also have phone and Web conferences.
Now the project is taking a different approach to setting its annotation priorities, and has been developing a tool to help in that annotation process. Instead of mining the recent literature, Gaudet says that for the last year the group has been following a more biologically determined way to set annotation priorities — what she calls a "textbook" approach. "You imagine you have the table of contents in a cell biology textbook, and you just go through it: transcription, translation, nucleotide biosynthesis, amino acid biosynthesis, and so on," she says. "We first want to do all the basic processes so that, thinking that the users, what they really need from the GO is to have at least some basic information about every gene."
For example, she says they are currently focused on annotating transcription factors involved in heart development, which is an area of interest for one of the collaborating groups. Next, they will focus on apoptosis. "For the annotation process, it helps us focus on one area of biology," Gaudet says.
Most of the annotation work to this point has been based on experimental studies, but for the past two years or so, the group has been developing a Java-based annotation tool that relies on phylogenetic trees. They began to work with Paul Thomas at SRI, who now oversees the Protein Analysis Through Evolutionary Relationships, or PANTHER, system that was developed by Celera Genomics and Applied Biosystems.
Princeton's Livstone also worked on the tool development, which brought together experimental annotations from the different model organism groups. "Each group has its own annotation standards or culture, so that had to be taken into account," he says, adding that the state of knowledge for the different model organisms also varies. To develop the tool, Livstone looked at all the experimental annotations and made a two-step inference. First, he would determine, for each protein function, the point in evolutionary history when that function evolved and the ancestor in which it evolved. The second step was to assume, if there was no evidence to the contrary, that the descendants of that ancestral protein likely had the same function. "In that way, you can transfer annotations from one or more extant proteins to additional proteins," he says.
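The two-step inference Livstone describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's Java tool: the tree, the gene names, and the functions `find_origin`, `propagate`, and `infer` are all hypothetical. It also assumes, for simplicity, that a function arose at the most recent common ancestor of every experimentally annotated protein, whereas in the project that call is made by a curator weighing the evidence.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A node in a (hypothetical) protein family tree; leaves are extant proteins."""
    name: str
    children: list[Node] = field(default_factory=list)

    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def find_origin(node, annotated):
    """Step 1 (simplified): deepest node whose subtree holds every annotated leaf."""
    names = {leaf.name for leaf in node.leaves()}
    if not annotated <= names:
        return None
    for child in node.children:
        hit = find_origin(child, annotated)
        if hit is not None:
            return hit
    return node


def propagate(origin, annotated, contrary):
    """Step 2: descendants inherit the function unless evidence says otherwise."""
    return sorted(
        leaf.name
        for leaf in origin.leaves()
        if leaf.name not in annotated and leaf.name not in contrary
    )


def infer(root, annotated, contrary=frozenset()):
    """Return (ancestor name, leaves that receive an inferred annotation)."""
    annotated = set(annotated)
    if not annotated:
        return None, []
    origin = find_origin(root, annotated)
    if origin is None:
        return None, []
    return origin.name, propagate(origin, annotated, set(contrary))


# Hypothetical family tree, for illustration only.
tree = Node("root_ancestor", [
    Node("opisthokont_ancestor", [
        Node("animal_ancestor", [
            Node("vertebrate_ancestor", [Node("human_gene"), Node("mouse_gene")]),
            Node("fly_gene"),
        ]),
        Node("yeast_gene"),
    ]),
    Node("arabidopsis_gene"),
])

# Experimental annotations exist for the mouse and fly proteins only.
print(infer(tree, {"mouse_gene", "fly_gene"}))
# ('animal_ancestor', ['human_gene'])
```

Passing `contrary={"human_gene"}` to `infer` would leave the human protein out, which mirrors the "no evidence to the contrary" condition in the second step.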
Gaudet adds that they've been testing the tool and training curators to use it, and it's now ready to use. In addition, a paper describing the tool and its approach is in press at Briefings in Bioinformatics.
Tools like this one, Livstone says, will also help alleviate the pressure on curators to do more work with less money. "One of the ways you can allow a curator to make more annotations in the same amount of time is if you magnify them using, for example, this kind of framework that allows them to transfer one annotation from an ancestral protein and it goes to, say, 50 to 100 descendants of that protein: one decision, one evaluation has a much broader impact," he says.
Of course, new information is continually coming out about various genes and new evolutionary and phylogenetic relationships, so annotation work is difficult, if not impossible, to ever finish. Gaudet says that the group should cover, with one pass, the approximately 7,000 PANTHER protein families during its next grant cycle. "And then of course annotation is never finished because there's always new knowledge," she says, adding that "things need to be maintained if they are to remain useful tools, otherwise the information database and the analysis are not as meaningful as they could be."
Model Organisms and participating database groups:
Arabidopsis thaliana (The Arabidopsis Information Resource)
Caenorhabditis elegans (WormBase)
Danio rerio (Zebrafish Information Network)
Dictyostelium discoideum (dictyBase)
Drosophila melanogaster (FlyBase)
Escherichia coli (EcoliHub)
Gallus gallus (AgBase)
Homo sapiens (Human UniProtKB-Gene Ontology Annotation)
Mus musculus (Mouse Genome Informatics)
Rattus norvegicus (Rat Genome Database)
Saccharomyces cerevisiae (Saccharomyces Genome Database)
Funding: According to Gaudet, about half of the participating groups have funding from the Gene Ontology consortium for both this and other GO projects, while other groups rely on their own funding.
Timeline: Ongoing, with plans to continue for five years