Ross Overbeek is developing a computational framework to annotate 1,000 genomes over the next three years. Overbeek, who started his career in the academic world and then worked at Argonne National Laboratory before moving into the private sector, eventually came to the decision that the best mechanism for the annotation system he had in mind was a non-profit organization dedicated to the task. So in May 2003, he left his post as vice president of bioinformatics at Integrated Genomics, and with four colleagues, launched the Fellowship for Interpretation of Genomes (FIG). The grant-funded effort is developing open-source software to support the analysis and annotation of newly sequenced genomes.
FIG calls this computational framework the Seed. The system integrates multiple genomes so that researchers can use data from several organisms to aid their own annotation efforts and share their annotation information via a peer-to-peer network. One of the goals of the Seed is to take a step beyond other comparative genomics tools, which are based largely on sequence homology between genes, to allow comparisons of “subsystems” of functional elements that work together to carry out biological processes.
BioInform recently spoke to Overbeek about this subsystem-based approach to annotation and the computational tools he is developing to support it.
What was your motivation for forming FIG and launching the 1,000-genome annotation project?
I’ve been involved in genome annotation for many years. I started working at Argonne National Laboratory developing computational support for biologists in 1989. Starting around 1994, we began building systems to support annotation of genomes. But as the number of sequenced genomes increased, I realized that it would be more effective to have a clear understanding of subsystems — that there would be a higher level of annotation accuracy with this approach.
The biological community has operated for many years by having experts annotate a single subsystem; that’s been the basis of review articles. I began to believe that there’s a crisis in annotation … In most of the sequencing world, you’re funded to sequence and annotate a single genome. The natural unit of annotation has been the organism: you take a single genome and you annotate it. A team of people skilled at annotating one genome at a time works on it over a period of a few months or longer; they could spend as much as a year or two before it is published.
Now that we’re getting a new genome every day or two, automated tools are critical, but these are structured as pipelines: the sequence data for the organism goes through a set of tools, and automated annotations are gradually built up. In cases where the annotation of existing genomes was done properly, the results will propagate pretty well [to new organisms]. But in cases where there are large numbers of paralogs, you either propagate errors or you have to annotate on a general level, with family assignments and so forth. Human expert annotation based on subsystems is far more precise than this.
What do you mean when you refer to subsystems?
Most people might be more familiar with the term pathway. It’s a set of components that work together to achieve a unified end. So a subsystem may have four or five genes that work together in the process of leucine degradation or glycolysis, for example.
There’s a huge amount of disagreement about what is a subsystem, what is a pathway. I used to worry about defining it, but found that the best way to approach it is to allow the expert to define what the subsystem is. If you have two experts looking at the system, they may both do it differently — it involves matters of judgment and taste. If two experts produce overlapping subsystems, that’s fine — having something annotated multiple times is acceptable; but it’s not useful to have a computational person make the decision about what constitutes a good pathway.
A complex like the ribosome might have 50-60 genes; glycolysis may be treated as a single pathway or broken into smaller components — how it’s done is a matter of taste. We need experts who understand what the set of acceptable variations is for each subsystem. If you rely on computing, you will be limited to one concept of glycolysis, but a true expert will understand the finer points of variation. You can have the same subsystem in different organisms, but the differences may be clear only to an expert. And it’s these variations that frequently get misannotated. … A human expert will look at more clues than an automated system.
Currently, annotation is done by looking at a single gene at a time, using similarity [to genes in other organisms]. But you could take a different perspective by using subsystems; if you take a subsystem with eight components and find six, but there are questionable candidates for the remaining two — the minute you know that six of the eight are recognized already, it increases confidence in the annotation of the two questionable calls. There are about 100 organisms now that are well sequenced and annotated, but we’re going to get about 1,000 more genomes in two or three years, and that leap from about 100 to 1,000 genomes is going to be addressed by subsystem experts. They’ll need computational support.
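The reasoning in the eight-component example can be illustrated with a small sketch. This is not the Seed's scoring method, which the interview does not describe; the `Call` type, the `rescore` function, the 0.8 threshold, and the 0.5 weight are all assumptions chosen only to show how confident calls for most of a subsystem could raise confidence in the remaining questionable ones.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """A candidate gene call for one functional role (hypothetical type)."""
    role: str          # functional role within the subsystem
    similarity: float  # assumed similarity-based confidence in [0, 1]


def rescore(calls, roles, threshold=0.8, weight=0.5):
    """Boost weak calls in proportion to how complete the subsystem already is.

    Returns (completeness, {role: adjusted_confidence}).
    The threshold and weight are illustrative values, not published ones.
    """
    confident = {c.role for c in calls if c.similarity >= threshold}
    completeness = len(confident & set(roles)) / len(roles)
    adjusted = {}
    for c in calls:
        if c.role in confident:
            # Calls that are already confident are left unchanged.
            adjusted[c.role] = c.similarity
        else:
            # Close part of the remaining gap to 1.0, scaled by completeness.
            adjusted[c.role] = c.similarity + (1 - c.similarity) * weight * completeness
    return completeness, adjusted


# Eight roles: six confidently recognized, two questionable candidates.
roles = [f"role_{i}" for i in range(1, 9)]
calls = [Call(r, 0.9) for r in roles[:6]] + [Call(r, 0.4) for r in roles[6:]]
completeness, adjusted = rescore(calls, roles)
```

With six of eight roles confidently called, the two questionable candidates end up with a higher adjusted score than similarity alone would give them, while the confident calls stay as they were. A human expert would of course weigh many more clues than this single number.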
How many of these subsystem experts are there, compared to biologists who focus on, say, model organisms?
You’ll find that most experts in entire organisms are experts in a particular subsystem. They just center much of their work on a single organism so that they can carry out the wet lab work they need to do to study that system.
Can you tell me about the tools you’re developing?
The framework where we can integrate all this genomic data is called the Seed. It was developed by a number of collaborators in an effort led by FIG. We hold meetings periodically, and the next one is scheduled for Germany in July.
In terms of a framework for integration, good systems already exist like KEGG and Swiss-Prot and the resources at NCBI. We are not unique in forming a platform for integrating data. But we think the system we’ve developed is a wonderfully useful framework for comparative analysis. I use it to store 280 genomes on my Mac.
The other part of the software is a peer-to-peer technology because we want people to exchange components. So if I have a friend in San Diego working on vitamin metabolism, we can make our components exchangeable. The system will go over the network and accept his annotations and install them on my system. There’s no central repository of the truth [regarding the correct annotation]. Some people are more cautious, while others are more speculative. My view is that all annotations are speculative — you have an assertion with some probability, but we’re all seeking a consistent model of what’s going on.
The model is that you’ve got hundreds of experts annotating subsystems, and you have other people who either trust [the annotations] or don’t. If they trust them, they can take the annotations. If they don’t, they don’t. Eventually, we believe that there will be “central collections” — not by design, but de facto. The community may care a lot about a certain class of annotations and access them more often, but there’s no blessed central authority.
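The trust model described here, in which each installation decides whose annotations to accept, can be sketched as follows. This is an illustration only and not the Seed's actual exchange protocol or data format; the `Annotation` type, the `merge_from_peer` function, the gene identifiers, and the curator names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One functional assertion about a gene (hypothetical record layout)."""
    gene_id: str
    function: str
    curator: str


def merge_from_peer(local, incoming, trusted):
    """Install a peer's annotations only when their curator is trusted.

    `local` maps gene_id -> Annotation. A new dict is returned, so the
    local collection is never modified in place.
    """
    merged = dict(local)
    for ann in incoming:
        if ann.curator in trusted:
            merged[ann.gene_id] = ann
    return merged


local = {
    "gene_0001": Annotation("gene_0001", "hypothetical protein", "me"),
}
incoming = [
    Annotation("gene_0001", "thiamine biosynthesis protein", "san_diego_expert"),
    Annotation("gene_0002", "speculative transporter", "unknown_curator"),
]
merged = merge_from_peer(local, incoming, trusted={"san_diego_expert"})
```

Because every installation keeps its own set of trusted curators, two sites can hold different annotations for the same gene, which matches the idea that there is no central repository of the truth.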
How does this approach compare to DAS, which is also a P2P annotation system?
They’re closely related, because you can collect annotations from people working on the same genome. DAS supports integration of annotations from distributed sources, while the primary focus of the Seed is on the tools used to develop the annotations. There is an overlap in capabilities in the sense that both technologies support distribution and acquisition of annotations. It would be reasonable, for example, to support an instance of the Seed as a DAS server.
What is available in the Seed now?
Right now, I’m preparing a release. We’ve reached the point where we’re beginning the initial distribution, so we’ve got bug reporting mechanisms in place and so forth. Basically, this whole process hasn’t been funded; it’s largely being done by volunteer labor at this point. There are now about 15 installations of the Seed. Six months ago, we would all get together for a party, and we would take along some disks and simply copy the current version and get it running on different people’s Macintoshes. Then we implemented versions on Linux systems. It now runs under Postgres or MySQL. We generally distribute it in one of two ways: we either send out DVDs or people download it over the network. But it’s fairly large. It’s on the order of 50 GB, so you don’t want to just download from your college or whatever. DVDs will probably be the most common form of distribution.
Essentially we try to put out a new release every three to four months. And we’re not responsible for the annotations in that at all. There are two components: We’re providing raw releases that have the genomic data, and there’s a clearinghouse emerging on the annotations. So people can get their system installed, and then go to the clearinghouse and install as many of the subsystems as they wish. So they tailor their own annotations. This way, we’re not responsible for trying to get good annotations on the initial release. The annotations are coming from numerous sources at this point.
We do have a public server running at the University of Chicago [http://theseed.uchicago.edu/FIG/index.cgi]. There are actually several emerging, but the one at the University of Chicago is probably the one that most people know about. It’s a beta test version. We’re still at the stage where things go wrong, and we fix them. I think that release has new environmental data produced by Craig Venter’s team on it [the Sargasso Sea data]. So we do occasionally get ahead of the game — or, at least, reasonably current — but we’re behind in several other ways.
[The Seed] is hopefully the first of many open source tools, but this is the one that we’re going to base everything on. So it is going to grow into a much larger system.
How many organisms are available in the Seed?
Right now, we’re not distributing the full thing because some of the organisms have restrictions on redistribution. So I believe what you’ll see if you go there is around 220 organisms. We do occasionally give out the larger number that has genomes from Sanger and JGI and so forth, but that’s [only] to people who are using them for their own personal research and agree to the restrictions.
The system is really designed so that people can add their own data to it fairly easily. It struck us early on that most of the people annotating genomes don’t have access to a framework to do the comparative analysis. What happens is once the genome is annotated, it’s made available in systems like KEGG and Swiss-Prot and NCBI. But before it’s released, when the serious annotation work is being done, the user seldom has access to an environment for examining it in the context of other sequenced genomes. Basically what they can do is get similarities to other genes, and that’s what they use to annotate it, but it’s only after the initial annotation is done that the organism becomes available in the context where you can really work on it. So what the Seed offers is a painless way to have access to a full framework for comparative analysis before you release your data to the public archives. You just install your genome on your own version of the Seed, you work on it, and when you’re ready to release the genome, you put it in the archives and it drifts to everyone.
What’s next on your list of short-term goals?
There are numerous things we’re working on. Probably, this year, we’re going to have to focus on two things. One is the interpretation of microarray data, and the other is connecting the central machinery to medical topics like cancer. Most people view the human genome as the core machinery surrounded by regulatory mechanisms. The core machinery is relatively small, the regulatory mechanisms are relatively large, and many medical disorders are in the regulatory mechanisms. So your common medical research would focus on signal transduction cascades in the regulatory mechanism.
We’re moving into comparative analysis within eukaryotic organisms, microarray analysis, and probably SNPs too, in a general context. The tools to support the interpretation of SNPs are still not so good in my opinion. We haven’t done good ones yet, either, so I shouldn’t be saying anything; but these are things we look forward to. And actually, we at FIG are not going to do most of it. Within this collaborative effort we’re finding different people developing different tools, and they will integrate them into the Seed. They like the fact that they can take the Seed and build a new system on top of it. So it seems likely that it will be used as a component in a number of more complex systems.