Last summer, the American Academy of Microbiology gathered a group of microbiologists, biochemists, and bioinformaticians in Washington to discuss the challenges of prokaryotic genome annotation, specifically the fact that thousands of sequences have yet to be assigned a biological function. Last month, the AAM followed up the meeting with a report (An Experimental Approach to Genome Annotation, available at http://www.asm.org/Academy/index.asp?bid=32664) calling for a “central data resource” that would catalog those sequences that have yet to be annotated, along with functional predictions and supporting experimental evidence. BioInform caught up with Peter Karp of SRI International, who served on the meeting’s steering committee, to discuss the bioinformatics aspects of the AAM’s proposal.
Can you provide some background on how the AAM meeting came about last summer?
What happened is that Rich Roberts [of New England BioLabs] and I were independently pursuing two different ends of the same problem. The problem is that there are lots and lots of genes in sequenced genomes that we don’t know what the heck they do. We don’t know what their functions are, because typically when you use sequence analysis techniques to predict the functions of genes in newly sequenced genomes, you’re able to predict functions for only about half of them.
So there are two ways to look at this problem. One is to say, well, there are all these sequences and we don’t know what they do, and let’s use bioinformatics techniques to try and figure out what they are and have experimentalists test those conjectures. And that’s the angle that Rich Roberts came at the problem from.
And I came at the problem from a different angle. When my group does these metabolic pathway predictions for organisms, there are often holes in the metabolic pathways, meaning pathway steps for which there is no enzyme identified in the genome. Well, the natural thing for us to do is to try and find other sequences for that same function from other genomes, and BLAST them against this genome. But what if you don’t find any sequences for those functions?
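The pathway-hole idea described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the step names, gene identifiers, and the `find_pathway_holes` helper are hypothetical and are not taken from SRI's software. A hole is simply a pathway step that no gene in the genome is annotated to carry out.

```python
def find_pathway_holes(pathway_steps, genome_annotations):
    """Return the pathway steps that no annotated gene in the genome covers.

    pathway_steps: list of function names that make up one pathway.
    genome_annotations: dict mapping gene ID -> function name, or None
    when the gene's function is unknown (hypothetical toy format).
    """
    # Collect every function that at least one gene is annotated with.
    annotated = {func for func in genome_annotations.values() if func is not None}
    # A hole is a step with no matching annotation.
    return [step for step in pathway_steps if step not in annotated]


# Hypothetical toy data: three pathway steps, three genes.
pathway_steps = ["step_A", "step_B", "step_C"]
genome_annotations = {
    "gene_001": "step_A",
    "gene_002": "step_C",
    "gene_003": None,  # open reading frame of unknown function
}

holes = find_pathway_holes(pathway_steps, genome_annotations)
print(holes)
```

In this toy case the unannotated `gene_003` is a natural candidate for the missing step, which is the kind of conjecture an experimentalist would then test.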
So that’s the converse problem, which is that there are actually many molecular functions that biologists have identified over the years for which there is no known sequence. And no doubt, many of those sequences are in these lists of unknown open reading frames in the genomes.
So Rich and I came at this from the perspective that there are really two ways to approach this problem. You can start with the sequence and try to figure out what its function is, or you can start with the function and try to figure out which sequence goes with it.
I think the other reason the workshop was timely is that there’s been this very hot new area in bioinformatics that goes by a variety of names such as genome-context methods, which is a whole new class of techniques that’s been developed in the last few years for trying to figure out the function of a sequence whose function is not known. These methods work by trying to find what people call functional associations, which are based on the hypothesis that two proteins have similar functions or work together in the cell somehow. It’s a useful concept, because if you can find that there’s a functional association between a sequence whose function you don’t know and a sequence whose function you do know, that can give you a lot of clues for tracking down the function of the first one.
So what the report really says is this is an important problem — to figure out the functions of all these sequences — and there are two ways to go about the problem, and let’s have a large effort organized to tackle it. And it’s a place where experimentalists and bioinformaticians can work together very productively.
How does this proposal compare to some other genome annotation initiatives, ... like Ross Overbeek’s SEED effort for prokaryotes?
Ross is one of the people developing genome-context methods. So he’d be one of many people I think who would want to contribute to this effort from the bioinformatics side, but I think this report tries to focus more on the experimental side because I think the bioinformatics work is going to happen anyway. I think what people are concerned about is that there’s not an organized, systematic push by experimentalists on this issue.
And I think that some of the goals of the workshop were to ask what kind of database resource could we create to help pool these bioinformatics predictions to make it easy for experimentalists to find predictions to test that were within their areas of expertise.
And we also talked a lot about incentives for the experimentalists.
The report made some recommendations, but it didn’t lay out a roadmap for implementing any particular steps. Has anything happened along those lines yet?
Not that I’ve heard. I think this is the first step. My understanding of the way the government likes to work is they like to get people together to talk through an idea, and explore it, and see if it makes sense, and then issue a report that summarizes what they think. The ball is really now in the government’s court, with the people in the funding agencies.
Do you envision any particular agency sponsoring a project like this?
I would hope it would be an interagency effort, because I think the missions of many agencies would benefit from this. It’s not obvious to me that it’s only within the scope or principally within the scope of any one agency. So I’d like to see them work together on it.
Tell me how the Enzyme Genomics Initiative that you proposed previously fits into this broader project.
I wouldn’t say that the Enzyme Genomics Initiative is something that we at SRI are doing — it’s something we’re more calling for. It pertains to the half of this initiative coming from the functions-without-a-sequence side, and also focusing on enzymes as opposed to other proteins. It’s the initiative that my position paper in Genome Biology called for [2004, 5:401], which is let’s find at least one sequence to go with all these enzymes that lack sequence.
And something that we’ve started to do at SRI, which the website that’s quoted in the report summarizes [http://bioinformatics.ai.sri.com/enzyme-genomics/], is we’ve been doing some research to find out whether, if we manually hunt around in the literature, we can find sequences for some of these enzymes. That is, how much of the problem is really just due to missing annotations in the databases? We think that maybe six hours in the library can save people six months in the laboratory, and for about 20 percent of the enzymes we are able to find sequences.
So we’re doing a pilot project to look at a fraction of the enzymes lacking sequences, and I think what we found is that the problem is real, that we can find sequences for some of them, but the initiative is still needed. But it does make sense to do some literature research first before turning the experimentalists loose on it.
What kind of infrastructure do you envision for making this happen? Would there be a centralized database that everybody would deposit information in, and also use to see what types of experiments need to be done?
Right. And then deposit the results of those experiments. So once people validate a prediction, or if they get a negative result, both outcomes should go in the database.
Would this be structured along the lines of GenBank, or would it require something more complicated?
I think it would be quite different from GenBank. It wouldn’t even need to contain the sequences, I think. It could just link to GenBank or UniProt, for example. It’s more about recording the predictions that the bioinformaticians deposit: things like their level of certainty in the predictions, who made the predictions, and with what method, and also, for experimentalists, what results they got.
One of the other neat things about having this database is that it could be used to help validate and refine the methods. So you could use the information deposited in the database to figure out which methods are the most accurate, what examples do they fail on, what examples do they succeed on, and that’s usually the kind of information you need to further improve your methods. So it would be real helpful for the bioinformaticians to get that feedback from the experimentalists.
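To make the kind of record and feedback loop described above concrete, here is a minimal Python sketch. It is not a design from the report: the `PredictionRecord` fields, the `accuracy_by_method` helper, and all example values are hypothetical, chosen only to mirror the items mentioned in the interview (a link to the sequence, the predicted function, who predicted it, by what method, with what certainty, and the experimental outcome, positive or negative).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class PredictionRecord:
    """One deposited prediction (hypothetical schema, for illustration only)."""
    sequence_ref: str        # link to an external entry, e.g. a GenBank or UniProt ID
    predicted_function: str  # the conjectured function
    method: str              # prediction method used
    predicted_by: str        # who deposited the prediction
    confidence: float        # depositor's stated certainty, from 0 to 1
    # True = validated, False = negative result, None = not yet tested
    experimental_result: Optional[bool] = None


def accuracy_by_method(records):
    """Fraction of experimentally tested predictions confirmed, per method."""
    tested = defaultdict(int)
    confirmed = defaultdict(int)
    for record in records:
        if record.experimental_result is None:
            continue  # untested predictions say nothing about accuracy yet
        tested[record.method] += 1
        if record.experimental_result:
            confirmed[record.method] += 1
    return {method: confirmed[method] / tested[method] for method in tested}


# Hypothetical example records; IDs, names, and methods are placeholders.
records = [
    PredictionRecord("seq_1", "function_1", "method_X", "lab_1", 0.9, True),
    PredictionRecord("seq_2", "function_2", "method_X", "lab_1", 0.6, False),
    PredictionRecord("seq_3", "function_3", "method_Y", "lab_2", 0.8, True),
    PredictionRecord("seq_4", "function_4", "method_Y", "lab_2", 0.7),  # untested
]

scores = accuracy_by_method(records)
print(scores)
```

Storing negative results alongside confirmations is what makes the per-method tally meaningful, since a database holding only successes could not show where a method fails.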
What do you consider to be the biggest hurdles to making this happen?
I think the two biggest hurdles are funding and getting experimentalists to kind of rally behind the idea. I think a lot of them are used to thinking, well this specific problem is interesting to me, this enzyme is interesting to me for whatever reason. But we now have this global map of the unknowns that really needs to be filled in, and that I think is a kind of prioritization scheme that a lot of the experimentalists aren’t used to working within.
Has anyone proposed a structure that would incentivize the experimentalists?
Funding is one. If people see that there’s money in this area, that’s going to shift some people. I think the notion that students can do a lot of this work might mean that it’s not a big detour for some groups, ... And I think a lot of people will see how this initiative really will impact the big picture of our understanding of biology.
Looking down the road, what would you like to see as a potential roadmap for getting things rolling, assuming that funding is available?
I hadn’t thought about it quantitatively, but I think you could put a modest amount of money into it initially, get some experience with it, find out what works and what doesn’t work, and then start to ramp it up. Maybe in the first three years, we could see tens of experimentalists funded to work on this kind of thing, and start to find out what works, and how to scale it up. And then another three-year period after that, it could really be ramped up significantly.