Chances are, if you have anything to do with bioinformatics, you've heard Ewan Birney give a talk on Ensembl, the genome browser and annotation pipeline that his research group at the European Bioinformatics Institute supports along with Tim Hubbard's team at the Sanger Institute.
Birney's seemingly tireless efforts on behalf of Ensembl — and as an outspoken advocate for other open source bioinformatics projects — were recognized last week when he received the Benjamin Franklin Award from the Bioinformatics Organization, which is presented annually to members of the bioinformatics community who have promoted open access in the field.
Birney's work on Ensembl has earned him further recognition this year from the International Society for Computational Biology, which selected him as the 2005 recipient of the Overton Prize, to be awarded this summer at ISMB in Detroit.
But while Ensembl may be the most high-profile effort underway in Birney's group of 25 researchers, it's far from the only one. BioInform spoke to Birney after the award ceremony at the Bio-IT World conference last week to find out a bit more about some other projects in his lab that don't get as much air time. The following is a transcript of the interview, edited for length.
You and Lincoln Stein [from Cold Spring Harbor Lab] launched a new pathway database last year called Reactome. Can you provide an update on the project?
It's very simple. We know a lot, because of great work over the last 30 years, about how things are put together, but we don't have that knowledge in a computer-accessible form in an open way. This is both in metabolism, which is what most people think about when they think of pathway databases, but more importantly it's about signaling pathways, and other things.
There [are] lots of people in this area, but there's not a strong culture of openness, and I think we're bringing some of that, which is good. And the other thing is that a lot of people [creating pathway databases] focus on pretty pictures. There's nothing wrong with pretty pictures, but computers can't use them. So it's not good enough to just have these nice diagrams if you don't actually yet have a structure that represents that diagram correctly. I think a lot of people don't appreciate the difference of getting the data in right so you can use it sensibly computationally. Reactome is one of the few databases that is doing that.
We're really open to collaborate with a number of the other pathway databases, in particular Peter Karp's Cycs, and Paul Thomas's Panther pathways from Celera. We're trying to solve the same problem. There's a lot of space, and there are many, many things to share here, so we're hoping to do that.
We've just started thinking about using Reactome. It's starting to pick up, and I'm starting to hear of a lot more people using it. But it's just started.
There are so many different pathway databases now for people to choose from — STKE, BioCyc, even BIND and some of the interaction databases are considered pathway databases.
Absolutely. But if you put us in that space, we're not high-throughput. We're capturing what's believed to be real knowledge. We need to get the visualization right, but fundamentally we want to capture it in a way that you can think about computing with.
So you could take a set of gene expression data, say, and map it to the information in there?
Exactly. That's important. Just to give an example, there's a protein called IRS, and it's IRS1, IRS2, IRS3, and IRS4. When they draw that diagram, most people don't like putting four blocks there, because they all do slightly different things. So they either call it IRS or they call it IRS 1-4. But of course they're four genes, four places on the genome, four different chip spots. But the mundane business here is that if you have a picture with IRS, now I can't link that to my gene expression data. So you have this horrible name-munging business.
In Reactome, we explicitly force everything to be tied down to clear-cut identifiers. So we use UniProt accessions as our definitions of proteins. If it's not in UniProt, it doesn't get in. We don't come up with new definitions for molecular function, we're using GO molecular function; we don't come up with new definitions for cellular compartment, we're using GO's. We're participating in the ChEBI project at EBI for small molecules, so we don't use our own definition of small molecules, we're using the ChEBI definition. So in all these cases, we're building the graph of how it's put together, and not the parts list.
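[Editor's illustration: the identifier point above can be sketched in a few lines of Python. This is a minimal sketch, not Reactome's actual data model. The class name, the function name, the accession strings, and the expression values are all hypothetical placeholders; real UniProt accessions would go where the placeholders are.]

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class PathwayNode:
    """A diagram node that lists its member proteins explicitly by accession."""
    label: str
    accessions: Tuple[str, ...]


def expression_for_node(node: PathwayNode, expression: Dict[str, float]) -> Dict[str, float]:
    """Return the expression values for every member protein that was measured."""
    return {acc: expression[acc] for acc in node.accessions if acc in expression}


# Placeholder identifiers, NOT real UniProt accessions.
irs = PathwayNode("IRS", ("ACC_IRS1", "ACC_IRS2", "ACC_IRS3", "ACC_IRS4"))

# Hypothetical expression measurements keyed by the same identifiers.
expression = {"ACC_IRS1": 2.4, "ACC_IRS2": 0.7, "ACC_IRS4": 1.1, "ACC_OTHER": 5.0}

result = expression_for_node(irs, expression)
print(result)
```

[Because the node carries four explicit identifiers rather than the single label "IRS", the lookup is a plain dictionary join and no name-munging step is needed.]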
GO is building the vocabulary to describe these parts, and ChEBI is building the set of small molecules, UniProt is the set of proteins, but [we're building] the graph. And we're not doing it in a high-throughput way, we're getting an expert, and they get paired up with one of our curators, and they take a whole area, like DNA replication, and we get the full understanding of this in.
There are lots of details that we're very obsessed about, which I think helps this. For example, we strictly separate out the species. So if it's a human reaction, they have to be all human proteins. If it's a yeast reaction, they have to be all yeast proteins. … The ability to separate out species is quite important, because a lot of pathway groups kind of have a pan-species view of pathways, and then each species takes a subset of those pathways. And we're not doing that. We can very much have two very separate pathways that partially inter-relate in different species and be different. So for the first time, I think quite genuinely we can look at the evolution of pathways in terms of not the evolution of the proteins in the pathway, but the genuine evolution of how the connectivity changes.
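[Editor's illustration: the species rule described above amounts to a simple consistency check. The sketch below assumes a hypothetical record layout; the class names, field names, and accession strings are invented for illustration and are not taken from Reactome's schema.]

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Protein:
    accession: str  # placeholder identifier, not a real accession
    species: str


@dataclass
class Reaction:
    name: str
    species: str
    participants: List[Protein]


def off_species_participants(reaction: Reaction) -> List[Protein]:
    """Return the participants whose species differs from the reaction's species."""
    return [p for p in reaction.participants if p.species != reaction.species]


human_only = Reaction(
    "example human reaction",
    "Homo sapiens",
    [Protein("ACC_A", "Homo sapiens"), Protein("ACC_B", "Homo sapiens")],
)
mixed = Reaction(
    "example mixed reaction",
    "Homo sapiens",
    [Protein("ACC_A", "Homo sapiens"), Protein("ACC_Y", "Saccharomyces cerevisiae")],
)

print(off_species_participants(human_only))  # empty list: the reaction passes
print(off_species_participants(mixed))       # lists the yeast protein
```

[A curation tool built along these lines would reject the second reaction, which is what keeps the human and yeast pathways separate enough to compare their connectivity later.]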
And that is also very exciting, and it looks like we're going to be working with a bunch of people. We have a very human focus, so at the moment we can't really do this because we don't have another species to compare to, but because Reactome is open source, we're encouraging many other groups to use it, and it looks clear that the Drosophila group will also use it for curating their signaling pathways, and that's very exciting. Because although there's a lot of commonality between Drosophila and human, there's also an awful lot of differences, and I think we overstress the commonality in some ways, and we also need to investigate the differences.
And it looks like some bacterial groups will also be using Reactome to put in some bacterial pathways.
So, this is me and Lincoln [Stein], so we encourage people to pick it up, and I think we're starting to feel that we're building a community outside of just ourselves. So I think Reactome is at a cuspy phase right now.
So when you say people can pick up the software, is this along the lines of the GMOD project, which is building tools for people to create their own model organism databases?
That's right. But one thing about pathway data is that, sadly, there's no ABI machine where you can just chop this up and pour it in and it comes out. So this is about capturing true understanding of these things, so it's yanking information out of people's heads. And that is the hard part. The software is actually much easier. It's the process of getting things into the database that's hard.
So we have a lot of focus on data population. Our data model, we're adapting it to be better in a whole bunch of different areas, but, fundamentally, that's not the hard part of this project.
So when we give it out to other people, it's also telling them how to populate data, and, also, to be brutally honest, managing their expectations: There's a piece of software, and it will work, but it doesn't do it for you. It can help you, but it doesn't do it for you.
I think that in a year or two, we'll be building more things around Reactome. In this area, I want to be working closer and closer with experimental groups. I think for this kind of thing to be successful, there's got to be a lot more engagement of high-end bioinformaticians with high-end experimentalists. We can't be separated. With the genome sequence, it was easier to be separated because there was a very clear product that people wanted from us. So you could argue about the details, but not the broad goals. When you start dealing with pathways and networks and all these things, your intuition is less good at leading you. So we've got a much richer set of experimentalists that we're working with.
You mentioned recently that you're also focusing on other research outside of these database projects.
A lot of what happens in those groups blends into research. In Reactome, working with other experimental groups really is research. It's not engineering. But then I also do have outright research groups who are looking into new aspects. That's very exciting. It's my students and some algorithm development as well. Guy Slater in my group is a very talented postdoc, and it's taken him three years to make this new sequence-matching algorithm called Exonerate, which was recently published. We've been using it in house at Ensembl for a long time, about two and a half years. But Guy didn't want to make it a 1.0 release until it all worked.
So it really is the next step forward in sequence matching. There are these model-based algorithms like GeneWise that match things in a sophisticated way, but go very, very slowly, and there are a whole bunch of things that go very fast, like Blast and Blat. And Exonerate is really a fusion of these two things — a formal way of applying heuristics to any complicated sequence-matching model. So the breakthrough here was Guy working out how to formalize the process of applying heuristics to come up with an entirely new heuristic framework called bounded sparse dynamic programming. Basically it puts my old program GeneWise into the retirement home. It's about a thousand-fold faster.
Also in my group, we're looking at evolution. One of [the projects] is multiple alignments of genomes. And we're doing this in a very formally probabilistic way, which we think has a lot of benefits, especially in the fine details of the base pairs being aligned correctly.
And at the same time, we're looking at very close evolution between species as well, and asking some pretty basic questions, actually, that need to be chased down and answered. One of these is, people use this concept of neutral DNA. There's a definition of these four-fold degenerate sites in coding sequence: [in] the glycine codon, [for example,] the last nucleotide can be anything. So when you're doing short evolutionary distance[s], there ends up not being enough of those sites, so you end up using introns. But people haven't really asked the very simple question of, "Is it really the case that these things evolved in the same way? What are the different types of constraints?" Most people would happily say this is neutral DNA, so you predict that they're evolving in the same sort of way, but of course that's not in fact true. In lots of cases, it's pretty much in the same ballpark, but it's not nailed. And there are lots of other things that are going on in this. So, we have another student looking at that aspect, just so we can know what we're doing as we move between different measures of evolution.
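[Editor's illustration: four-fold degenerate sites can be found mechanically from the standard genetic code. The sketch below is not the group's method. It assumes the standard code (NCBI translation table 1) and an in-frame coding sequence, and the function and variable names are our own.]

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# A two-base prefix is four-fold degenerate when all four choices of
# third base encode the same amino acid (for example GG -> glycine).
FOURFOLD_PREFIXES = {
    b1 + b2
    for b1 in BASES
    for b2 in BASES
    if len({CODON_TABLE[b1 + b2 + b3] for b3 in BASES}) == 1
}


def fourfold_sites(cds: str) -> list:
    """Return zero-based positions of four-fold degenerate third bases.

    Assumes `cds` starts in frame; trailing partial codons are ignored.
    """
    cds = cds.upper()
    sites = []
    for start in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[start:start + 3]
        if codon[:2] in FOURFOLD_PREFIXES and codon[2] in BASES:
            sites.append(start + 2)
    return sites


print(sorted(FOURFOLD_PREFIXES))
print(fourfold_sites("ATGGGTTTTGCC"))  # codons: ATG GGT TTT GCC
```

[In the short example only the glycine codon GGT and the alanine codon GCC contribute sites, which shows how quickly such sites run out over a short stretch of coding sequence.]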
It's a bit technical, but the short answer is that understanding very short-scale evolutionary events requires developing new techniques, and we're doing the investigative work at the moment, and I hope to switch that into more discovery-style work.
EBI has funding for a new building now, so what does this mean for your group?
Part of my group has gone out into the portacabins, so I'm delighted that it's going to be able to get back together. We're growing. EBI has always participated in joint projects, we build community standards, so many people are using things that are grounded in the EBI. GO is a good example; MGED, the MIAME standards, were really led by Alvis [Brazma]; all these different things.
So I think another thing that's going on is that it's not right to plaster EBI all over MGED or all over GO, because they are genuine community projects. And yet, we perhaps lose some of the recognition that goes with that. Another good example of that is UniProt.
I think each of these projects has a lot of name recognition, but it's the organizational level that we don't perhaps play up. There's just this incredibly strong, open standards-based bioinformatics coming out of the EBI, and you can see those fruits migrating through huge amounts of bioinformatics, so we need to be a bit prouder of that.
So it's great that we're going to get a new building. Space has become a big issue. But another thing at EBI is that we're well-funded now, but that's short-term funding. And we feel that for resources where you know there's a long-term need for them, and there's agreement that one has a long-term requirement for them, it's kind of crazy to ask us to pretend that it's all being reinvented every five years or every three years.
That seems to be an ongoing problem in this field.
I can appreciate it from the funding agency's position. We can't just ask for a blank check to be sent every year, so there has to be oversight, and we're happy to have firm oversight on this. But in Europe, the funding agencies all agree that this is an important problem — and that they shouldn't solve it.
At the moment, there are some pretty productive discussions with the EU. Currently the EBI stuff can't really go into something they call scientific infrastructure — polar exploring ships, and these sorts of things. That's what that [category] was originally designed for. So I think there's a desire to see that work out on all sides. Quite genuinely I do believe that there are people in the EU who want to make places like the EBI be able to apply for infrastructure funding. When you say it, it seems blindingly obvious, but of course there are many details, and so that's probably the next principal challenge for the EBI — how to change this success into a long-lived success.
But getting the new building is a sign of people recognizing that EBI has an important role to play.