At A Glance
Name: Chris Taylor
Position: Software engineer, European Bioinformatics Institute, since 2003.
Background: Research associate in bioinformatics, University of Manchester, 2000-2003.
PhD, Population genetics models, University of Manchester, 2000.
How did you become involved in the Proteomics Standards Initiative? Do you have a background in computer science?
Basically I started out as a biologist, and from there I wasn’t sure what to do with my qualifications. I looked around for a PhD and eventually did one in evolutionary biology — it was a theoretical thing involving computing. But that was completely inapplicable to anything at the time, so really it was a bit of a loose end. There was a collaboration between several universities in the UK doing proteomics and transcriptomics work, and they needed a database for proteome data. So I got recruited even though I had almost no computing skills. Biologists are a lot cheaper than computer scientists. This was in Manchester.
So that job pretty much involved building a database, and we got distracted from that main purpose pretty quickly, because when we started modeling the workflow we might have to capture for this database, we realized that what we had was really quite a general model. We became aware that there was this standardization effort by HUPO, so we took our model down to the meeting. Because we had published in March ‘03 in Nature Biotech, people were aware of us as well. The Proteomics Standards Initiative people approached us and said, ‘Would you like to come down?’ — this was at the end of ‘02, at the first PSI meeting we went down to. After that there was another PSI meeting we went down to.
Before working with PSI, your background was pretty much in evolutionary biology?
Yeah, completely. It has no application. Most of the work for the PhD was simulation: basically writing really scrappy, bare-bones, worst-case-scenario C code, just enough to get it to run. So I was never a programmer. Since then I’ve gotten immersed in bits of it, partly through doing the database work for that first job in Manchester — SQL and things like that. What’s been more important is UML, which has been important in modeling and very handy when you want to talk about schemas and things like that with experimentalists — it’s nice to have a brightly colored picture to wave.
What were you doing when you first went to the EBI?
We were still working with the PEDRo model. The plan was not to evolve it into the model we wanted; rather, we used it as a scaffold to build the new model on. PEDRo had the right scope but was the wrong sort of model. In terms of scope, it got all the right bits in there somewhere. In terms of structure and design, it wasn’t exactly what was needed, partly because it had a lot of very specific details in it, and that’s the kind of thing that dates very quickly as proteomics technologies change. What was needed was a much more generic model, and it needed to capture more workflows, things like that.
What we’d do is we’d take PEDRo around to people and say, ‘React to this. This is something approximating a solution and we’d like to hear your response to it.’ And that feedback was fed into the design of the new model.
What did you do after you gathered all that feedback?
It all came to a head at this April meeting earlier this year, where there were several presentations. One thing we did at Nice was get developers from a similar effort for transcriptomics, which uses the MAGE markup language. They’ve put a lot of design into that, and they’re essentially doing a similar sort of thing, because they also have reporting requirements like the ones we’re generating, so they had a very similar suite of things. And they messed loads of it up along the way. So we got a load of those people down to talk to us about the general design process and how not to mess it up. We also had people give presentations about efforts similar to what we were trying to do, which included PEDRo, obviously.
And then we just started breaking up into round tables where we’d get a bit of paper, start scribbling out classes, see what we thought of them, and then argue over names, and by that process a bit of a model started to evolve. A couple of bits persisted from PEDRo that seemed fairly sensible, like the idea of cyclical processing of a sample: you take from a menu of different methods, if you like, and chain them together to produce your workflow. That’s a reasonable, sensible thing, so it persisted into the new model.

Some other bits were new, like the contextualization of the work. For instance, your work is more likely than not going to be part of a larger body of work — part of a run of experiments, maybe spanning different -omics or different parts of the body. So the idea of modeling a project more fully was something we went into in a bit of detail, and also the kinds of statistical analyses you might run. So some of it’s similar, some of it’s quite different, and what’s different started to be generated at this meeting in Nice in April. Since then we’ve basically been putting stuff up on the website as we’ve gone. So this is just specifically around the model.
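The “menu of methods chained into a workflow” idea Taylor describes could be sketched roughly as follows. This is a hypothetical illustration, not the actual PSI or PEDRo schema; all class and method names here are invented.

```python
# Hypothetical sketch of chaining sample-processing steps drawn from a
# "menu" of methods into a workflow, as described in the interview.
# Names (ProcessingStep, SampleWorkflow, then) are illustrative only.

from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessingStep:
    method: str                 # e.g. "tryptic digestion", "2D gel"
    parameters: dict = field(default_factory=dict)


@dataclass
class SampleWorkflow:
    sample_id: str
    steps: List[ProcessingStep] = field(default_factory=list)

    def then(self, method: str, **parameters) -> "SampleWorkflow":
        """Chain another processing step onto this workflow."""
        self.steps.append(ProcessingStep(method, parameters))
        return self


# Chain three methods from the "menu" into one workflow.
wf = (SampleWorkflow("sample-001")
      .then("tryptic digestion", enzyme="trypsin")
      .then("LC separation")
      .then("MS/MS"))

print([s.method for s in wf.steps])
# ['tryptic digestion', 'LC separation', 'MS/MS']
```

The point of the design is that the model does not hard-code any particular sequence of technologies; any chain of steps is a valid workflow.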
There’s more to it. If we go back to PEDRo again, there are really three things there in one. There’s a model, in the sense that you can derive a format for capturing data from it. But embedded in there is also a reporting requirement, in that some of the information it wants is considered compulsory and some is optional. And there’s the beginning of an ontology, in that there’s an awful lot of terms in there, defined and grouped according to the type of term they are.
So we’ve split those three things apart now. The model that we’re generating, and the file format derived from it, will be really quite flexible. To accompany that is the ontology, which is basically an elaborate dictionary of the terms you use in the model. And the reporting requirement is kind of, ‘What’s compulsory, what should you tell me about?’
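The three-way split described here — flexible data records, a controlled vocabulary, and a separate compulsory-field check — can be sketched in miniature. None of the names or terms below come from PSI itself; they are assumptions made for illustration.

```python
# Hypothetical sketch of the split Taylor describes: the data format
# stays flexible (a plain record), the ontology acts as a controlled
# vocabulary, and the reporting requirement is a separate check on
# which fields are compulsory. All names here are invented.

ONTOLOGY = {  # an "elaborate dictionary" of allowed terms per field
    "instrument": {"Q-TOF", "ion trap"},
    "separation": {"2D gel", "liquid chromatography"},
}

REQUIRED_FIELDS = {"instrument", "separation"}  # reporting requirement


def validate(record: dict) -> list:
    """Return problems: missing compulsory fields, or terms that are
    not found in the controlled vocabulary."""
    problems = [f"missing: {k}" for k in sorted(REQUIRED_FIELDS)
                if k not in record]
    for key, value in record.items():
        if key in ONTOLOGY and value not in ONTOLOGY[key]:
            problems.append(f"unknown term for {key}: {value}")
    return problems


print(validate({"instrument": "Q-TOF"}))
# ['missing: separation']
print(validate({"instrument": "Q-TOF", "separation": "2D gel"}))
# []
```

Because the three concerns are separate, the vocabulary and the list of compulsory fields can each evolve without changing the data format itself.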
Are the reporting requirements you’re talking about the MIAPE, or minimum information about a proteomics experiment?
Yeah. That’s really going quite well now. We’ve got a kind of mission statement document up on the website that outlines the general structure and guidelines, and associated with that parent document are a number of modules that deal with specific technologies. There’s one for mass spectrometry, for instance, which is quite mature now. There’s one on the way for gels, and there’ll be another for chromatography and another for sample design.
What else besides MIAPE are you working on for PSI?
There’s the ontology — really it’s there to serve as a well-defined dictionary of ideas. And it’s not just proteomics. There are essentially three main -omes: transcriptomics, metabolomics, and proteomics. Where there’s commonality between those three areas, they should share ontologies, models, and reporting requirements. So we’re involved with the transcriptomics people. It’s a bit harder to pin down who the metabolomics people are, but we’re working to do that. So we’re coordinating now, and this is an ongoing thing: where we can share ontology and model, we’re aiming to do that.
There’s also a parallel effort now to look at those common, high-level things, like descriptions of a project and descriptions of how you generate a sample in the first place. In the context of all these different ways of doing science, communities need to be able to extend the whole ontology and the reporting requirements to customize them for their own needs. So for example, a medic looking at a toxin in the environment might recruit patients to a study, while a toxicologist looking at the same toxin might do different sorts of experiments. We want those two sets of people to be able to share an ontology, model, and reporting requirement where possible, and also to be able to tailor those things to their own needs. That’s all quite high-level stuff, and it’s going on between PSI and the Microarray Gene Expression Data society.
What are you looking to work on in the future?
We’ve certainly got our work cut out to work our way across the reporting requirements, so that for just about every technology you’re going to come across in a proteomics workflow, there are accepted community guidelines for reporting. We’ve got the MS guidelines to the point where they’ll go off to an expert committee, and when the committee signs off on them, we’ll consider them pretty much done. We’ll work through gels next and then, as I say, work our way along until there are guidelines for all the different technologies in proteomics. That’s potentially the most far-reaching effort, because you can apply these guidelines in all sorts of ways — they’re independent of any technology for capturing the data. The modeling, we hope, will be more or less done by the summer, and the ontology should be done around then as well. Then we’ll move into supporting and promoting what we’ve done, looking at these wider collaborations, and moving forward on the idea of a joint model between the different -omics.