Senior Software Engineer
European Bioinformatics Institute
The European Bioinformatics Institute has been an instrumental player in developing data standards in the fields of gene expression analysis and proteomics. The institute spearheaded the MAGE (microarray gene expression) consortium that developed the MIAME (minimum information about a microarray experiment) reporting standard and the MAGE-ML and MAGE-OM formats for gene expression analysis. More recently, EBI researchers have played a leading role in developing several proteomics standards for the Human Proteome Organization's Proteomics Standards Initiative.
Now, the EBI has thrown its hat in the metabolomics ring in an effort to establish a common set of data standards and formats in this quickly emerging "omics" discipline. The institute is co-hosting the so-called MetaboMeeting July 18-19 in Cambridge, UK, to discuss current and emerging standards in the field.
BioCommerce Week's sister publication BioInform spoke to the EBI's Chris Taylor this week to get a better idea of the motivation behind the meeting, and some of its specific goals.
What was the motivation for the upcoming MetaboMeeting? Can you discuss the current status of data standards for metabolomics?
I'm actually primarily involved with proteomics standards. The work I get paid to do is to play a part in a large collaboration to generate standards for proteomics — XML formats for the transport of mass spec data, gel electrophoresis data, things like that. We also are part of a large collaboration to work on an ontology whose scope is all of what's now being called functional genomics. We could say this is sort of the premier league omics if you like. There's God knows how many omics knocking about, but proteomics, transcriptomics, and metabolomics would seem to be the three major ones in terms of general biology.
The transcriptomics people got their act together quite a while ago now with MIAME for reporting, and the MAGE format for capturing data, and they had an ontology to support the use of the format in fulfilling the reporting requirements, so they now have this suite of things that work very well together. Now we essentially sought to copy that approach for proteomics, and it's gone quite well, we think.
So partly because several of the transcriptomics and the proteomics people are all based at the EBI, and partly because we all see each other at the same sort of meetings all throughout the world, and mostly because we have a clear shared interest in coordinating these data sets (because obviously the goal is the biology and not the technology), we very quickly moved to collaborate with the transcriptomics people from the proteomics point of view, where we thought there were commonalities.
It's nice to have this lever point where you can say, 'Look, there's a much bigger picture here.' So you can go to the metabolomics community and say obviously there are strong arguments for you to share data amongst yourselves, and we know that they subscribe to those arguments fully. But there is also this much larger picture, and it's wise to play a meaningful role in the design that will take place in this larger collaboration: where can we have common ways to capture data, where can you use a common ontology and common reporting requirements? We really want these people to be contributing from the outset, because if there are different approaches that we need to be aware of, then it would be nice to be aware of them.
So there's an ongoing effort to develop a very generic model from which we could derive formats and repositories, things like that to underwrite any kind of technological data capture, which is something called the FUGE model, for the Functional Genomics Experiment model.
Now, it's very generic, and it's very much a sort of techie thing, and God forbid any biologist ever came across this in its raw form, because they'd run a mile. But what that will allow you to do is derive a series of much more specific-looking formats for all sorts of applications. So this could be for data captured from mass spec for metabolomics or proteomics or captured from array transcriptomics data.
So, to move to the specifics, I'm, as I said, stepping out of my immediate domain by straying into metabolomics, but it seemed that there was no single, unified effort in the UK, although there were a number of collaborative projects in various areas. So we had a meeting last March where we invited some people to EBI. … So they came and said, 'Yes, we think there is scope for standards to support data sharing.'
Do any standards already exist for metabolomics data?
For the reporting requirements, there are a couple of [ongoing efforts]. MiaMet is one that I'm aware of, but there's also an effort by the SMRS group, which is the Standard Metabolic Reporting Structures group, and that's a large number of pharmaceutical firms and a smaller number of academic institutes, primarily from the UK. What they did in the first instance was to think about what's the important information — regardless of formats or ontologies or any other considerations — even if you were reporting with pencil and paper, what is the information you should cover?
NIH is also having a meeting in early August, and the Metabolomics Society just had a meeting in Japan. So there were these three meetings very close together, of which we're in the middle. So we're trying to do this joining-up thing early on as well.
But the reporting requirement project is underway now, and the paper will come out, people will respond to the paper, and over time — as it always is with these things — the community will arrive at some agreement. And we would then hope that the journals and funders would follow up on that, and would enforce that.
So the reporting requirements thing is a work in progress, but the other two parts of it — the generation of formats and the generation of ontology terms — are only in the early stages, with the exception of the SMRS document, which Nature Biotech just published.
What is being done in those areas?
There's some work [that has] been done. For instance, I would argue strongly that the work we did in proteomics for the capture of mass spec data and the analysis of mass spec data — one format that we have is something called mzData, and that's been implemented by several vendors — Bruker, Thermo, ABI have already implemented it and some others like Waters are in the middle of implementing it.
However, the format is really quite generic. It basically says that a mass spectrometer has a start, a middle, and an end, somebody ran it on a particular date, and it generated this big list of numbers. And you can annotate those numbers. So we saw no reason why that wouldn't be useful for metabolomics.
Now, I will promote that as being one possible part of a suite of formats. Another group from Cambridge is in the middle of producing a paper, but this is a schema that captures a large amount of data about NMR experimentation. But there's no available standard at the moment for that.
There are some other efforts. For instance, there's a project called ArMet, which is a kind of top-down overview of a metabolomics workflow that can again capture quite specific data resulting from these various instruments. But again, that's a continual development type of thing.
So in terms of the state of the art, that's about as good as it gets. There are some formats, but with no real implementation, some of them are developed but it's up for discussion whether they're directly applicable, such as the stuff we're producing at PSI, and some of them are in kind of live situations capturing data, but parts of the schema are evolving.
And the thing about something like ArMet is that it was developed with a very specific use in mind, which was to support data capture under a particular publicly funded project in the UK, which is something called the MET-RO project. They had a need to do this informatics provision, and this is what they provided.
So whether or not that's sufficiently generalizable to become a general standard, and whether or not, for instance, NIH would want to recommend that, and whether other sources of funding in the US would want to promote that is another issue. Certainly, with the way we did the proteomics stuff, from the start we found just about every single stakeholder we could think of and made sure they were at least aware of what was going on, and if they didn't want to participate then that was fine.
But often the political side of this is more important. At the end of the day, most formats that you could come up with are essentially equivalent — there might be a little bit more effort to implement them, or they might last a little bit longer or require less change over the long term, but really you can cope with these little differences and difficulties in the implementation. What you can't cope with is if somebody felt excluded from the process, and therefore is minded not to adopt.
So it takes a large-scale public open effort to recruit everybody at the start and make sure that this is a properly inclusive process. So certainly I think that needs to be the way that we will progress.
What are the goals for the upcoming meeting?
The aim of this meeting is first of all networking. It is important that the right people are all talking to each other — or that they're at least aware of each other's existence and the kinds of things that are going on so that they can follow up on any interest they thought they might have had, or not.
So in the first instance, this is just about getting everyone together, cracking a few bottles of wine, having a meal, presenting to each other and stuff like that. What we would hope would come out of it is that people comment on some of the existing efforts — that would be a first desirable goal. Another would be that we can draw up in discussion a couple of lists. The first one would be: where are the data coming from? First of all, almost trivially, what are the instruments that are generating the data? Less trivially, what are the sorts of biology that are driving the use of those instruments? The second list would be what are the uses to which the data is put — what kind of analyses are run over it, why are you doing the work, what sort of conclusions are you trying to examine, to potentially support?
And when you know where the data is coming from and the uses to which it's going to be put, then you can start thinking about the engineering that supports that — formats and ontologies and things like that. So we would then hope as a final goal to plug this particular community into a broader effort, which is this kind of multi-omics effort.
Mass spectrometers are used for both proteomics and metabolomics, so what differences will be required between the PSI standards for mass specs and metabolomics standards for mass specs?
One of the differences is that a lot more gas chromatography is done for metabolomics, and it's all liquid chromatography for proteomics. But this is before you even get to the mass spec, so that's a job for a different format. When we're talking about the specific use of a mass spectrometer, we've thought about this for quite a while now, and we actually can't see a difference. You have an instrument that you can describe because the format really puts no restrictions on the way in which you describe that instrument, other than that it has a source of ions, it has some mass analyzers, and it has a detector. Then you just capture the spectra, and that can be anywhere from the profile data that comes straight off the instrument through to heavily analyzed peak lists. Now, some analysis of those peak lists is proteomics-specific, but a lot of it isn't, so again, there's no real prejudice toward proteomics there. The only other thing it lets you do is annotate the peaks, but again, it places no restrictions on what that annotation could be or how it might be structured.
Obviously, NMR is a different kettle of fish, and it's something that is primarily a metabolomics tool as far as I'm aware. And there are other techniques as well: crystallography is mostly a proteome-based thing. There are different techniques for different places. There's also commonality between transcriptomics and proteomics in the idea of arraying. So whether you have these DNA arrays for transcriptomics or protein arrays for proteomics, in terms of the generic description of an array, there's commonality between those two omics, and we suspect that there are several commonalities between proteomics and metabolomics. I'm sure there are others.
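[Editor's illustration] The structure Taylor describes can be sketched in a few lines of Python. This is not the mzData schema itself, which is an XML format; the class and field names used here (Instrument, Spectrum, MassSpecRun, annotate) are hypothetical and chosen only to show how little such a description needs to assume about the field it serves.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Instrument:
    """The only fixed parts: an ion source, some mass analyzers, a detector."""
    ion_source: str
    mass_analyzers: List[str]
    detector: str


@dataclass
class Spectrum:
    """A list of (m/z, intensity) pairs, either raw profile data or a peak list."""
    peaks: List[Tuple[float, float]]
    data_kind: str = "peak list"  # hypothetical label; could also be "profile"
    # Free-form annotations keyed by peak index; no vocabulary is imposed.
    annotations: Dict[int, Dict[str, str]] = field(default_factory=dict)

    def annotate(self, peak_index: int, **notes: str) -> None:
        """Attach arbitrary key/value notes to one peak."""
        if not 0 <= peak_index < len(self.peaks):
            raise IndexError(f"no peak at index {peak_index}")
        self.annotations.setdefault(peak_index, {}).update(notes)


@dataclass
class MassSpecRun:
    """One run: who ran which instrument on what date, and what came out."""
    instrument: Instrument
    run_date: str
    operator: str
    spectra: List[Spectrum] = field(default_factory=list)


# The same structure serves either field; only the annotations differ.
run = MassSpecRun(
    instrument=Instrument("electrospray", ["quadrupole", "time-of-flight"], "microchannel plate"),
    run_date="2005-07-18",
    operator="example operator",
)
spectrum = Spectrum(peaks=[(180.06, 1200.0), (342.12, 450.0)])
spectrum.annotate(0, metabolite="hypothetical metabolite assignment")
run.spectra.append(spectrum)
```

A proteomics user could call annotate with a peptide assignment instead, without any change to the classes, which is the sense in which the description carries no prejudice toward either field.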
But in terms of general use of lab equipment, this is where something like the FUGE model can come in and can act as, first of all, the kind of foundation for more specific formats like mzData or any NMR format that might come along. In the software engineering sense, these specific formats inherit from, or are subclasses of, these generic schemata.
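[Editor's illustration] The inheritance relationship mentioned here can be shown with a short Python sketch. It does not reproduce the FUGE model or any real format; GenericExperimentRecord, MassSpecRecord and NMRRecord are hypothetical names, and the fields are assumptions made only to show a generic parent with technology-specific subclasses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GenericExperimentRecord:
    """Hypothetical technology-neutral record: a protocol, equipment, parameters."""
    protocol: str
    equipment: List[str]
    parameters: Dict[str, str] = field(default_factory=dict)

    def summary(self) -> str:
        # Works for every subclass because it relies only on generic fields.
        return f"{type(self).__name__}: {self.protocol} using {', '.join(self.equipment)}"


@dataclass
class MassSpecRecord(GenericExperimentRecord):
    """Adds one mass-spec-specific field on top of the generic ones."""
    spectra_count: int = 0


@dataclass
class NMRRecord(GenericExperimentRecord):
    """Adds one NMR-specific field on top of the generic ones."""
    field_strength_mhz: float = 0.0


def summarize_all(records: List[GenericExperimentRecord]) -> List[str]:
    """Generic tooling can process any derived record without knowing its type."""
    return [record.summary() for record in records]


records = [
    MassSpecRecord("LC-MS", ["mass spectrometer"], spectra_count=3),
    NMRRecord("1H NMR", ["NMR spectrometer"], field_strength_mhz=600.0),
]
summaries = summarize_all(records)
```

The point of the design is that a repository or tool written against the generic parent keeps working when a new specific format is derived later.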
There's also project design. You could be doing multi-omics design in several different tissues, so you'd have maybe sort of transcriptomics or proteomics of the liver, all sorts of different types of things that you would want to somehow structure in some general description of a project. So in PSI we thought about this quite heavily, and what we've done is held off for now, because we've talked to the transcriptomics people about exactly what it is that they think is common between the two and where it's so generic that it has no omics specificity.
You'd also want to factor the metabolomics people into that, because they can have a different take on that kind of thing. And again, something like this FUGE model can potentially capture that, because it may not be the case that you can come up with a rigid structure that is always applicable, because projects have many different layers to them and sometimes they're massive collaborations between sites distributed globally.
But none of these formats should ever exclude the small person from using them — you shouldn't need to be in some global collaboration of hundreds of people doing three different technological approaches to be able to use the stuff. It should cater to the little person as well.