University of British Columbia
Who: Leonard Foster
Position: Assistant professor, University of British Columbia, department of biochemistry and molecular biology, 2005 to present
Background: Assistant research professor, University of Southern Denmark, 2001 to 2004; postdoc work in proteomics, University of Southern Denmark, 2001 to 2004; Ph.D., biochemistry, University of Toronto, 1996 to 2001
Leonard Foster and Charles Howes created PrestOMIC, an open-source application for storing mass spectrometry-based proteomics data. Unlike repositories such as ProteomeCommons and the Open Proteomics Database, PrestOMIC allows researchers to store their data in an interpretational manner, Foster says.
A description of PrestOMIC can be found in the June 6 edition of Proteome Science. Below is an edited version of a conversation ProteoMonitor had with Foster this week.
How did you go about creating this database?
The basis is that in the past, when we’ve tried to publish large datasets in mass spectrometry-based proteomics, we have normally had to supply all the protein identifications in some kind of Excel table or something like that because it’s just not possible to present them in the journal article itself. That’s resulted in a few problems. One is that often reviewers ask you, and fairly I think, to present the data in a way that is more accessible to the reader rather than having to scan through Excel tables – to find some better way for their reader to search it if they have a particular protein that they’re interested in, then they can find it more easily.
In the past, in one of the labs where I used to work, we built a searchable browser-based database system so that someone could do just that. But it was really built around the core database that we had in that lab. So one of the things I wanted to do in my own lab was build something that would be stand-alone and easily installed by anybody who wants to present a large database like that.
We’re coming out with, or hopefully publishing, several kinds of these things each year, and this will allow us to present that data sort of as an accompaniment to any journal article that goes above and beyond the regular supplemental table that people need to include.
Can you describe the kind of presentation that will be available to researchers now with PrestOMIC?
It’s a PostgreSQL database that basically provides a browser-based interface. There’s a customizable main page where different kinds of searching and filtering options are allowed, so depending on what the project is, that would have to be customized, but one thing that we’ve implemented is a Venn diagram system. If you’re comparing different types of tissues, then a Venn diagram presentation [will let] you look at the intersection or the union of various subsets of data.
[It also allows] filtering based on criteria that are specific to the experiment, so if there’s some kind of quantitative dimension to the data, [you can filter] them only for ratios that are above a certain threshold.
Also, there’s a Blast-based functionality. Let’s say someone is studying the same system as you but in a different model system, like if they’re looking at mitochondria in Drosophila and you’ve done your study in C. elegans, then they can take their Drosophila protein that they think is in mitochondria and Blast against the dataset in PrestOMIC, and hopefully if it was identified in the PrestOMIC presented data, then they’ll be able to find it.
It’s also got a function for Blasting specific peptides, so if you identify a peptide and you want to know if it may be in this dataset, then you can find it that way. And [it has] sort of standard search functions.
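The Venn-style comparison and ratio filtering described above can be illustrated with a small sketch. This is not PrestOMIC code: the sample names, protein identifiers, ratio values, and the helper functions `venn_regions` and `filter_by_ratio` are all hypothetical, and the sketch assumes each sample reduces to a set of protein identifiers with one quantitative ratio per protein.

```python
# Hypothetical data: protein identifiers found in two tissue samples,
# plus one quantitative ratio per protein. None of this comes from PrestOMIC.
liver = {"P1", "P2", "P3"}
muscle = {"P2", "P3", "P4"}
ratios = {"P1": 0.8, "P2": 2.5, "P3": 1.2, "P4": 3.1}


def venn_regions(a, b):
    """Return the regions of a two-set Venn diagram, plus the union."""
    return {
        "only_a": a - b,   # found only in the first sample
        "both": a & b,     # intersection of the two samples
        "only_b": b - a,   # found only in the second sample
        "union": a | b,    # found in either sample
    }


def filter_by_ratio(proteins, ratios, threshold):
    """Keep only proteins whose ratio is strictly above the threshold."""
    return {p for p in proteins if ratios.get(p, 0.0) > threshold}


regions = venn_regions(liver, muscle)
shared_high = filter_by_ratio(regions["both"], ratios, 2.0)

print(sorted(regions["both"]))  # proteins seen in both samples
print(sorted(shared_high))      # shared proteins with ratio above 2.0
```

In a database-backed system, the same selections would more likely be expressed as queries than as in-memory sets; the sets here only make the intersection, union, and threshold logic explicit.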
How unique is the Blast-based functionality?
Well, it just uses the Blast tool from NCBI, but it’s built into the whole server system. Everything in PrestOMIC was either created new and is available as open-source software or was based on other open-source or open-accessibility types of tools. It’s based in Linux and uses PostgreSQL, which is open-source, uses the Blast tool which is publicly available from NCBI, and it uses several other BioPerl functions that are open source as well, and it’s meant to be open access as much as possible.
So what is unique about PrestOMIC, then?
I think it’s the combination of the whole thing. Like I said, in one of the projects I was involved in earlier, we had developed something basically the same except there was no way to make it publicly accessible. [With PrestOMIC] you have to retrieve these other open-source things, but then the PrestOMIC core is the code that ties all these things together and allows you to use them in this format.
How does it work? Does someone go onto a website and dump all the data in?
There are a couple of ways that it could be done, but there are some automated import functions, so you basically need to define how you want to structure the project. This is completely dependent on what the experiment is. You might have a project and within that project you might have different types of samples that you want to keep separate, so you have to define those higher level things, but then all the raw data from the actual mass spectra all the way to the protein and peptide identifications, all those types of information are inputted automatically.
Is the data then going to be accessible to everybody who’s interested in looking at it?
That would be up to the person publishing it. You can make some kind of password accessible thing. But really what we’re trying to do is make something that would be completely open to anybody who wanted to access it. What we’re trying to do is address this need from the journal’s point of view of somehow presenting this kind of data in a more user-friendly manner.
PrestOMIC would not be hosted on the journal’s website so it wouldn’t be like if you’ve got a subscription to this journal and you can access their articles, then you’d also be able to access this data. The data would be on another server, and our hope would be that it would be accessible to all.
But you can imagine maybe setting up some kind of internal site as well, but really I think its bigger value is in [making] it publicly available.
PrestOMIC sounds different from other open-source databases like ProteomeCommons, which is really for researchers who want to look at the raw data and possibly duplicate experiments. It sounds like PrestOMIC is really for the benefit of the journals.
I don’t think it’s for the journals. [It’s] to add more value to the journal article. The way it’s different from the larger data depositories is that they’re just that, they’re data depositories. They don’t really have any mechanism for showing any interpretation of the data. They’re mostly for depositing the raw mass spectra data, and then other people can mine that for whatever they want. This is more for [the accompaniment] of a specific article that you publish in a journal. Rather than you just writing up the article and then tagging on this Excel file that can often be uninterpretable to anybody outside of your own lab, this would allow a mechanism for people to access that data more easily.
When we publish a journal article, the higher-level goal that we’re trying to achieve is to present the data to the wider public, and I don’t think that’s done very well with the current way that these large-scale datasets are being published. You have 5,000 words or 10,000 words describing what you found, but there isn’t anything beyond that because it’s very hard to extract any information from the supplementary tables that come along with those things.
I agree that it’s maybe a bit of a fine line, but I think what this is meant to be is something that a lab installs to share their data rather than to deposit the large datasets into a public repository for other people to mine.
Is one of the goals of this to encourage researchers to share their data?
It is a goal. I’d like to say we’re trying to encourage people to do that, but I think that what I see coming down the tunnel toward us eventually, are reviewers or journal editors, [who] are going to require that when you want to publish a large-scale dataset like this, it will have to be deposited into ProteomeCommons or something. It won’t be enough anymore to present all the data in a table that goes along with the journal article because there’s really not a lot of value in that for a reader.
So eventually what kind of relationship do you see PrestOMIC having with something like ProteomeCommons?
It’s going to be complementary. There’s some data that obviously is going to be redundant, like when you get right down to the raw spectra. Those would be redundant, but it’s the interpretation of the data that PrestOMIC would make available whereas there’s really not any way to maintain interpretation of the data in something like ProteomeCommons.
What personal experience did you have that led you to develop PrestOMIC?
I’ve had two or three papers where reviewers have said ‘This is very nice, but how is the reader going to access this data?’
In the past, like I said, we made up a specific website to allow access to one of the datasets, but this is basically just generalizing that so anyone can use that. I’m sure I’m not the only one who has had reviewer comments back like that.
Does this change in any way how the research is going to be done?
I think one of the other things that would differentiate PrestOMIC from [other data repositories] is that ProteomeCommons is more addressed at other proteomic researchers, making all that raw data available to them for whatever purpose, whereas PrestOMIC is kind of aimed at other proteomic researchers, but aimed more at other biologists in the area of whatever dataset it is that you’re publishing.
A biologist who has no experience with mass spectrometry is not going to make any use of something like ProteomeCommons, whereas PrestOMIC allows them more easy access and higher-level access to the data. So hopefully, it would make data from proteomics projects more accessible to the wider community outside of just proteomicists.
And where do you hope that will lead?
Our hope is that it would allow proteomics research to have a bigger impact and a wider impact outside of just proteomics. My impression is that proteomics studies, or any other large-scale dataset or large-scale data-generating method, are often not very accessible to people outside of the field because they don’t have any way to evaluate the data or extract any meaning from it.
If they’re outside of the field, would they necessarily be interested in the data?
I think so, because let’s say you’ve identified a protein in mitochondria or several hundred proteins, and you’ve quantified the differences between apoptotic cells and non-apoptotic cells, then having the data presented in this way would allow some physiologist who has some interest in mitochondria metabolism to look at that data more easily than just trying to wade through big tables of protein ID.
One of the advantages of PrestOMIC is that it can be modified as proteomics standards change. That can be fairly important since the standards are still being set, no?
That’s right. We could wait another five or 10 years until the standards start to settle down, but we prefer to publish it now and let it evolve rather than wait and not have something like this available.