Name: Lennart Martens
Position: Professor of Systems Biology, Ghent University; Group Leader of the Computational Omics and Systems Biology group, VIB
Background: PRIDE Group Coordinator, European Bioinformatics Institute
In April 2011, the National Institutes of Health's National Center for Biotechnology Information closed its Peptidome proteomics data repository due to funding trouble. Shortly after this closure, an effort to migrate the Peptidome data to the European Bioinformatics Institute's PRIDE database was launched by researchers from both institutions.
This process, which involved the transfer of data comprising roughly 53 million mass spectra, 10 million peptide identifications, 650,000 protein identifications, and 1.1 million biologically relevant protein modifications from 28 species and more than 30 different labs, was detailed in a paper published last month in Proteomics.
Accompanying the paper was a commentary by Ghent University researcher Lennart Martens, one of the original developers of PRIDE, in which he applauded the effort and outlined certain lessons to be taken from it.
ProteoMonitor spoke to Martens this week about his piece and the current state of proteomics data sharing more generally.
Below is an edited version of the interview.
In your commentary you mention the "heroic efforts" required to move the proteomics data in Peptidome over to PRIDE. What, in fact, are the challenges involved in such an effort?
This is actually quite nicely documented in the paper that [my] commentary goes along with. The structure in which the data had been stored and captured was slightly different. And this was at two levels. The first level was the difference in actual database structure: which fields were captured, and whether all the necessary fields were there. The second thing was the annotation of the data. The annotation of the data in PRIDE was performed primarily through controlled vocabularies, which means that certain fixed wordings are used from lists, whereas in Peptidome this was not necessarily the case.
This means that mappings had to be made between the annotation technology in Peptidome, which was slightly more flexible, and that in PRIDE. This had to be done manually, and then scripts had to be run to do the format conversion and the annotation conversions based on these manual mappings.
Then, of course, each dataset has a slightly different set of reported parameters, which is a bit of an issue in the [proteomics] field, and in many [omics] fields, in fact. There are minimal reporting requirements that are not always adhered to, and then people also supply additional information, which is wonderful, but what kind of additional data is present differs between the different submitters. So you build a script to transport one particular dataset, and it turns out that the next dataset has a field that you were not expecting, something new that the researchers added. So then you have to take care of that.
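[Editor's illustration: The conversion step Martens describes, a manually built mapping applied by script with unexpected fields set aside for review, might look roughly like the Python sketch below. This is not the migration team's actual code. The field names, the mapping table, and the "CV:0001"-style accessions are hypothetical placeholders, not real controlled-vocabulary identifiers.]

```python
# Hypothetical, manually curated mapping from free-text annotations
# to controlled-vocabulary terms. Accessions are placeholders only.
MANUAL_MAPPING = {
    "ltq orbitrap": "CV:0001",
    "orbitrap": "CV:0001",
    "human": "CV:0002",
    "homo sapiens": "CV:0002",
}

# Fields the conversion script was written to expect (assumed for illustration).
EXPECTED_FIELDS = {"instrument", "species"}


def convert_record(record):
    """Split a free-text record into converted terms and leftovers.

    Returns (converted, unmapped). Anything the script does not
    recognize is kept in `unmapped` for manual review, not dropped.
    """
    converted = {}
    unmapped = {}
    for field, value in record.items():
        key = value.strip().lower()
        if field in EXPECTED_FIELDS and key in MANUAL_MAPPING:
            converted[field] = MANUAL_MAPPING[key]
        else:
            # Either a field nobody anticipated or a wording
            # that is not yet in the manual mapping table.
            unmapped[field] = value
    return converted, unmapped


record = {"instrument": "LTQ Orbitrap ", "species": "Human", "gradient_length": "120 min"}
converted, unmapped = convert_record(record)
```

[In this toy record, the "gradient_length" entry plays the role of the field "you were not expecting": it ends up in the leftover set, where someone has to decide how to handle it.]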
The third problem is interesting because it relates to an issue in the field proper, and that is protein inference. If a particular peptide can be matched to more than one protein then the way this is reported differs very strongly between groups, and the way that Peptidome handled it is different from the way PRIDE handles it. So these kinds of things [from Peptidome] have to be shoehorned into the [PRIDE] schema even though it wasn't reported in a way that is optimal for that schema. So there are quite a lot of challenges.
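[Editor's illustration: To make the protein inference problem concrete, the Python sketch below contrasts two hypothetical reporting conventions for a peptide that matches more than one protein. Neither is claimed to be how Peptidome or PRIDE actually stores such matches. The function names and accessions are invented for illustration.]

```python
def report_all_matches(peptide_to_proteins):
    """Convention A (hypothetical): list every protein a peptide matches."""
    return {pep: sorted(set(prots)) for pep, prots in peptide_to_proteins.items()}


def report_single_match(peptide_to_proteins):
    """Convention B (hypothetical): keep one representative protein per peptide.

    The alphabetically first accession is chosen here purely for simplicity.
    Assumes every peptide has at least one matching protein.
    """
    return {pep: min(prots) for pep, prots in peptide_to_proteins.items()}


def lost_in_conversion(peptide_to_proteins):
    """Proteins that disappear when convention A is forced into convention B."""
    full = report_all_matches(peptide_to_proteins)
    single = report_single_match(peptide_to_proteins)
    return {
        pep: [p for p in full[pep] if p != single[pep]]
        for pep in full
        if len(full[pep]) > 1
    }


matches = {
    "LVEK": ["PROT_B", "PROT_A"],  # shared peptide: ambiguous
    "TAYIAK": ["PROT_A"],          # unique peptide: unambiguous
}
lost = lost_in_conversion(matches)
```

[Here the shared peptide keeps only one of its two candidate proteins after conversion, which is the kind of nuance that can be lost when one reporting style is fitted into a schema built around another.]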
Is some loss of data inevitable in this shoehorning process?
I think when it comes to the data and the metadata everything can fit into PRIDE. PRIDE is really flexible and was built from the start to accommodate changing requirements over time. So I think there, there was zero loss. However, when it comes to issues the field is still struggling with, such as protein inference, it may well be that in certain cases a particular nuance may be lost, because PRIDE doesn't deal with that in a particularly sophisticated way right now, and it may be that Peptidome captured more nuance there.
The issue there is that no one knows to what extent the nuance is useful. [For instance,] if you take a bunch of peptides you identified in 2005 and you look at the proteins you identified with them in 2005 and you then take that set and remap it to a current sequence database, you might get a different list of proteins because the sequence databases are in flux and the proteins today are not necessarily identical to the ones in the past.
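[Editor's illustration: The remapping effect Martens describes can be shown with a toy example in Python. The two "databases" below are invented and tiny, and the matching is a plain substring search, which is a simplification of how real tools map peptides to proteins.]

```python
def map_peptides(peptides, database):
    """Return the accessions whose sequence contains at least one peptide.

    `database` maps accession -> protein sequence. Exact substring
    matching is an assumption made to keep the sketch short.
    """
    hits = set()
    for accession, sequence in database.items():
        if any(pep in sequence for pep in peptides):
            hits.add(accession)
    return hits


# Invented sequence database releases: entry P2 has been retired
# and its sequence now lives, extended, under a new accession P3.
db_2005 = {"P1": "MKTAYIAKQR", "P2": "GGSLVEKTTR"}
db_current = {"P1": "MKTAYIAKQR", "P3": "AAGGSLVEKTTRPP"}

peptides = ["TAYIAK", "SLVEK"]
proteins_2005 = map_peptides(peptides, db_2005)
proteins_current = map_peptides(peptides, db_current)
```

[The same two peptides yield a different protein list against each release, even though the peptide identifications themselves have not changed.]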
As you noted in your commentary, some people question why we need to store all this raw proteomic data, particularly from older experiments, given the rapid advances the field has seen. How would you respond to that?
I think that scientists, especially in the life sciences, have always been collectors. So we keep track of stuff. We keep track of everything. The botanists and explorers, they kept samples of everything... and this forms the basis for future research. So as long as the cost of storage is not excessive, as long as you can store it relatively easily, why should we throw any of it away? Because you never know what you might extract value from in the future, and there is a lot of stuff you can only do if you have a sufficient amount of data from a sufficient number of experiments.
We recently got a paper out in the Journal of Proteome Research [on a method of] predicting trypsin cleavages. Now, if people had told me six or seven years ago I would write a paper on predicting trypsin cleavages I probably would have given them a long look and asked them if they were insane, because at the end of a shotgun proteomics approach you really couldn't care less about this: you get on average 50 peptides per protein, which means you get plenty of chances to pick up a peptide, and whether or not you are able to predict exactly which one you are going to see is not that big a deal.
But in targeted proteomics it becomes a very big deal, and so we were able to predict which peptides you would see relatively accurately from any kind of dataset purely because we had more than a few million [peptides] to train on from PRIDE. And some of these may very well have been low quality, but that doesn't matter, because you have a huge amount of data and the machine learning algorithms that are increasingly employed in proteomics thrive when you give them a lot of data. So keeping the data to me is very simple. If you can afford it, why not? And just to make the comparison to next generation sequencing — the storage needs that these people have for keeping track of their sequences eclipse the storage needs for raw proteomics data by such a huge amount that there really is nothing to worry about in terms of storing proteomics data.
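[Editor's illustration: For readers unfamiliar with trypsin cleavage, the Python sketch below applies the common textbook rule: cut after lysine (K) or arginine (R) unless the next residue is proline (P). It is a fixed-rule baseline only. It is not the machine learning predictor from the Journal of Proteome Research paper, which, as Martens describes it, was trained on millions of PRIDE peptides.]

```python
def tryptic_digest(sequence):
    """Split a protein sequence with the simple K/R-not-before-P rule.

    Missed cleavages and other real-world effects are ignored.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Keep whatever remains after the last cleavage site.
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


peptides = tryptic_digest("MKTAYRPLKGG")
```

[A rule like this says where trypsin could cut. It does not say which of the resulting peptides will actually be observed, and that is the question that matters for targeted proteomics.]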
In your commentary you cite four examples of studies reusing proteomics datasets. All four were published in 2012. Was this just coincidence, or did the field see a recent shift towards such experiments?
The reason I started [PRIDE] in 2003 was precisely because I wanted to analyze a lot of data, and I thought that it would be good if everyone gave their data to the community and the community could do something with it. There are two ways that you can approach proteomics data reuse: the "poison cup" approach or the "goldmine" approach.
The "poison cup" approach means everyone always mistrusts everybody else's data because we know all the failings of proteomics and all the possible issues and that there is always going to be 5 percent nonsense in there. The "poison cup" approach is that if there is a little bit of poison in your cup you aren't going to drink it. And that is the approach that most people took for a long time. I can show you a lot of reviews I got back on attempts to reuse proteomics data. I got this one manuscript submitted to Nature Biotechnology that after three review cycles was rejected on one point only: the fact that I could not show them an exact false discovery rate on all the data sets I was analyzing. So that is the "poison cup" approach, and up to a year ago that was by far the dominant way of thinking when you looked at the literature.
Now the mindset of people has been changing to the "goldmine" approach. You assume there is going to be quite a bit of rock around the gold, but that doesn't mean you don't want to invest the effort to go after the gold. You just need to ensure that you have a pipeline set up to separate the gold from the rock. So you now see [University of Dundee researcher Ron Hay] looking at ADP ribosylation in Nature Methods, you get ... [Technical University Munich researcher] Bernhard Kuster's group looking at this particular form of glycosylation in [previously obtained] phosphoproteomic data. And then you see [Seattle Children's Research Institute researcher] Eugene Kolker with the [Model Organism Protein Expression Database] MOPED and [Swiss Institute of Bioinformatics researcher] Christian von Mering with [the Protein Abundance Across Organisms Database] PaxDb. You get [those four] out in [2012], which is very interesting, showing that this is an idea whose time has come.
Has anyone attempted an economic analysis to show the benefits of proteomic data reanalysis? To say, for instance, this is how much extra research, in terms of grant money, can be extracted from existing datasets?
I think as an approximation it would be possible. I don't know if anybody has ever done this. I haven't read all the grant proposals, but I could imagine somebody doing it [in a grant proposal], but I have never used an economic argument for this. I think that could be difficult because I think you would upset as many people as you pleased. The proteomics community, the people who run mass spectrometers, might be quite upset if you tell them they are expensive compared to what you are doing. So I don't think I have seen anything like that, and I would be reluctant to try it myself. It would be a fun exercise. I just wouldn't want to brag about it.
Anybody who reuses data has to be careful about being branded a parasite. People feel strongly about this. Any analytics person who has a beautiful instrument generating beautiful data, they can feel a little bit strange when somebody comes along out of the blue and says, "Oh, I took your data and I got this and this out of it in addition." It's a touchy subject, which is also probably why it took a while for it to take off. So, doing an economic analysis of this sounds like a wonderful idea if you are completely out of [the field] — if you majored in economics and want something to talk about. Inside the field, you might be shooting yourself in the foot. I think the economics would be highly positive, because the data has been generated, the money has been spent, and very often the data has served its purpose and resulted in a publication, which is why it is in the public domain in the first place. So that dataset has gone through the entire value cycle, and any additional value you extract from that dataset is going to be a net gain.
With regard to the concern about being branded a "parasite," what are the conventions for crediting the original researchers behind a dataset that you reanalyze? Do they need changing?
There is a big discussion about that in the life sciences in general. This is not just a proteomics issue. Now, of course, there is a simple mechanism: when we reuse someone else's data, if it has been published, we add the publication to our reference list, and we usually say in the acknowledgements we would very much like to thank the people who submitted their data to PRIDE. So this is what we try and do, and essentially this is how science is built, right?
If you make a beautiful protocol for a bit of wet lab work, and people use your protocol, you don't want to be a co-author on their paper. You just want them to cite your protocol. It's very similar in a way with data reuse, although people feel differently about it. If you make a beautiful dataset, you get your paper or papers out of it, and you make it publicly available as a result of that. Then if people can do other things with it that should be fine. Imagine if the people who sequenced the human genome felt that people using the genome sequence required them to get massive credit. It's the same thing.