NEW YORK (GenomeWeb) – Last week Ghent University Professor Lennart Martens and Juan Antonio Vizcaino, head of the PRIDE mass spectrometry data archive at the European Bioinformatics Institute, published a commentary in Trends in Biochemical Sciences titled "A Golden Age for Working with Public Proteomics Data."
A leading figure in proteomics informatics and one of the original developers of PRIDE, Martens has a longstanding interest in the collection and dissemination of proteomics data and has been a vocal proponent of its reuse and reanalysis.
GenomeWeb spoke to Martens this week about the TiBS commentary and why he considers the current environment so promising for researchers working with published proteomics data.
Below is an edited version of the interview.
Your review calls this a "Golden Age" for public proteomics data. What makes that the case?
I actually think that the key thing that led us to write this review is that the infrastructure is now in place and … is being heavily used to disseminate information. If you look at the amount of data that is being pushed through ProteomeXchange these days, I think the latest numbers from the PRIDE repository are that it releases about eight datasets per day, on average.
That's a pretty large number, and I would imagine that as we get into 2017 that number will go up. It might actually almost double. There's an enormous amount of information being disseminated through this very well-established infrastructure. That actually creates, I think, the foundation for this golden age.
Then the second thing is that we are starting to see this reuse of the public proteomics data. We see independent researchers starting to pick up this public data and reuse it.
Are there any areas of focus you see among people who do a lot of reanalysis of public datasets? Any areas where this is proving to be particularly useful?
There are actually three that strike me as currently being really big success stories. The first one is the search for different PTMs, so post-translational modifications. There are two [notable] studies — one which has to do with the detection of ADP-ribosylation of proteins, which is a known modification with some biological significance that is rarely studied with proteomics, if at all. [Ed. note: covered here by GenomeWeb.] The second was a study by [Technical University of Munich Professor] Bernhard Kuster on O-linked glycans, which are also very much understudied.
What these two groups did was they took existing data and they reanalyzed these datasets with a specific view to finding these [modifications]. And it turned out to be very successful, and these were quite high-impact papers.
That really showed the people in the field that it is perfectly possible to take data acquired for a given purpose with a given protocol and completely repurpose that data to a different type of study.
The second thing is all the proteogenomics analyses that have been done... It's perfectly possible to take proteomics data and really dive into some of the more obscure parts of genome annotation in more detail.
The third is a more technical component, and that has to do with the building of spectral libraries. Spectral libraries have become very important for data-independent acquisition technologies like SWATH and MSE, where they form a cornerstone of the bioinformatics part of the analysis. So these are three really big success stories where I would say, yeah, public data just works.
What is the attitude among funders towards reanalysis work? Is it harder to get money?
It's a very interesting question. I submitted a [European Research Council] proposal a few years back, I think it was 2013, in which I wanted to do this large-scale reanalysis of the human proteome based on all the public data, which we have meanwhile done and which has yielded very, very nice results. But it wasn't funded at the time. I think the most pertinent comment that came back was, 'It's low risk if it works,' which I still think today is an amazing review comment. There's no way you can get around that one.
I would say then it was a bit difficult. I mean people really looked at this, and I remember one of the reviewers also saying it just goes to show how badly people process proteomics data that there's so much undiscovered, which I think completely misses the point, in fact. But it's a common misconception, right? It's just that data is hard. There's just so much biology that we don't know yet, so we have no clue how to interpret the data.
Anyway, I think there was a time when this was definitely difficult. Right now I've not submitted another grant on this, but it's funny that we're talking about it because I am planning on submitting another one. So maybe I can answer that question about the funding in a while. I don't know of anyone who has a grant specifically dedicated to reanalysis of public data.
I would guess the obstacles to getting a grant like that lie predominantly in the mindset of the reviewers. I think if you got a young review panel, people who kind of grew up in the era of ever more data, they might see this slightly more easily than if you get a grant review panel of people who are a little bit more in the… I'm going to say this in a diplomatic way… more in the mindset of 'a single dataset builds the entire story.' So that might be a generational thing and not limited to proteomics.
How about journals? Is it more difficult to get papers accepted?
I've never really had any big problems with that. In fact, I think people appreciate the fact that data is being put to good use. Nobody really looks at it and says, 'Oh, they're stealing data,' at least not in the field of proteomics, really. We don't suffer from this idea of research parasitism, which, for whatever crazy reasons, has infected the field of medicine, where they have some very vitriolic comments about it.
When we started talking about public proteomics data, I distinctly remember getting into some heated arguments with people who really disliked the fact that other people would get their hands on their data. I think that notion has completely gone now. People see that their work gets cited and that their work gets amplified in significance because other people can do more with it. So I would say that battle has been won.
Does this sort of approach present an opportunity for, say, younger researchers, or those who aren't as established and maybe don't have access to much money, to get some work done that doesn't require a lot of start-up funds?
I keep telling young people, especially young PhD students or young post-docs, look, you've got seven to eight datasets coming out per day at a minimum. Just imagine all of the things you could possibly do. And there are so many nice [informatics] tools out there that can automatically link up to this proteomics data. So, really, the limit is your imagination. You need nothing but a computer.
There is this huge opportunity out there, especially for young people, to grasp the significance of all this data and to just come up with the craziest things that you could possibly ever do with this data. Try it, and some of it is going to be pure gold, and the rest is just going to be fun. It might not get you a Nobel, but it's fun. We recently did something crazy in our group looking at protein association across datasets. The manuscript has now been submitted. I won't say too much about it yet. Let's see what the reviewers think about it.
But it turned out to be amazingly successful, and it was just one of these crazy ideas where you go to a student and you say, are you willing to take this risk and do this crazy thing that I'm proposing? But you know within, say, two or three months whether it's a go or a no-go. So you get feedback really quickly as well.
So, yeah, if I were a young researcher today, I know where my efforts would be. It would be in trying to come up with something original to do with this data.