Team leader, proteomics services
European Bioinformatics Institute
Name: Henning Hermjakob
Position: Team leader, proteomics services, European Bioinformatics Institute, 2005 to present
Background: Sequence database group coordinator, EBI, 1997-2004; Diplom (M.Sc.) Bioinformatics, University of Bielefeld, Germany, 1996
Henning Hermjakob has been one of the leaders in the Human Proteome Organization’s Proteomics Standards Initiative. In the August edition of Molecular BioSystems, he and Lennart Martens, senior software developer on the PRIDE project in the proteomics services team at the European Bioinformatics Institute, authored an opinion piece on why proteomics researchers need to make all their experimental data publicly available.
In it, they outline the controversy around making such data available, review some guidelines that have been already suggested, and offer some recommendations of their own. Below is an edited version of a conversation ProteoMonitor had with Hermjakob this week.
Why did you write this opinion piece? It suggests that you feel you need to convince people to share their data.
Yes, that’s clearly correct. And it’s very hard to convince people to submit their data. There are two major obstacles. One is simply the additional work. People just say, ‘Why should I do it?’ And the other is reluctance to part with your data, probably because then indeed problems in the data could be easily found.
People want to fully exploit the data that they generated before they make it available to anybody else. And often by the time they feel that [the experiment is] done or even not done, the grant is finished and there’s no interest in their making the data publicly available.
Let’s talk about both reasons. The additional work — shouldn’t researchers be collecting the data and keeping notes on the data as they’re conducting their experiments anyway?
Well, sometimes the information is really not there, because it is buried in a lab book or it hasn’t really been taken because it wasn’t considered important. And the other point is even if the information is there, which I think is normal, then it is still work to get this into another format. It’s at least copy-and-paste from one system to another system, and it might mean some retyping of some information. So there is some additional effort. Of course [for] some large-scale labs, they just need to set up the appropriate export routine [once], and then this can be done many times and they can export the data in the right format from their internal representation, which they have anyway.
But for small-scale labs, this is rarely the case. There is often not enough bioinformatics support, so it’s more or less a manual job. We as database providers consider that it’s our task to make this as painless as possible and we [make] a lot of effort to provide good tools to allow people to do this as efficiently as possible.
How far along are we in making this as seamless and painless as possible, even for the smaller labs?
It’s hard to say. The PRIDE database’s Proteome Harvest spreadsheet really goes a long way toward making the submission process painless, even for a small lab.
But it’s clear that this is not perfect. One of the big things that we are still working on, which works to a certain extent but needs to be improved, is having the mass spectrometry instruments themselves directly output formats that are appropriate for database deposition. This has worked to a reasonable extent with the mzData format from the Proteomics Standards Initiative and with the mzXML format from the Institute for Systems Biology.
Now we are working together with the ISB to provide one uniform format so that the vendors have to implement only one format and we are basically taking the best from both worlds, and we hope to release that format by the end of the year.
And extrapolating from previous acceptance, I think that this will be quite well accepted. The instrument vendors have participated heavily in defining this format, so I think we can expect good uptake by the vendors in terms of implementing it, especially as we will now be able to say, ‘This is the one format. There are not two that are more or less competing with each other.’
So this should be another major step forward.
It still sounds like you’re saying that this is a significant obstacle, the additional work.
Yes, there is still significant work, but it needs to be seen in context. If a grant runs for three years and the data generation and interpretation takes three years, and people then are reluctant to make the data [available] — which after all has been paid for by taxpayers — and it takes no more than three hours to make the data available, then this is certainly something that needs to be seen in the overall perspective of the grant.
Even if we get to the point where we have the bioinformatics tools to make submission of the data easier, we still have the second reason that you mentioned: that people are reluctant to make their data available. And it sounds like you’re talking about the “scoop” factor, that people are afraid that if the data is out there, someone may take it and find something the original researchers didn’t and take credit for it. How do you overcome that?
I think this can never be fully overcome, but in the long run it needs to be addressed by setting up a decent time frame. We are not arguing that everybody should make their data immediately available as it is generated, as has been done with sequence data according to the Bermuda Principles; that was a different kind of project, where the financing was exclusively for generating the data. There should be, for each data-generating project, a clear policy with timelines for when the data should be released. And this should more and more be part of grant applications; grant agencies are increasingly looking into what a researcher’s data release policy is.
I think a policy that all data should be publicly available no later than six months after the end of the project, for example, is a perfectly valid one, and we are providing the infrastructure to make that possible: you can submit the data at a time point X, and we will only release it at X plus Y months.
So, for example, if a grant comes toward its end, the data can be submitted and kept confidential. If, once the grant is finished, there isn’t anyone available to do the work, the data will still be automatically released at a given time point, which has been set by the original data owners.
Have you made these suggestions to the community at large?
This is frequently discussed in the community and this is also something which has been considered, for example, in the discussion for the data management policy of the [UK’s Biotechnology and Biological Sciences Research Council], which is currently in the definition stage.
What has been the reaction to it?
It’s mixed. Some people are really in support of it, and others are reluctant to implement it.
What have been the objections? Has it been the timeline or that they don’t want to make their data public ever?
I think the objections tend to be around the additional work associated with it, but very few people say, ‘No, I want to keep my data absolutely private,’ because this is more and more a position that is difficult to defend. But it’s clearly in the background somewhere, I feel.
Have commercial vendors or academia complicated this issue? For example, have either ever said, ‘No we don’t want the data being generated by our researchers out there. If something comes out of this work, and there is some dispute about intellectual property, we want to make sure we have the rights to it’?
I think for companies, there’s rarely ever data coming out. This is very clearly associated with the fear of making data publicly available, the fear that somebody else might benefit from it: ‘Our competitor might benefit from it.’
But more and more, the attitude is changing, and there is a growing realization that making data available freely also prevents patenting by others. So now, for companies, it shifts really a bit more toward the additional work associated with the data. At least from what we hear from companies, it’s more and more ‘Yes, we could release some data, but it’s too much work, and the benefit is not clear.’
So, there is a slight shift in thinking, I would say.
One other very important point, for academics, is that the databases of course attribute the data to the original submitter and to the associated publication. And one thing that many people don’t realize is that by submitting the data to a database, the publication also becomes more visible, because more and more people now start with database searches where previously they would have started with literature searches.
So publications whose data has been released get more exposure, more access, and ultimately more citations, which is a huge benefit that often goes unrecognized because it has only recently emerged.
Some journals and funding agencies now recommend researchers submit their raw data when they submit their studies or applications. Have you noticed that this has had any effect?
Yes, it’s a very noticeable change, and we see it especially in submissions to the PRIDE database. There has been a significant rise since Nature Biotechnology [asked researchers to submit their data]. Since then we’ve gotten significantly more submissions to PRIDE, partly because that was a high-profile publication, but also from others who simply realized that this is more and more required by journals.
The editors of Proteomics are now often asking their authors to submit the data to a database, so it’s very noticeable that the tide is changing.
Once they are told by editors, I think they do it.
Because they have no choice?
So far it’s usually not ‘You have to [submit the data].’ The journals are quite careful about this. This was also what was stated in Nature Biotechnology’s editorial. They are only strongly recommending it, and if an author were to say ‘No,’ this would very likely not be a reason, in itself, to reject the paper. But there are cases where it would at least weigh significantly, for example the database briefs in the journal Proteomics. The whole purpose of that section is to provide interesting datasets.
These papers are not about a new methodology. They are about well-done proteomics on a novel system providing a reasonable dataset, and for this section, there might be strong pressure to make the data publicly available.
In your article, you talk about the lack of information about data processing in data submissions. Why has that been?
I think to a large part because the importance of the data processing has grown over time, and nowadays it’s a key step, probably as important as the methodology. This realization has started to permeate. People have just not been paying as much attention to it in the proteomics field [as they have], I would say, in the microarray world, where from the beginning there was a very strong realization that the data processing is key.
The microarray world from the beginning had quite a strong community feeling and activity in making data publicly available. As a result, enormous progress has been made on the statistics of microarray data.
You recommend using “common, community-driven standards” for the minimum requirement in reporting this data, rather than creating some in-house requirements.
There are two things: one is what you have to report, and the other is quality control. We think these two should be kept separate. PSI really aims at describing what you should report, so you have to state in your manuscript how many biological replicates you have done, but we don’t say you have to do five biological replicates, because in our opinion this is something which cannot be well formalized; it is very dependent on the technology and the circumstances.
So quality control is something which should be defined by the journal and implemented by the editors, while the reporting standards, we think, should be globally [defined] and journal-independent, and should be implemented across the board so that tools can be developed against the formats and can assume that certain data items are there, and so on.
Is there enough information available and enough agreement among researchers to make this community-driven standard possible now, or is that something far into the future?
I think that’s not at all far into the future, and we are now seeing the first results of this process. We have been working in PSI on all these reporting standards for the last four to five years. And we now see the first finalized modules. We will have in this month’s Nature Biotechnology issue what we call the MIAPE [minimum information about a proteomics experiment] parent paper, which gives the framework for the minimal information for a proteomics experiment.
And the first implementation module, which is technology- and domain-specific, is also appearing in the same issue, namely MIMIx, the minimum information for reporting molecular interaction experiments.
[Editor’s note: Articles describing MIAPE and MIMIx are in the August issue of Nature Biotechnology]
What do you hope will happen when they come out? It’s one thing to come up with guidelines and another to get people to follow them.
This is clearly a process which will take some time. But it is important to have them as a central reference point which can then be implemented by tools. Clearly these guidelines are also an easy point of reference for journals, for funding agencies, to say ‘In your publications, you have to provide data according to this publication.’ And this is something that will more and more come. These will be used as the reference point.