The European Bioinformatics Institute has announced plans to provide storage for raw mass spec data as part of its Proteomics Identifications Database, PRIDE.
With the move, EBI aims to offer researchers a stable home for raw proteomics data, something the field has lacked since the University of Michigan-based Tranche repository began cutting back its activities in late 2010 due to lack of funds.
The volume of mass spec data is growing exponentially as scientists increasingly adopt proteomics as a research tool and instruments reach ever-faster acquisition speeds. And as this flow of data increases, it has proven challenging to build and maintain repositories for holding and sharing it.
In particular, researchers have recently struggled with storage of raw mass spec files, as Tranche – the proteomics community's primary repository for such data – has run into funding problems that have affected the system's stability and accessibility (PM 9/30/2011).
Given these issues, EBI, which recently received a $117 million grant from the UK government to build a bioinformatics hub as part of the European Life-science Infrastructure for Biological Information, or ELIXIR, initiative (BioInform 12/9/2011), has moved to fill the breach.
The institute has thus far accepted two raw data submissions as an initial test of its system and plans to start ramping up its efforts by "gradually taking [raw data] submissions from existing collaborators," Henning Hermjakob, EBI's team leader of Proteomics Services, told ProteoMonitor. He added that his team aims to have the system fully operational and accepting submissions from all comers by the end of the year.
The decision, Hermjakob said, stemmed from a series of discussions within the ProteomeXchange consortium, a group established to ensure coordination among the major proteomic data repositories, including PRIDE, Tranche, and the Institute for Systems Biology's PeptideAtlas. That organization, he said, decided that because of Tranche's difficulties, EBI should take over raw mass spec data storage.
Although PRIDE had previously stored only processed mass spec data and metadata, adding raw data shouldn't present much of a challenge, Hermjakob said, particularly given the funding and infrastructure the institute already had in place for storage of genomic data.
"Next-generation sequencing data is currently vastly larger in volume [than proteomic mass spec data], and we have the infrastructure set up to cope with that," he said. "So in that context, the proteomic data is something that can be served by the same infrastructure with limited additional costs. So it's a new activity, but it's not a major [challenge]."
EBI didn't host raw mass spec data in the past largely due to the existence of Tranche, but also because it was unclear how much demand there was in the research community for such data, Hermjakob said.
"The demand for proteomics raw data storage has been around for a while in discussions, but … proteomics raw data has potentially not that big a usage community," he said. "[A] limited number of labs around the world can make use of this data in terms of reprocessing it or re-evaluating it, and that's why we've been very reluctant [to host such a resource.]"
He added that the institute will be keeping an eye on usage levels, and "if there is very little usage in terms of not just submissions but downloads of the data and ideally citations, then we might come to a point in the future where we say, 'No, this is not a good use of taxpayer money in the end.'"
Hermjakob noted, however, that recent trends suggest increasing interest among proteomics researchers in using raw mass spec data. He cited a recent satellite gathering at the Human Proteome Organization's Proteomics Standards Initiative meeting at which a "major message was that there is strong demand for proteomics raw data storage and access."
"It's definitely a concern in the community, and there is definitely the feeling that the number of groups who can meaningfully use the data and are doing something with the data is increasing," he said.
Data sharing is typically considered an important practice across the sciences, but it's perhaps even more important in the case of data-intensive disciplines like proteomics, where a paper's original authors may have been interested in only a small slice of the information their study generated.
As Phil Andrews, the University of Michigan researcher behind Tranche, suggested to ProteoMonitor in an interview about the repository last year, "what we do often in proteomics is that we have a specific aim or two that we're trying to address in a given experiment, and we generate a large dataset but we may only be interested in one aspect of it – say phosphorylation or what proteins change levels."
"But there's a huge amount of data in there, and that could be used by other laboratories if that data were made available," he said. "The idea is you get value added. If you get two labs using a dataset, then you've basically doubled the cost-effectiveness of that experiment."
In addition to individual researchers, proteomics journals have been a major force behind the drive to establish repositories for raw mass spec data. The journal Molecular & Cellular Proteomics, for instance, had previously mandated that all papers be accompanied by the submission of their raw mass spec data. In light of Tranche's troubles, however, the editors put that requirement on hold.
Now, with EBI planning to accept raw data, MCP hopes to reinstate the requirement, Robert Chalkley, a University of California, San Francisco, researcher and member of the journal's editorial board, told ProteoMonitor.
"MCP has been pushing for alternatives [to Tranche], and the EBI option looks like the most promising because they've got this huge [ELIXIR] grant from the UK government to build a data storage center," he said.
Chalkley said that MCP plans to wait to make certain the EBI repository works properly before putting its raw data mandate back in place, but, he said, "if everything is as expected then we would reinstate the requirement [that authors] submit raw data with publications."
In addition to PRIDE, the ISB's PeptideAtlas, and particularly its new Passel database, could also serve as a raw data repository for proteomics researchers, Hermjakob suggested.
Introduced at the 2011 HUPO meeting, Passel is intended to provide experimental selected-reaction monitoring assay data to complement the synthetic peptide-derived assays that currently compose the bulk of the ISB- and ETH Zurich-developed SRMAtlas (PM 2/24/2012).
Led by ISB researcher Eric Deutsch, the database aims to flesh out the SRMAtlas resource by adding other researchers' experimentally derived SRM-MS assays and datasets and providing information on how given peptides work in SRM-MS assays performed on particular biological systems.
With this addition of experimental data, it could make sense for Passel to serve as a repository for the raw mass spec data from these SRM experiments, leaving PRIDE to concentrate on shotgun data, Hermjakob said.
"PRIDE is focused mainly on [shotgun] data and Passel is focused mainly on SRM," he said. "So we are trying to have as little overlap as possible."
Key to combining all of these resources in a useful way, Hermjakob added, will be capturing the metadata that allows scientists to sort and search through the mass spec data for research relevant to their work.
"This is a major challenge that takes a lot of time and effort to sort out," he said. "But if you have [for example] Orbitrap Velos spelled in five different ways plus another half-dozen misspellings, then you can't easily filter all the experiments that have been done on an Orbitrap Velos."