The proteomics data consortium ProteomeXchange appears to be gaining momentum, with several hundred new datasets added to the resource in the past eight months.
At the Human Proteome Organization's 2013 annual meeting, the consortium's organizers said that as of that July it had received 310 submissions. Since then, the consortium has more than doubled that figure, announcing in a letter published this week in Nature Biotechnology that, as of February 2014, 685 mass spec datasets comprising roughly 32 terabytes of data had been submitted.
The uptick is indicative of an ongoing shift in the field's mentality, said European Bioinformatics Institute researcher Juan Antonio Vizcaíno, project manager for the consortium, suggesting that data sharing is slowly but surely becoming more common among proteomics researchers.
"People have been reluctant" to submit their raw mass spec data to repositories, Vizcaíno told ProteoMonitor, noting that this reluctance has been due in part to fears that other researchers might find things they missed in their raw data, and to a much larger extent simply to the scarcity of time and resources.
"Scientists in general are very busy. If they are not pushed to do something, then by default they will not do it because it is extra work," he said. "But there has been a gradual change in mentality. [Particularly] now that the repositories have improved in the last few years, people are more keen to submit their data."
The ProteomeXchange consortium was formed in 2006 with the aim of improving coordination among various proteomics data resources, providing a single framework and infrastructure through which researchers can access data from the field's major repositories.
"The idea behind it was that there were quite a few proteomics resources available, but there was no formal collaboration between them," Vizcaíno said. "It was very hard – and still is – for scientists to not only have a common way to submit data, but also access data that was already public."
"People had to go to different resources, making the same searches several times," he said. "So the idea was to have a common framework for submission and also dissemination of proteomics data."
The resource was formally launched in January 2011 with a grant from the EU's Framework Programme 7 and began accepting submissions in June 2012. It currently brings together data from the EBI's PRIDE repository as well as the Institute for Systems Biology's PeptideAtlas and PASSEL repositories.
The consortium is currently working with researchers at the University of California, San Diego to add that institution's MassIVE data repository to the resource, and has begun talks regarding adding the University of Washington's Chorus resource, the Technical University Munich's Proteomics DB resource, and the Chinese iProx resource, Vizcaíno said.
In the Nature Biotechnology letter, Vizcaíno and his co-authors provided a snapshot of the current composition of data sets accessible via ProteomeXchange. For instance, of the 685 submissions, 309 consist of human data, 79 are mouse data, 31 are Arabidopsis, and 23 are yeast. In total, more than 200 species are represented.
Geographically, there are submissions from more than 30 countries, with the US leading the way with 123 datasets, followed by Germany with 94, the UK with 57, Switzerland with 53, and the Netherlands with 43.
The overwhelming majority of the datasets – 656 – consist of tandem mass spec data, though there are also 29 selected-reaction monitoring datasets from the ISB's PASSEL repository.
The resource is also adding datasets from top-down experiments and experiments using data-independent mass spec methods like Swath, Vizcaíno noted. Interest in DIA methods, in particular, is growing within the field, he said.
However, because tools for converting and exporting these types of data are not yet fully developed, they can only be added to the resource on a "partial" basis, he said, meaning that while they are available for download and searchable via their metadata, they cannot be queried by identifiers like protein IDs.
While not offering the same functionality as the fully supported submissions, the "partial" submission model allows the resource "to avoid having to reject data," Vizcaíno said, and allows it to include data from new experimental workflows.
In addition to providing a means of accessing proteomic data across various repositories, the consortium also has in place steps aimed at preserving data in the event that one of the participating repositories goes offline.
Given the instability of funding for proteomics data repositories, keeping existing resources up and running has proven a significant challenge for the field, with large repositories including the University of Michigan's Tranche and the National Center for Biotechnology Information's Peptidome having gone offline.
Funding "is a problem that everyone has," Vizcaíno said. "A lot of resources are going to have problems, and this situation can't be avoided."
To help mitigate the effects of such funding problems, the ProteomeXchange consortium has arranged for the organization's stable repositories to rescue data from failing repositories as needed.
"In the case that any partner has problems like what happened with Peptidome or Tranche, the other partners will try to rescue all the data from the resource that is disappearing," he said.
The consortium performed such a transfer of the Peptidome data "as a proof of concept, just to demonstrate that it could be done," he noted. "It was quite resource intensive, but it can be done. And this is the only way that long term sustainability can be assured."