As scientists increasingly adopt proteomics as a research tool and mass spectrometers reach ever-faster acquisition rates, the amount of data being generated by the field is growing exponentially.
At the same time, however, proteomics data repositories are struggling to stay afloat, with resources like the National Center for Biotechnology Information's Peptidome database and the European Bioinformatics Institute's International Protein Index shutting down, and others like the University of Michigan-based Tranche repository having to cut back activities due to lack of stable funding.
Such difficulties have hampered efforts by journals and scientists to make proteomics datasets more widely accessible and, according to several people interviewed by ProteoMonitor, are likely reducing funding agencies' returns on their investments in proteomics research.
"There are essentially three main proteomics data repositories," said Eric Deutsch, senior database designer at Seattle's Institute for Systems Biology and head of the ISB's PeptideAtlas repository. "There's PRIDE from the EBI in the UK, Tranche, and then PeptideAtlas."
Each of these resources, Deutsch said, was developed as part of a larger research project, and while "they've proven very valuable to the community, when those original projects were done, it has been very difficult" to get funding to keep the repositories up and running.
In the case of PeptideAtlas, funds for its maintenance have come through other ISB projects, like the SRMAtlas, which is funded with $2.7 million from the National Human Genome Research Institute under the American Recovery and Reinvestment Act as well as $4.1 million from the European Research Council (PM 9/24/2010).
Tranche, on the other hand, has struggled to find a steady source of money since the funds it was receiving through the National Cancer Institute's Clinical Proteomic Technologies for Cancer initiative ran out around the end of 2010. Due to this lack of funds, the resource, which is a primary repository for raw mass spec data, has been forced to operate at reduced capacity for much of this year.
"The development funding for Tranche came from the National Center for Research Resources," said University of Michigan researcher Phil Andrews, leader of the Tranche project. "Funds were available for about two to three years to develop that capability, and it was initially just a pilot project to see if it was feasible."
"We developed the prototype system, and it worked really quite well," he said. "We brought it online for an interim time as a service, and an increasing number of people were using it. So once the [NCRR] funding was completed, we were able to get funding from the CPTC project for a couple more years to support the datasets generated by the CPTC centers."
Once this funding ended, though, it proved difficult to find another source, Andrews told ProteoMonitor, particularly given the expense of keeping up with the growing demand for the system.
"What we found was that there is this feeling that once you develop a system, the ongoing maintenance costs are really going to be quite low," he said. "That's true if it's just sitting on a server and you don't have to do anything but maintain that server. But maintaining a system like [Tranche] you have to constantly revamp your system. You adopt new technologies. You make it as efficient as you can. And these things keep growing. Tranche started growing so quickly that it was difficult to keep up with the limited resources that we had available. It was the classic [case of] becoming a victim of your own success."
Last year, as the repository approached the end of its funding, its two primary developers left for other jobs, and with no new source of support lined up, it was essentially impossible to fill their positions, Andrews said. Meanwhile, with use of the repository increasing exponentially, unexpected memory problems emerged, making it difficult to keep the system's servers consistently up and running.
"You can see the kinds of problems you run into," Andrews said. "You don't even have to run out of funding; you just have to get close to the end of funding."
"The key thing for these kinds of resources is to have a stable source of funding," he added. "That doesn't mean that it has to be funded forever, but that you have a reasonable expectation of funding and review on a regular basis to evaluate whether it's a cost effective service or not."
This sort of stable funding is hard to come by, though, Deutsch told ProteoMonitor. While it might be expected that the National Institutes of Health would help support repositories for storing and sharing the data generated by the research projects it paid for, in practice, he said, there's no clear mechanism at the agency for funding such ongoing maintenance.
"NIH for the most part is in the business of funding new research projects that are trying to improve human health, but NIH has rather few mechanisms for continuing maintenance" of data repositories, he said. "We at PeptideAtlas and the folks at Tranche have applied for some additional funding to keep those repositories going, but none of the [agency's requests for applications] seem to fit with what we're trying to do, so it's been difficult. We haven't been successful in getting funding."
This lack of resources for data repositories has affected the proteomics community, particularly with regard to the field's journals, some of which have been forced to relax guidelines mandating the submission of raw data with all papers due to the repositories' difficulties.
The journal Molecular & Cellular Proteomics, for instance, had mandated that all papers be accompanied by the submission of their raw mass spec data, but in light of Tranche's troubles, the editors have put that requirement on hold.
We've "had to back away from that because of the problems with Tranche," Ralph Bradshaw, a University of California, San Francisco, researcher and the co-editor of MCP, told ProteoMonitor. "Right now there really isn't an adequate place to deposit raw data, and so the journal decided that it couldn't require people to do something that, in fact, there simply wasn't adequate means to do. It's a temporary thing. It's our intention to go back to this just as soon as we feel the situation has corrected itself."
The Journal of Proteome Research would also like to require deposition of raw mass spec data with all the papers it publishes, editor William Hancock told ProteoMonitor, but, like MCP, it currently finds this requirement untenable given Tranche's funding problems.
The chair in Bioanalytical Chemistry at Northeastern, Hancock is also a co-chair of the Human Proteome Organization's Chromosome-Centric Human Proteome Project. In addition to raising issues for journals, the lack of a stably funded proteomics data repository, he said, hampers productivity across the field more generally.
"Without this sort of activity, you're really not getting a proper return on all your investment in individual [investigators], in mass spectrometers, in collection of samples," he said. He cited the example of work done by his C-HPP co-chair Young-Ki Paik, a researcher at South Korea's Yonsei University.
"[Paik] has a very nice [proteomics] dataset for placenta," Hancock said. "And we were looking at the results and there were some interesting observations, and there were two questions: Had this been seen before? Is this a good mass spec identification?"
To answer such questions, he said, "you really need a good compilation of proteomic results."
Data sharing is typically considered an important practice across the sciences, but Andrews suggested that it's perhaps even more important for data-intensive disciplines like proteomics, where a paper's original authors may have been interested in only a small slice of the information their study generated.
"What we do often in proteomics is that we have a specific aim or two that we're trying to address in a given experiment, and we generate a large dataset but we may only be interested in one aspect of it – say phosphorylation or what proteins change levels," he said. "But there's a huge amount of data in there, and that could be used by other laboratories if that data were made available. The idea is you get value added. If you get two labs using a dataset, then you've basically doubled the cost-effectiveness of that experiment."
Deutsch agreed, noting that "just by the nature of proteomics data, much of the value that lies in that data is not published or extracted by the original authors, and if it just then sits on their computer then that information is lost."
If, on the other hand, "it can be submitted and maintained in a public repository, then there are quite a few other groups that would be eager to use that data and extract more information out of that data," he said. "I think that would be a very inexpensive way for NIH to get even more benefit from the research that they're funding."
In 2009, NIH provided roughly $375.5 million in funds for proteomics research. Andrews estimated the annual cost of maintaining a repository like Tranche would be around $300,000, which would include salaries and benefits for two developers, a help desk officer, and a part-time systems administrator as well as money to cover the cost of server upkeep.
He suggested several possible reasons for why obtaining money for such a resource had proven difficult, including the fact that it would be an open-ended commitment as opposed to a discrete research project and that there is no obvious disease or translational focus to the work. NIH has also faced budget constraints over the last several years, limiting its ability to grant awards beyond its research funding mission.
"We were not able to get funding from NIH, and it wasn't clear that there was a reasonable funding mechanism for that or that sufficient funds were available to those programs that did exist," Andrews said. "So I think the infrastructure at NIH is just not there to support these kinds of things."
"The US government, which is where [Andrews] got the money to start [Tranche], has a track record of this," Bradshaw said. "They'll lay out billions [of dollars] in research, but then they won't pay a penny to preserve the data. It's very short-sighted, but then who ever said the US government wasn't shortsighted?"
NIH does, of course, fund a wide array of data repositories for other omics disciplines, but most of that funding is directed internally to the National Center for Biotechnology Information, which hosts genomics resources such as GenBank, the Gene Expression Omnibus, and other databases. NCBI had also funded the Peptidome protein database but stopped supporting that resource in February.
NIH officials were unavailable to comment in time for this publication.
Given the difficulty of obtaining government funds for proteomics data repositories, researchers are looking for alternative ways to support these resources. One option that's been raised is having scientists pay to deposit their data in a repository – a cost similar to the page fees some journals charge for publishing papers.
For such an approach to work, though, the majority of proteomics journals would have to make submission of raw mass spec data mandatory, Deutsch said.
"If most of the proteomics journals required submission to one of the approved repositories and there was a mechanism to pay for that, then that is potentially a way to make it happen," he said. "I think relatively few researchers would be willing to pay that money unless it were required."
Bradshaw compared such an approach to the current journal publishing environment, where private publishers like Wiley and Elsevier typically publish papers for no fee and make their money entirely off subscriptions; society-based journals like MCP typically charge authors page fees to supplement the money they make off subscriptions; and open-access journals make papers immediately available in exchange for a publishing fee.
"You can pretty much judge who would like which models," he said. "If you're a small investigator with one grant, you're not going to want to pay $3,000 so you're likely to publish in a journal that doesn't cost you anything. The middle ground researcher who has some money but not an excessive amount tends to publish in the society journals because they have higher prestige. And finally you have the types for which money is no object, and they'll pay to have it permanently available from the get-go."
"So who's going to pay for [storing their data in a repository]? Well, the people who have lots of money are going to pay, but the people who don't have lots of money are probably not going to pay," he said. "There are a lot of people who are interested in this [problem], so I think in the long run a solution will be forthcoming. But I can't point to one right now and say what it is."