Researcher Warns Lack of Raw Data Sharing Is Making Proteomics Vulnerable to Fraud


According to one proteomics researcher, a lack of requirements around the deposition of raw mass spec data has left the field vulnerable to shoddy work and fraudulent results.

Ben Orsburn, senior applications scientist at Thermo Fisher Scientific, told ProteoMonitor this week that proteomics' increasingly large datasets and sophisticated statistical techniques have created opportunities for researchers to push questionable findings into print, a problem that he said has touched even some of the field's top journals.

While acknowledging that he had not collected definitive proof of such activity, Orsburn said that he and others in the field have noticed a number of papers boasting results that would seem implausible given the workflows and instrumentation being used.

Such activity is particularly prevalent, he said, in studies focused on maximizing peptide and protein IDs, "where people [say] … 'All I want to say is I got this many peptides and this many proteins, and that is the paper.'"

Initially raising the issue on his blog, Proteomics News, Orsburn said he believed that one of the main ways researchers manipulate their findings is by tinkering with the statistics used to calculate an experiment's false discovery rate.

"You can monkey around and change a few things, and suddenly you're competitive with what the Max Planck Institute can get on the same instrument," he said, noting that the increasing complexity of the methods used for calculating FDRs means experienced biostatisticians can effectively hide such manipulation.

"There are absolutely ways of burying it," he said. "Biostatisticians can absolutely bury the FDR in a whole slew of equations that as a biologist there is no way I can dig through. I can't assess the validity of it because I don't have the years of mathematical background to take it apart."

And while researchers could simply rely on straightforward decoy database searches to calculate FDRs, "more advanced statistical approaches like [the University of Washington's] Percolator [algorithm] have shown that they can generate really good results that just a standard 1 percent FDR cut-off will miss," Orsburn said.
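The "straightforward decoy database search" Orsburn refers to can be sketched in a few lines. In the target-decoy approach, spectra are searched against a database containing both real (target) and reversed or shuffled (decoy) sequences, and the FDR at a score cutoff is estimated from the ratio of decoy to target hits above that cutoff. The sketch below is a simplified illustration, not any specific tool's implementation: the peptide-spectrum-match (PSM) scores are hypothetical, and real pipelines differ in details (some, for instance, estimate FDR as 2·decoys/(targets+decoys) for concatenated searches).

```python
# Minimal target-decoy FDR sketch. Each PSM is a (score, is_decoy) pair;
# in practice these come from a search engine's output, not hand-typed lists.

def fdr_at_threshold(psms, threshold):
    """Estimate FDR above a score cutoff as (# decoy hits) / (# target hits)."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

def threshold_for_fdr(psms, max_fdr=0.01):
    """Find the most permissive score cutoff whose estimated FDR stays
    at or below max_fdr (i.e., the cutoff that maximizes accepted IDs)."""
    best = None
    for score, _ in psms:
        if fdr_at_threshold(psms, score) <= max_fdr:
            if best is None or score < best:
                best = score
    return best
```

The choice of cutoff is exactly where the IDs-versus-confidence trade-off Orsburn describes lives: loosening the threshold inflates the peptide count at the cost of admitting more decoy-level matches.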

Olga Vitek, a researcher at Purdue University and an expert in the statistics of mass spec-based proteomics, agreed with Orsburn that the field suffers from improper use of statistics. However, she told ProteoMonitor that she believes it is largely due to a lack of researcher expertise, as opposed to willful fraud.

"Essentially what people say is: 'We have a menu of options to use, so, out of those, here is a subset that we can implement in our workflow, and from those we will pick the one that gives us the most IDs,'" she said. "And what is lost in the process is that there is a reason why there are these different methods, because they are for use in different contexts, and they use different assumptions and so on."

The statistics underlying mass spec-based proteomics are "very difficult," Vitek said. "What I see a lot is people are asking for [integrated] pipelines where you put your samples in and you get p-values and your IDs at the end, and they are hoping that by pushing a button the whole thing will run."

"But that is not possible because even one small change in the workflow will require different statistics," she said. "So you need a statistics expert, just like you need a chromatography expert and a mass spectrometry expert."

This statistical complexity, Orsburn said, also makes access to raw data especially important: while a researcher might not have the expertise to evaluate the statistics behind an experiment's FDR, they can evaluate the quality of its spectra.

"At the end of the day, I can take a look at some of the raw spectra and say [for instance], 'There is no possible way that this is a valid spectra'," he said.

To do this, however, researchers need access to an experiment's raw data, something that, Orsburn said, can be difficult to obtain. He cited the case of a colleague who he said in the last year has requested data from five different groups, in each case receiving no reply.

Sharing of raw data has long been an issue in proteomics. The volume of mass spec data has grown exponentially as scientists increasingly adopt proteomics as a research tool and instruments reach ever-faster acquisition speeds. And as this flow of data has increased, it has proved challenging to build and maintain repositories for holding and sharing it.

This, in turn, has informed journals' decisions as to whether or not to require researchers to deposit raw mass spec data for published papers. The journal Molecular & Cellular Proteomics, for instance, once mandated that all papers be accompanied by the submission of their raw mass spec data. The journal suspended that requirement in 2011, though, in response to difficulties with the University of Michigan's Tranche database – at that time the field's primary repository for raw mass spec data.

Tranche's troubles affected policies at other journals, as well, including the Journal of Proteome Research, which similarly chose not to require deposition of raw data given Tranche's instability.

In recent years, several potential alternatives to Tranche have arisen, including the European Bioinformatics Institute's Proteomics Identifications Database, which in 2012 began accepting raw data; the Institute for Systems Biology's PeptideAtlas; the University of California, San Diego MassIVE database; and the University of Washington's Chorus database.

In an email to ProteoMonitor, University of California, San Francisco researcher Alma Burlingame, editor of MCP, said that the journal was "planning to lift its raw data moratorium sometime quite soon," noting that it led a recent meeting at the National Institutes of Health discussing the matter.

William Hancock, editor of JPR, likewise told ProteoMonitor that "important steps are being made" in terms of establishing a policy on raw data.

Both editors noted that their journals have levels of review designed to keep out the sort of studies that have concerned Orsburn. MCP, for instance, "requires annotated tandem spectra for all single peptide IDs, and for all covalently modified assignments, to allow others to evaluate whether the interpretations appear correct," Burlingame said.

"This compliance check that MCP carries out on every submission usually detects and deals with this kind of issue," he added.

Vitek, who is on the editorial board of MCP, noted, however, that there are currently too few statistical experts to effectively review all the papers produced by the field.

"There are not enough statisticians in this area, that is for sure," she said. "There are only so many people who can do this [sort of] review, and there are only so many reviews that [an individual researcher] can do. The journals know that they need it. I personally get a lot of requests. But I don't accept all of them because I would be doing [reviews] full-time. It's just not possible."

Orsburn said, in fact, that the most egregious example he'd come across in the last year was a paper published in MCP, although he didn't specify the paper or group.

"I thought it was the most blatant misuse of statistics, just an absolute abuse of the system," he said, adding that "if that raw data were available, we would force a retraction right now. But the data isn't available, so we can't do anything about it."

Given the massive amount of data generated by a typical mass spec-based proteomics experiment, sharing data with outside researchers can be inconvenient, Orsburn noted. "If you're looking at 100 Q Exactive runs, for instance, that's 200 gigabytes of data."

And this, in turn, lets researchers hide behind these large datasets, he said. "Until the journals go back to requiring raw data upload, it just invites abuse where if you are a little unscrupulous, and you know you have something to hide, you can create a very large buffer by not making that data publicly available."