Proteomics researchers taking part in a National Institutes of Health workshop this week discussed the merits of making raw data sets openly available, as well as the challenges of validating the interpretation of mass spectra.
Speaking at the two-day conference on standards in proteomics, John Yates of the Scripps Research Institute sparked debate by saying that storing large data sets is a problem and that he does not favor distributing them because he does not find reanalyzing data sets to be very useful.
“Data storage is one of the things we’re always fighting with,” he said. “I’m not a big fan of distributing data sets. I find that most researchers rarely go back to the [raw] data sets.”
Mike Snyder, professor and chairman of molecular, cellular, and developmental biology at Yale University, disagreed with Yates.
“I think all the data should be out there,” said Snyder. “It’s a healthy part of science to be critically evaluating. Reevaluating things is a very important way of surfacing information.”
Snyder pointed out that when data is accessible to other researchers, it can be mined for things that the original researchers may not have followed up on.
“Sometimes you could get scooped on your own data, but the odds of getting scooped on your own data are pretty low,” said Snyder.
Jimmy Eng, a researcher at the Institute for Systems Biology, agreed with Snyder.
“If data is not made available, then there’s no way anyone can even attempt to validate,” said Eng. “There should be a central repository where data is made available for some period of time. The issue of having access to large amounts of data and storing data — I think that’s a secondary concern to the questions we should be striving to answer.”
Participants also discussed the merits and limitations of manual validation, as well as the value of a statistical score for the reliability of mass spectrum identifications.
Kathryn Resing, the director of the mass spectrometry facility at the University of Colorado, Boulder, said that she favors using manual analysis to validate the results of search engines such as Sequest and Mascot. However, manual interpretation has drawbacks, including inconsistency and the time needed to train people to do it reliably.
“Manual analysis is dear to my heart, but it’s hard because sometimes when we’re tired and grumpy, we’re a little more stringent, and when we’re relaxed, we’re more flexible,” said Resing.
To reduce such inconsistencies, Resing’s research group manually validates everything twice and applies a standard set of rules.
Marvin Vestal, the vice president of mass spectrometry platform R&D at Applied Biosystems, said that manual validation is not a feasible solution when running high-throughput experiments.
“You have to be able to validate automatically because if you run a million spectra a day, manual validation doesn’t make sense with that amount of data,” said Vestal. “We need to run replicates and see what’s the statistical variation in doing experiments over again.”
Yates addressed the issue of repeating experiments by statistically calculating how many times a shotgun experiment would need to be repeated in order to identify all proteins in the sample.
He estimated that if the same sample of proteins is run nine times, all the proteins should show up. What he found experimentally was that when a sample was run nine times, 620 of the proteins identified showed up all nine times, while 400 or so were identified only once.
“What this shows is that in samples with more high-abundance proteins, if you want to pick up more low-abundance proteins, you need to repeat the experiment over again and again,” said Yates.
Typically, when a shotgun proteomics experiment is repeated once, only seven to eight percent of proteins overlap, said Yates.
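The kind of tally Yates described, how many proteins appear in every run versus only once, can be sketched in a few lines of Python. This is an illustration only, not Yates's method: the function names and the toy protein identifiers are hypothetical, and because the article does not say how overlap is defined, the sketch assumes overlap means the share of proteins seen in either of two runs that were seen in both.

```python
from collections import Counter


def run_frequency(runs):
    """Map each protein ID to the number of runs in which it was identified."""
    counts = Counter()
    for run in runs:
        counts.update(set(run))  # count a protein at most once per run
    return counts


def summarize(runs):
    """Count proteins seen at all, in every run, and in exactly one run."""
    counts = run_frequency(runs)
    n_runs = len(runs)
    return {
        "total": len(counts),
        "in_every_run": sum(1 for c in counts.values() if c == n_runs),
        "in_one_run": sum(1 for c in counts.values() if c == 1),
    }


def overlap_fraction(run_a, run_b):
    """Assumed definition: proteins in both runs / proteins in either run."""
    a, b = set(run_a), set(run_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy, made-up identifiers; not data from the workshop.
example_runs = [
    ["P1", "P2", "P3"],
    ["P1", "P2", "P4"],
    ["P1", "P5"],
]
summary = summarize(example_runs)
```

Applied to nine real runs, the `in_every_run` and `in_one_run` entries would correspond to the two figures Yates reported.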
Researchers cautioned against releasing data that has not been carefully validated.
“If we’re not careful, there’s going to be a lot of junk out there,” said Richard Simpson, a professor at the Ludwig Institute in Melbourne, Australia. “We need to clean this up.”
Simpson pointed out that in analyzing data for the Human Plasma Proteome Project, researchers found that 85 to 90 percent of identifications were single-hit identifications.
“To me, that’s a worry,” said Simpson.
Simpson suggested that in publishing data, single identifications should be highlighted as such, for example in a table format.
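A table of the sort Simpson suggested could be produced along these lines. The sketch is hypothetical: the article does not define a single-hit identification, so the code assumes it means a protein supported by exactly one distinct peptide, and the function names and sample peptides are invented for illustration.

```python
def identification_table(peptides_by_protein):
    """Return (protein, distinct peptide count, single-hit flag) rows."""
    rows = []
    for protein in sorted(peptides_by_protein):
        n_peptides = len(set(peptides_by_protein[protein]))
        # Assumption: "single hit" = exactly one distinct supporting peptide.
        rows.append((protein, n_peptides, "yes" if n_peptides == 1 else "no"))
    return rows


def format_table(rows):
    """Render the rows as a plain-text table with a single-hit column."""
    lines = [f"{'Protein':<10}{'Peptides':>9}  Single hit"]
    for protein, n_peptides, flag in rows:
        lines.append(f"{protein:<10}{n_peptides:>9}  {flag}")
    return "\n".join(lines)


def single_hit_fraction(rows):
    """Share of proteins in the table flagged as single-hit."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row[2] == "yes") / len(rows)


# Made-up peptides for illustration; a repeated peptide counts once.
sample = {
    "P1": ["AAK", "GLR"],
    "P2": ["TTK"],
    "P3": ["VVR", "VVR"],
}
rows = identification_table(sample)
table_text = format_table(rows)
```

Printing `table_text` gives one line per protein with the flag in the last column, and `single_hit_fraction(rows)` gives the kind of percentage Simpson cited.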
“It’s important that we can share data and we all have a common standard as to the reliability of that data,” said Simpson. “It needs to be reproducible. The most important thing is that what we publish is going to be right.”
Alfred Yergey of the National Institute of Child Health and Human Development concurred that raw mass spectrometry data should be made publicly available.
“Filtering for peak detection is useful, but I think we’re missing the boat by not spending more effort on the mass spectra themselves,” said Yergey. “Coming from the mass spec end of this, rather than the protein biochemistry end, I think one of the big things missing is a really good, hard look at the mass spectra that we used for database searching.”