PSI, Mass Spec Vendors Developing New Data Format to Deal With Mass Spec Data Analysis


Researchers involved in the Human Proteome Organization's Proteomics Standards Initiative are collaborating with mass spec and mass spec data-analysis software vendors to develop a new data format that could help scientists compare peak lists regardless of which analysis software or mass spec instrument is used.

The format, called analysisXML, was written up in draft form by a working team of about five people who met last month in Seattle with researchers from the Institute for Systems Biology, a collaborator, according to Randall Julian, the chair of the mass spec working group within PSI. Julian is also a scientist at Indianapolis-based Indigo BioSystems.

"Without this, there is almost no data sharing going on," said Julian. "This will allow people to report on what they saw in a common format, regardless of whether they analyzed their data by Mascot, Sequest, or by hand."

The development of analysisXML follows the development of mzData, an open-data format that allows mass spec users to share their data sets more easily (see ProteoMonitor 11/19/2004).

"MzData's stop point is the peak list," said Chris Taylor, a software engineer at the European Bioinformatics Institute who spearheaded the collaborative effort between companies and academic institutions to produce and adopt mzData. "The next nugget is, 'What do I do with those data sets?'"

Both Taylor and Julian said that the next logical step after getting mass spec vendors to adopt mzData was the development of a standard way for comparing mass spec data analyses.

"Let's say I read a paper about a study that looked at serum from a diabetic rat," Julian said. "The study says that this certain protein is upregulated, and I want to see the spectra — did the researchers have enough resolution? Did they even calibrate their instrument? With analysisXML, you have a common data format that allows people to compare analyses."

Julian noted that the HUPO Brain Proteome Project could well benefit from the development of analysisXML because the project calls for the use of four different search engines to analyze mass spec data: Sequest, Mascot, GeneBio's Phenyx, and Protagen's PFF Solver (see ProteoMonitor 8/5/2005).

"They're going to want to do a meta-analysis of those results to reach some final conclusion," said Julian. "With analysisXML, in one file, you can capture all four of those analysis results."

Julian added that right now data from the HBPP project is inaccessible to 99 percent of bioinformaticians because the data is stored using Bruker's ProteinScape software, to which most bioinformaticians don't have access.

However, Herbert Thiele, the director of bioinformatics at Bruker Daltonics, said that the ProteinScape software is not so inaccessible.

"ProteinScape is not a black box, but an open source software that is fully understandable," said Thiele. "ProteinScape fully implements the mzData format, so that all data can be exported in that data file format."

Thiele added that Bruker will immediately adopt analysisXML as soon as it is ready.

Helmut Meyer, the chair of the HBPP who is also a professor at the Medical Proteome Center in Ruhr-Universität Bochum, Germany, said that the development of analysisXML is a very important step toward being able to cross-analyze bigger data sets.

"Every search engine will come up with some single hits that are not defined by some other search engine," said Meyer. "We need to learn about how to cope with hundreds of thousands of data sets — this is important for really making progress on the data."

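The kind of cross-engine comparison Meyer describes can be illustrated with a minimal Python sketch. It does not use the analysisXML schema, which is still in draft; the peptide labels and the hits assigned to each engine are made-up placeholders, and the function names are hypothetical. It only shows the idea of pooling identifications from several search engines and flagging hits that a single engine reports.

```python
from collections import defaultdict

# Hypothetical identifications per search engine. The peptide labels are
# placeholders, not real HBPP results and not analysisXML content.
hits_by_engine = {
    "Sequest": {"peptide_A", "peptide_B", "peptide_C"},
    "Mascot": {"peptide_A", "peptide_B"},
    "Phenyx": {"peptide_A", "peptide_D"},
    "PFF Solver": {"peptide_A", "peptide_B"},
}


def engines_per_peptide(hits):
    """Map each peptide to the sorted list of engines that reported it."""
    support = defaultdict(list)
    for engine, peptides in hits.items():
        for peptide in peptides:
            support[peptide].append(engine)
    return {peptide: sorted(engines) for peptide, engines in support.items()}


def single_engine_hits(hits):
    """Return peptides reported by exactly one engine, keyed to that engine."""
    return {
        peptide: engines[0]
        for peptide, engines in engines_per_peptide(hits).items()
        if len(engines) == 1
    }


support = engines_per_peptide(hits_by_engine)
singles = single_engine_hits(hits_by_engine)
```

In this toy data, `peptide_C` and `peptide_D` are each reported by only one engine, while `peptide_A` is reported by all four. A real meta-analysis would also need to weigh each engine's scores, which this sketch ignores.
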
According to Julian, researchers first agreed on the need for a data format to handle analyzed data during a meeting in Nice, France, held last year. Then, during the PSI's spring meeting in Siena, Italy, researchers further hammered out ideas for the new format.

Two formats already developed by the Institute for Systems Biology, called pepXML and protXML, were somewhat close to what researchers were looking for, Julian said. During last month's meeting in Seattle, a working team of about five people used those formats as a starting point to write a first draft of the new data format.

At first, the new format was named mzIdent. But some researchers noted that the new format would be used not only for identification, but also for quantitative data, so the name was scrapped. In Seattle, researchers decided against using "mz" in the name because they realized the new format should be applicable not only for mass spec peak lists, but for other types of peak lists, such as NMR peak lists, as well. Finally, researchers settled upon analysisXML as the name for the new format.

Julian said that researchers and software engineers would work closely with the Institute for Systems Biology over the next year to refine analysisXML, which is currently available for review. Developers of analysisXML hope to have a final draft of the data format ready by next spring, when they will give a presentation on the format at the PSI's 2006 spring meeting.

Once the schema is stabilized, software developers will be free to start using it.

"It's not trivial, but it's not monumental either for vendors to adopt analysisXML," Julian said.

So far, vendors have been extremely cooperative and generous with their employees' time in developing both analysisXML and mzData, Julian noted.

— Tien-Shun Lee

