PSI Unveils Roadmap, Timeline for Merging mzData With mzXML to Create Uniform dataXML Format


The Human Proteome Organization's Proteomics Standards Initiative has unveiled a roadmap and timeline for integrating the open data format mzData with the Institute for Systems Biology's mzXML format to create a single standard.

The disclosure, made last week at the American Society for Mass Spectrometry conference in Seattle, comes one year after the PSI and ISB, with broad support from mass spec vendors, decided to merge the two formats.

The combined format, which will be called dataXML, is expected to be mostly completed by the end of the year.

Numerous vendors that currently support mzData, including Thermo Electron, Applied Biosystems, Agilent, Bruker Daltonics, Waters, Matrix Science, and Kratos, are expected to support the combined dataXML.

According to the PSI, the new format will include an interchange schema that has split data vectors compatible with other analytical interchange formats. It will also use a wrapper schema to support both random access indexes and digital signatures.
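
To make the wrapper idea concrete, here is a minimal sketch in Python of a payload that carries a byte-offset index for random access and a checksum computed over the payload. It is an illustration only, not the dataXML schema: the `<payload>` and `<spectrum>` element names and both function names are hypothetical, and a plain SHA-256 digest stands in for a real digital signature, which would also require a signing key.

```python
import hashlib


def build_wrapped_document(spectra):
    """Concatenate spectrum fragments and record where each one starts.

    spectra: list of (spectrum_id, xml_fragment) string pairs.
    Returns (payload_bytes, index, digest). All names are hypothetical.
    """
    body = bytearray(b"<payload>\n")
    index = {}
    for spec_id, fragment in spectra:
        # Byte offset of this fragment inside the payload.
        index[spec_id] = len(body)
        body += fragment.encode("utf-8") + b"\n"
    body += b"</payload>\n"
    payload = bytes(body)
    # Checksum over the exact payload bytes; a stand-in for a signature.
    digest = hashlib.sha256(payload).hexdigest()
    return payload, index, digest


def read_spectrum(payload, index, spec_id):
    """Jump straight to one fragment without scanning the whole payload."""
    start = index[spec_id]
    end = payload.index(b"\n", start)
    return payload[start:end].decode("utf-8")


spectra = [("s1", '<spectrum id="s1"/>'), ("s2", '<spectrum id="s2"/>')]
payload, index, digest = build_wrapped_document(spectra)
second = read_spectrum(payload, index, "s2")
```

Because both the offsets and the digest depend on the exact bytes of the payload, any reformatting of the XML has to happen before they are computed.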

The new format will also include tools to support developers and users, including: a co-localization program to format XML documents before binary indexes or signatures are computed; a validation program to ensure that the use of controlled vocabulary terms matches MIAPE requirements; an application programming interface that supports several popular programming languages; and abstract data models and other documentation to help software developers who want to implement systems based on the interchange format.

The PSI expects to complete a data model and ontology models in August, and to finish documentation, a draft schema specification, and language bindings in September. In December, the group expects to complete binary indexing and signature programs, a validation program, and reference implementations of converters.

Getting There

According to Randy Julian, chairman of the PSI's mass spec working group, the PSI had always kept members of the ISB, in particular lead mzXML developer Patrick Pedrioli, in the loop while developing mzData. The groups decided to merge their formats at an ISB workshop in Seattle last summer, although Julian said that the ISB had already begun incorporating some mzData approaches into mzXML version 3.0, which was released this year.

"By the time mzXML 3.0 came out this year, if you took a hard look at what the differences were, technologically, between the two formats, those differences weren't big enough to warrant having tools being built off of two different but roughly identical formats," said Julian, who is also a scientist at the Indianapolis-based software startup Indigo Biosystems.

Vendors, along with mzData and mzXML developers, agreed during last year's PSI workshop that the two formats should be merged into one so that "we were not diluting our resources in terms of development," Julian said. The integration comes about a year and a half after mass spec vendors began incorporating mzData into their products (see ProteoMonitor 9/9/2005).

"The most important thing, as far as we, as vendors, are concerned is that there is one standard going forward," said Robert Barkovich, a product marketing specialist for bioapplications software at Thermo Electron. "If there is one standard, then we code for one standard. We're concerned about a multiplicity of many different standards popping up."

Thermo was one of the first mass spec vendors to launch a product -- the Bioworks 3.2 proteomics software -- that incorporated the PSI's mzData format (see ProteoMonitor 1/28/2005). Barkovich said he expects the next release, Bioworks 3.4, to support the dataXML format.

"I think the nice thing about the merger is that you're going to see the various parts of both standards that are the most popular," he said. "The new standard is going to be the best of both" mzData and mzXML.

Sean Seymour, a staff scientist involved in developing mass spec informatics at Applied Biosystems/MDS Sciex, agreed with Barkovich that having a single, merged standard will be better than having both mzData and mzXML.

"Although all vendors are now supporting the HUPO-PSI format, mzData, many of us have customers who are also using tools that require the ISB format, mzXML. Regardless of what portion of people use one or the other, this effectively doubles a vendor's development and support costs," said Seymour. "All vendors are in agreement that a single standard is the right thing to do scientifically, and it's equally wasteful for all of us to spend on supporting multiple 'standards' instead of putting that money toward developing better technologies."

For ABI, dataXML will offer better modeling of MRMs, which is the key scan type in the company's MIDAS workflow, Seymour added (see ProteoMonitor 2/16/2006).

Historically, mzData is based on an effort by the American Society for Testing and Materials to create an XML standard for all instrumental methods of analysis, Julian said (see ProteoMonitor 2/3/2003).

Julian was a member of the ASTM committee in charge of creating such a standard and he was invited to attend a PSI meeting to talk about it. Since the mission of the PSI from the beginning was to create an interchange format that would be generated by instrument manufacturers, vendors were also invited into the PSI group.

At about the same time that the ASTM/PSI standard was being created, the ISB began creating a format of its own to help its researchers deal with the multiple brands of instruments that they had in their laboratories.

"They wanted a neutral format to move from measurement, all the way to the end of their internal pipeline," said Julian.

The ISB's internal pipeline became known as the "Trans-Proteomic Pipeline," and the data format created was called mzXML.

After talking with mzXML author Pedrioli about the ISB's data standard, the PSI concluded that the standard was rather specific to the workflow at the ISB, and not generalizable to the proteomics field as a whole.

"The conclusion was that there were some very nice features of mzXML and we would find a way to use those, and then there were some features that had high performance, but limited the interchangeability, so we would find alternatives for those components," said Julian.

As mzData moved forward, ISB developers began adopting some of the PSI format's approaches, such as splitting peak data into separate x and y vectors, and adopting controlled vocabulary mechanisms.
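
The split-vector approach can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not either format's actual encoding: the function names, the choice of little-endian 64-bit floats, and the sample peak values are all hypothetical.

```python
import base64
import struct


def split_peaks(peaks):
    """Turn [(mz, intensity), ...] pairs into two parallel vectors."""
    mz = [p[0] for p in peaks]
    intensity = [p[1] for p in peaks]
    return mz, intensity


def encode_vector(values):
    """Pack floats as little-endian 64-bit values, then base64-encode."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_vector(text):
    """Reverse encode_vector: base64 text back to a list of floats."""
    raw = base64.b64decode(text)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


# Made-up peak list for illustration only.
peaks = [(400.25, 1200.0), (401.5, 340.0), (512.75, 89.5)]
mz, intensity = split_peaks(peaks)
encoded = {"mz": encode_vector(mz), "intensity": encode_vector(intensity)}
```

Keeping the x and y values in separate arrays means a reader can decode only the vector it needs, and each array can be described independently, for example with its own controlled vocabulary term.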

Once it was decided that mzXML and mzData should be merged, it took about a year for the PSI committee to come up with a roadmap and timeline for the merger.

-- Tien-Shun Lee