It's a familiar adage in the informatics world that the great thing about standards is that there are so many of them. But that way of thinking is about to change in the proteomics community, which has released a roadmap for merging two leading mass spec data formats into one.
Last week, the Human Proteome Organization's Proteomics Standards Initiative issued a list of deliverables that should enable the new format to be completed by the end of 2006. The format, called dataXML, will include features from two mass spectrometry data formats currently in use: mzXML, from the Institute for Systems Biology; and mzData, from HUPO-PSI.
The merger of the two formats is expected to alleviate some confusion in the proteomics community that some view as a barrier to adoption of either mzXML or mzData.
"It was very important to have HUPO-PSI and ISB agree to this because they're both seen as the leading groups in proteomics data standards, and one issue with the data standards today was just a bit of unclarity," James DeGreef, vice president of product management at GenoLogics, told BioInform. "Is mzXML going to be the de facto standard, [or] is mzData the community-supported standard? And it sort of made it difficult for large software vendors to get behind and support one or the other or both."
DeGreef added that for many software companies, particularly large ones, "reducing the risk of choosing the wrong format is very important, especially as people have big ISO processes around quality assurance that they have to meet. So it's not an easy decision going one way or the other."
As with most bioinformatics firms that specialize in proteomics, it made sense for GenoLogics to support both formats, so the company's flagship Proteus software is compatible with both mzData and mzXML. Nevertheless, DeGreef added, "it will help to just support one." In addition, he noted, for larger software firms with a broader focus, or instrumentation vendors, "they wouldn't want to support two. They only want to support one, and because there were two, they support none."
Echoing DeGreef's comments was Adam Rauch, a software developer at proteomics software consultancy LabKey and an affiliate of the Computational Proteomics Laboratory at the Fred Hutchinson Cancer Research Center.
"Right now, there's some confusion in the marketplace over, 'Should I do mzXML? Should I do mzData?' And traditionally when that happens, people do neither. They just kind of wait," Rauch said. "There are some tools that use mzXML, there are some tools that use mzData, but a lot of vendors of instruments and software are just sort of in a wait-and-see position. So having a single standard out there basically gives no one any excuses anymore."
Rauch said that while maintaining software for two formats isn't technically difficult, it can be a "distraction" for many labs. As an example, he said that the Fred Hutchinson Center has a new mass spec instrument, "but it doesn't export in mzXML, which is what our whole system uses right now." While the CPL team is working on a way to convert the data from the system into mzXML, "it's really provided a bit of a roadblock," Rauch said. "They could be using the machine and doing high-throughput data analysis today, but they're not because of this file format problem."
Brian Pratt, vice president of informatics at Insilicos, described the existence of two very similar — yet incompatible — formats within the relatively small proteomics community as a "historical accident."
Randy Julian, chairman of HUPO-PSI's mass spec working group, explained that the two standards were developed with slightly different goals in mind, at around the same time. HUPO-PSI very much wanted to ensure that its mass spec format was compatible with broader XML-based standards being developed by the American Society for Testing and Materials, and also wanted to create an interchange format that would work across different labs. "So the task of the group was to bring in instrument vendors and to do what was necessary to make it easy for the formats to be supported by the instrument manufacturers," he said.
The result was mzData, which a number of vendors — including Thermo Electron, Applied Biosystems, Agilent Technologies, Bruker Daltonics, Waters, Matrix Science, and Kratos — have since pledged to support.
ISB, meanwhile, was using several mass specs from a number of different vendors and needed a neutral exchange format to move data through its analytical pipeline, and it needed it quickly. Thus mzXML was born.
While the two groups kept abreast of each other's activities — and Julian stressed that "there was never a disconnect, there was never a rivalry between the two standards" — there were some important differences.
"Some of the techniques that were used in the early versions of mzXML were really designed to optimize its use in the lab, and they really inhibited the ability to interchange data using that format," Julian said. "They used some very specific non-XML types of technologies that both the vendors and the academic groups that we had assembled viewed as a little bit dangerous if you're asking multiple groups as a community to create these files."
On the other hand, within a single lab like ISB, mzXML offered better performance, making it a better option for some users.
Since the two standards initially emerged in 2004, they have actually begun growing closer together, Julian noted, "so by the time that mzXML 3.0 came out this year … if you took a hard look at what the differences were, technologically, between the two formats, those differences weren't big enough to warrant having tools being built off of two different, roughly identical formats."
Recognizing this trend, the two groups agreed last summer to merge the two formats, and finalized the roadmap at a PSI workshop in San Francisco in April. HUPO-PSI officially released the roadmap at the American Society for Mass Spectrometry conference in Seattle last week (see below for further details on the roadmap and the dataXML format).
Will Vendors Adopt It?
The new standard is welcome news for those instrumentation vendors who have already committed to supporting mzData, and is expected to drive adoption among those who are still on the fence.
Sean Seymour, a staff scientist in the mass spec R&D group at Applied Biosystems/MDS Sciex, noted in an e-mail message that the converged mzData and mzXML format will offer definite advantages for mass spec vendors. "Although all vendors are now supporting the HUPO-PSI format, mzData, many of us have customers who are also using tools that require the ISB format, mzXML. Regardless of what portion of people use one or the other, this effectively doubles a vendor's development and support costs."
Seymour added, "All vendors are in agreement that a single standard is the right thing to do scientifically, and it's equally wasteful for all of us to spend on supporting multiple 'standards' instead of putting that money toward developing better technologies."
Erik Nilsson, president of Insilicos, noted that the field is in "an unprecedented age of productivity in mass spec instrument development, and people [won't] want to do new support for instruments twice if they don't really have to, so I think that having one standard that everyone is behind is going to drive faster adoption by the instrument companies."
LabKey's Rauch said that vendor buy-in is especially important for new instruments, because it's difficult for software developers to keep up with them. "There are converters from these vendor-specific formats into mzXML, but they don't cover every last machine that's out there. And new machines come out, they change their binary proprietary formats, so the converters have to change," he said.
Researchers want a single format that works out of the box and can be integrated with legacy analytical pipelines, Rauch said. "When we make recommendations to our clients, that's one of the things that we're going to recommend. Near the top of the list is exporting to this [format], being able to get the results quickly and easily in this common format. And instrument vendors that lag in that area are probably going to face some difficulty in the marketplace," he said.
— Bernadette Toner
Best of Both Worlds? A Closer Look at the Planned dataXML Format
Randy Julian, chairman of HUPO-PSI's mass spectrometry working group, told BioInform that dataXML is expected to include the best features of mzData and mzXML, while eliminating some of the drawbacks of each one.
One difference between the two, he said, was that "in mzXML it was very easy to tell whether or not the file was valid, but it meant that you had very frequent revisions of the file format, which meant that there were lots of variants of the file out in the field, and people complained about that."
On the other hand, with mzData, "while we have not changed the file format in over a year, we had controlled vocabulary terms and so it's possible to create a 'correct' XML document, but somehow leave out an important controlled vocabulary term, like an instrument variable that's required to understand what's in the file."
To address this particular issue, the merged standard will include validation programs that take a file describing an accepted nomenclature for mass spectrometry, "and then combine that with an understanding of what the minimum expectations are about what gets reported, and determine whether or not a particular XML file has everything that it needs, and is in fact a valid file and can be read in without fear."
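To make the idea concrete, the following is a minimal Python sketch of that kind of two-part check, not the PSI validator itself. It assumes a hypothetical `cvParam` element with an `accession` attribute, and the vocabulary entries and the list of required terms are invented for illustration; the real nomenclature file and minimum reporting requirements had not been published at the time of the roadmap.

```python
import xml.etree.ElementTree as ET

# Hypothetical controlled vocabulary (accession -> term name).
# A real validator would load this from the agreed nomenclature file.
CONTROLLED_VOCABULARY = {
    "EX:0001": "instrument model",
    "EX:0002": "ionization type",
    "EX:0003": "ms level",
}

# Hypothetical minimum reporting requirements: terms every file must carry.
REQUIRED_TERMS = {"EX:0001", "EX:0002"}


def validate(xml_text):
    """Return a list of problems; an empty list means the file passes."""
    # Step 1: the document must be well-formed XML (raises ParseError if not).
    root = ET.fromstring(xml_text)

    # Step 2: collect every controlled vocabulary accession used in the file.
    seen = {
        param.get("accession")
        for param in root.iter("cvParam")
        if param.get("accession")
    }

    problems = []
    # Terms that are not in the accepted nomenclature.
    for accession in sorted(seen - set(CONTROLLED_VOCABULARY)):
        problems.append(f"unknown term: {accession}")
    # Required terms that the file left out.
    for accession in sorted(REQUIRED_TERMS - seen):
        name = CONTROLLED_VOCABULARY[accession]
        problems.append(f"missing required term: {accession} ({name})")
    return problems
```

A file that parses cleanly but omits the instrument term would come back with one "missing required term" entry, which is the situation Julian describes for mzData: correct XML that still lacks information needed to interpret it.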
The merged format will also use IUPAC standard nomenclature for describing features of the spectra and features of the instrument, and will be developed with an eye toward digital signatures for clinical proteomics, biomarker discovery, and other research that may at some point be subject to regulatory approval.
The standard will also have "the performance characteristics of mzXML," Julian said. "It will have binary data integrity signature capability as well as binary pointers into individual spectra, which will make it perform almost as well as the vendor-proprietary file format, but instead of using a custom mechanism for doing that, we intend to use XML standards."
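As a rough illustration of why pointers into individual spectra help performance, the Python sketch below builds a byte-offset index over a file and uses it to pull out one spectrum without parsing the rest. The `spectrum` element, its `id` attribute, and the function names are assumptions made for this example, and the SHA-256 digest is only a stand-in for an integrity check; it is not a digital signature and not the mechanism the working group ultimately chose.

```python
import hashlib
import re

# Hypothetical element layout: <spectrum id="...">...</spectrum>
SPECTRUM_START = re.compile(rb'<spectrum id="([^"]+)"')
SPECTRUM_END = b"</spectrum>"


def build_index(data: bytes) -> dict:
    """Map each spectrum id to the byte offset of its opening tag."""
    return {
        match.group(1).decode(): match.start()
        for match in SPECTRUM_START.finditer(data)
    }


def read_spectrum(data: bytes, offset: int) -> bytes:
    """Return one spectrum element, starting at a known byte offset."""
    end = data.index(SPECTRUM_END, offset) + len(SPECTRUM_END)
    return data[offset:end]


def checksum(data: bytes) -> str:
    """Hex digest that changes if any byte of the data changes."""
    return hashlib.sha256(data).hexdigest()
```

With a real file on disk, a reader would store the index alongside the data and seek to the recorded offset, so the cost of fetching one spectrum does not grow with the size of the run.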
The format will also be compatible with new standards being developed for the Semantic Web, Julian said, although he added that specific decisions along those lines have yet to be made.
Brian Pratt, vice president of informatics at Insilicos, said that while the "smithing" of the standard may be challenging, the implementation of it should be "pretty simple."
The primary difficulty in the process, he said, "is that every instrument manufacturer has something that makes their machine unique, that gives them a competitive advantage. So how do you describe in a single format all these different small variations? And that's a process that can take a while to put together because you do want to make sure that no one's been left out."
However, he noted, "once all that's been ironed out and all the big players agree that, 'Yes, we feel this format will describe what our instrument can do,' then for the software developer to implement readers and writers — that's not really a big deal."
As for the timeline for the new format, HUPO-PSI plans to have a UML data model and ontology models ready by August; and documentation, a draft specification of the schema, and an API in place by September. By December, the group plans to have binary indexing and signature programs, a validation program, and reference implementations of converters available.