One of the great challenges for bioinformatics over the next decade will be erecting a framework that allows the interpretation of transcriptomic and proteomic data in a consistent and informative way. At the moment, large amounts of both types of data are being generated by laboratories all over the world, but little headway has been made in linking the results of bioinformatics analysis into the larger, worldwide accumulation of archived analytical results. For now, biologists must interpret their experiments in a vacuum.
Creating archives that allow rapid query and visualization of proteomic and transcriptomic information will not be easy. The two fields have developed in near isolation from one another: they use different sequence sets, accession numbers, interchange standards, and terminology. Any effort to unite them will require a good set of standards, and the hardest part will be getting the details right. It is possible to jury-rig compatibility between the existing standards (more on that later), but that is not a long-term solution.
I recently attended a meeting organized by the NIH to try to reach some consensus on necessary standards in the sub-field of proteomics associated with the complicated chromatographic and mass spectrometric experimental setups used to produce lists of the proteins in a sample. A wide variety of different experimental and informatics techniques are employed, but they tend to be lumped together under the general name “protein identification.”
A group looking at the very complex problem of how to exchange data on protein-protein interactions has made a lot of headway, and I expect it to arrive at a standard everyone can live with. On the other hand, the group looking at the comparatively simple problem of how to represent a mass spectrum (a histogram made up of mass-intensity tuples) has split into at least two camps girding for what will probably be a war of attrition between competing standards, one European and one American.
Too Many Standards
How could something so simple go so wrong? There are already perfectly acceptable ways to represent these tuples. The most commonly used interchange format is the structured text file specified by one of the pioneering companies in this area, Matrix Science. It was invented as an alternative to the proprietary formats provided by instrument vendors, and it has been widely accepted because it is easy to parse and rigid enough that you cannot express the same thing in too many different ways.
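For reference, a minimal spectrum in that style of structured text file looks roughly like the following (the peak values here are made up for illustration):

```
BEGIN IONS
TITLE=Example spectrum
PEPMASS=445.12
CHARGE=2+
445.12 120.0
446.13 35.0
447.12 12.0
END IONS
```

Each line in the body is one mass-intensity tuple in plain decimal text, which is what makes the format so easy to parse.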
To invent an XML dialect to replace this simple format, the Human Proteome Organization impaneled a subcommittee of its Proteomics Standards Initiative. Now, after several years of deliberation, the committee has published the schema for its approved XML dialect, mzData. A code fragment representing a very simple spectrum in mzData would look something like this:
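(This is a reconstructed sketch of the fragment, with most attributes trimmed and the encoded data replaced by a placeholder.)

```xml
<spectrum id="1">
  <mzArrayBinary>
    <data precision="32" endian="little" length="3">[Base64 data]</data>
  </mzArrayBinary>
  <intenArrayBinary>
    <data precision="32" endian="little" length="3">[Base64 data]</data>
  </intenArrayBinary>
</spectrum>
```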
The data tags enclose Base64-encoded floating-point numbers. The syntax is similar to the older Generalized Analytical Markup Language (GAML), but the structure of mzData is defined so that it can only be applied to mass spectrometry data.
Not to be outdone, a group from the Institute for Systems Biology has created its own standard, a dialect called mzXML, as part of its ironically named “MS glossolalia” project. This dialect would record the same information in a slightly different way:
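(Again a reconstructed sketch, with attributes trimmed and a placeholder for the encoded data.)

```xml
<scan num="1" peaksCount="3">
  <peaks precision="32" byteOrder="network" pairOrder="m/z-int">[Base64 data]</peaks>
</scan>
```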
This format has similarities to mzData, the main difference being that rather than separating the elements of each tuple into two items, the tuples are interleaved into a single block of Base64-encoded numbers, with the ordering within each pair designated by the value of the pairOrder attribute.
The two formulations have a similar structure, mzData being somewhat more verbose. However, as one of the W3C's XML commandments says, “Terseness in XML markup is of minimal importance.” More important from the point of view of adoption is another commandment: “It shall be easy to write programs which process XML documents,” and it is here that both standards run into trouble. The choice of Base64 encoding may have seemed like a good way to reduce file sizes, but differences between Base64 and XML in the handling of white space often make the data difficult to parse. The current dialects of mzData and mzXML try to skirt the issue by excluding white space from the encoded text, but that means an mzXML or mzData document considered valid by XML and Base64 rules may be declared invalid by a dialect-specific parser.

This also means that XSLT, the accepted means for translating one markup language into another, cannot easily be used to translate between mzXML and mzData; a platform-specific script is required to convert between the two representations and check for white-space dialect compliance.
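The white-space wrinkle is easy to demonstrate. In the sketch below (Python, with made-up peak values), a decoder following the Base64 rules happily ignores an embedded line break, while a dialect parser that forbids white space in the encoded text would have to reject the very same document:

```python
import base64
import struct

# Three (m/z, intensity) tuples with made-up values.
mz = [445.12, 446.13, 447.12]
inten = [120.0, 35.0, 12.0]

# mzData-style: m/z and intensity arrays are encoded separately.
mz_b64 = base64.b64encode(struct.pack("<3f", *mz)).decode("ascii")

# mzXML-style: the tuples are interleaved into one block (pairOrder="m/z-int").
pairs = [v for tup in zip(mz, inten) for v in tup]
pair_b64 = base64.b64encode(struct.pack("<6f", *pairs)).decode("ascii")

# Base64 itself tolerates embedded white space: a standard decoder
# recovers identical bytes from the wrapped text ...
wrapped = mz_b64[:8] + "\n" + mz_b64[8:]
assert base64.b64decode(wrapped) == base64.b64decode(mz_b64)

# ... but a dialect parser that excludes white space from the encoded
# text must reject `wrapped`, even though XML and Base64 both allow it.
dialect_valid = "\n" not in wrapped and " " not in wrapped
assert not dialect_valid
```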
Is all lost? Can we never agree? Fortunately, Rob Craig, my chief programmer, and Patrick Lacosse from the Department of Medicine bioinformatics group at Laval University rigged up compatibility with both mzXML and mzData in one of our open source projects, adapting a C++ mzXML interpreter available at http://sashimi.sourceforge.net to read both standards.
The self-documenting nature of XML, the fact that only a few tags enclose useful data for a particular application, and the availability of some good open source code mean that so long as the standard dialects are at least valid XML, it is pretty easy to use either one (or both). Here’s hoping for a better solution down the road — and, in the meantime, no further additions to our mix of standards.
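The idea can be sketched in a few lines of Python (the real readers are C++, and the documents here are hypothetical, heavily trimmed stand-ins for the two dialects; byte-order details are glossed over):

```python
import base64
import struct
import xml.etree.ElementTree as ET

# Trimmed-down stand-ins for the two dialects; real files carry many
# more attributes and namespaces.
MZDATA = """<spectrum>
  <mzArrayBinary><data precision="32">{mz}</data></mzArrayBinary>
  <intenArrayBinary><data precision="32">{it}</data></intenArrayBinary>
</spectrum>"""

MZXML = """<scan peaksCount="2">
  <peaks precision="32" pairOrder="m/z-int">{pk}</peaks>
</scan>"""

def b64_floats(values):
    # Little-endian 32-bit floats, Base64 encoded.
    return base64.b64encode(struct.pack("<%df" % len(values), *values)).decode("ascii")

def read_spectrum(text):
    """Return a list of (m/z, intensity) tuples from either dialect,
    keying on the few tags that enclose the actual data."""
    root = ET.fromstring(text)
    peaks = root.find(".//peaks")
    if peaks is not None:  # mzXML-style: one interleaved block
        raw_bytes = base64.b64decode(peaks.text)
        raw = struct.unpack("<%df" % (len(raw_bytes) // 4), raw_bytes)
        return list(zip(raw[0::2], raw[1::2]))
    arrays = []  # mzData-style: separate m/z and intensity arrays
    for d in root.findall(".//data"):
        raw_bytes = base64.b64decode(d.text)
        arrays.append(struct.unpack("<%df" % (len(raw_bytes) // 4), raw_bytes))
    return list(zip(*arrays))

# Both dialects yield the same spectrum.
a = read_spectrum(MZDATA.format(mz=b64_floats([445.12, 446.13]),
                                it=b64_floats([120.0, 35.0])))
b = read_spectrum(MZXML.format(pk=b64_floats([445.12, 120.0, 446.13, 35.0])))
```

Because only the data-bearing tags matter, a reader like this does not care which dialect wrapped the spectrum, which is exactly what makes supporting both tolerable.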
Ron Beavis has developed instrumentation and informatics for protein analysis since joining Brian Chait’s group at Rockefeller University in 1989. He currently runs his own bioinformatics design and consulting company, Beavis Informatics, based in Winnipeg, Canada.