Researchers at the Medical University of South Carolina were eager to begin work under the $157 million Proteomics Initiative that the National Heart, Lung, and Blood Institute launched in 2002, but soon came up against a stumbling block. “We needed some kind of a standard or data structure so that we could load all this 2D gel information, and we couldn’t find it,” said Romesh Stanislaus, a post-doc in Jonas Almeida’s bioinformatics research group at MUSC.
So Stanislaus and his colleagues soon got to work creating their own XML-based data standard for 2D gel electrophoresis experiments. A preliminary version of their efforts, AGML (Annotated Gel Markup Language), is currently available through the research group’s website, at http://bioinformatics.musc.edu/agml. The data standard is only one component of a full suite of proteomics software that the group plans to release through its main site (http://bioinformatics.musc.edu/) sometime in the next few months. While the timeline isn’t definite, the demand for the software suite is: “We need those tools because we need to analyze the data that is pouring in right now,” Stanislaus said.
XML-based standards are cropping up for all flavors of experimental data, from the well-established MAGE-ML for microarrays and BSML and AGAVE for genomic sequence, to the Proteomics Standards Initiative’s PSI-MI XML for molecular interactions and BioPAX for pathways. So it’s a bit surprising that there wasn’t already some sort of effort underway for 2D gel experiments by the time the MUSC team went looking for one. Stanislaus speculated that the complexity of 2D gel electrophoresis experiments, which involve a large number of experimental parameters and a healthy dose of variability, may have discouraged previous efforts. He added that the MUSC team plans to work with other standards efforts to “make a united effort to create a standard data structure for 2D gels.”
AGML was designed to be user friendly, Stanislaus said, with the user being a wet lab biologist — not a bioinformaticist. So far, MUSC proteomics researchers and their collaborators are using AGML, and the response has been favorable, according to Stanislaus. “They are very happy with it because the interaction they have with the creation process is minimal. They just have to enter the [experimental] information and the file that is generated by PDQuest or Phoretix, and that’s it — AGML is created.”
Like other XML-based data standards, an AGML file includes the experimental data itself along with metadata, or information about that data; in this case, the experimental parameters used to conduct the experiment. An AGML file, Stanislaus said, “contains all the data that is necessary for another researcher to run the same experiment.” In addition, he noted, AGML converts the image data from 2D gel experiments into quantifiable, computable numbers, “and now the numbers can be manipulated any way you want” using standard analytical tools. This capability has come in very handy for MUSC biologists, he said, because “we have all the information stored in AGML, and we can model, we can visualize, we can analyze, we can do statistics — all those things on that data structure.”
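To make the idea of data plus metadata in one file concrete, here is a minimal sketch in Python of how such a document might be read and summarized with standard tools. The element and attribute names (gel, experiment, parameter, spot, volume) and the sample values are hypothetical placeholders chosen for illustration; they are not taken from the published AGML schema.

```python
import statistics
import xml.etree.ElementTree as ET

# Hypothetical AGML-like document. Element names, attribute names, and
# values are illustrative assumptions, NOT the actual AGML schema.
SAMPLE = """\
<gel id="gel-001">
  <experiment>
    <parameter name="pH_range">3-10</parameter>
    <parameter name="stain">silver</parameter>
  </experiment>
  <spots>
    <spot id="1" x="102.5" y="340.0" volume="1520.0"/>
    <spot id="2" x="410.2" y="122.8" volume="310.0"/>
    <spot id="3" x="233.9" y="501.4" volume="870.0"/>
  </spots>
</gel>
"""


def load_gel(xml_text):
    """Return (experimental parameters, list of spot records)."""
    root = ET.fromstring(xml_text)
    # Metadata: the parameters someone would need to repeat the experiment.
    params = {p.get("name"): p.text for p in root.iter("parameter")}
    # Data: each spot reduced to plain numbers.
    spots = [
        {
            "id": s.get("id"),
            "x": float(s.get("x")),
            "y": float(s.get("y")),
            "volume": float(s.get("volume")),
        }
        for s in root.iter("spot")
    ]
    return params, spots


def summarize(spots):
    """Compute a few basic statistics over spot volumes."""
    volumes = [s["volume"] for s in spots]
    return {
        "count": len(volumes),
        "total": sum(volumes),
        "mean": statistics.mean(volumes),
        "largest": max(spots, key=lambda s: s["volume"])["id"],
    }


if __name__ == "__main__":
    params, spots = load_gel(SAMPLE)
    print(params)
    print(summarize(spots))
```

Once the spots are ordinary numbers in a list, any general-purpose statistics or plotting package could take over from here, which is the kind of flexibility Stanislaus describes.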
So far, AGML converters are available for Bio-Rad’s PDQuest and Nonlinear Dynamics’ Phoretix, because these were considered to be the most commonly used software packages for 2D gel analysis.
By design, the standard is a work in progress. Stanislaus said that he and his colleagues are still hashing out a MIAME-like “minimum set of information” necessary for a 2D gel experiment, and as that develops, AGML can change. “The structure is quite comprehensive, but it’s never a done deal,” he said. “Proteomics keeps changing and the information keeps coming, so AGML is good in that if something new comes along and we have to change the structure, that’s no problem, we just go ahead and add it.”
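One reason an XML-based structure can absorb additions of this kind is that a reader which looks up only the elements it knows will generally keep working when new ones appear. The sketch below illustrates that general XML property in Python; the element names (gel, spot, massSpec, scanner) are hypothetical and do not describe how AGML itself is versioned or extended.

```python
import xml.etree.ElementTree as ET

# Two hypothetical revisions of a gel document. All element and attribute
# names are illustrative assumptions, not the actual AGML schema.
OLD_DOC = '<gel><spot id="1" volume="1520.0"/></gel>'
NEW_DOC = (
    "<gel>"
    '<spot id="1" volume="1520.0"><massSpec peptideCount="12"/></spot>'
    '<scanner model="hypothetical"/>'
    "</gel>"
)


def spot_volumes(xml_text):
    """Read spot volumes, ignoring any elements this reader does not know."""
    root = ET.fromstring(xml_text)
    return {s.get("id"): float(s.get("volume")) for s in root.iter("spot")}


if __name__ == "__main__":
    # The same reader handles both revisions without modification.
    print(spot_volumes(OLD_DOC))
    print(spot_volumes(NEW_DOC))
```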
The MUSC researchers are also building a 2D gel-centric proteomics ontology that will add a higher level of semantic capability to the standard. Stanislaus said the ontology is still in its very early stages of development, “but that’s where we are going.”
Stanislaus said he expects that AGML and the ontology will be welcomed by the proteomics community because they are the first such efforts in the field. Indeed, the “formal” standardization route through the OMG or I3C may not be necessary if researchers embrace the fruits of the MUSC lab. As with the Gene Ontology and other community-based efforts, Stanislaus said, “once it becomes a standard and a critical mass of researchers starts using it, everybody is forced to use it, even if they don’t want to.” In that way, he added, “standards are kind of democratic.”
— BT