With intent to harmonize the cacophonous chords of microarray data being produced by different bands of researchers, biologists and bioinformaticists met for the Microarray Gene Expression Database Group’s (MGED) fourth annual meeting last week.
The confab ended with many strings still to be tuned, but attendees — many of them first-timers who were still completing their PhDs when a tiny cluster of array researchers formed MGED at the European Bioinformatics Institute in 1999 — agreed that it had been far more useful than other recent microarray meetings.
Crispin Miller, a young bioinformaticist from the Paterson Institute in Manchester, UK, expressed his satisfaction with the way microarray researchers were “coming together as a community” to address the major issues they face, including finding a standard platform for exchanging data, figuring out how to make sense of it, and discovering more fruitful ways to design and implement experiments.
“There are so many problems in [microarray] bioinformatics that people know they have to come together to solve them,” he said.
People who attended sessions of MGED’s working groups saw that solving these problems collectively is about as easy as getting Olympic judges to agree on a skating performance. The working group that is developing the common data exchange format, Microarray Gene Expression Markup Language (MAGE-ML), got bogged down in heated discussions of how ontologies, controlled vocabularies for specific types of experiments, would be included in the format. Another working group, tentatively titled the “data normalization” working group, spent much of its time discussing the basics such as what constitutes raw data and what constitutes normalization, according to participants in both sessions. In fact, the normalization group’s name is at issue as well, with Paul Spellman of the University of California, Berkeley, suggesting it be called the “data processing” group because microarray data cannot really be normalized (it doesn’t reliably fit into a normal distribution). The only thing that did not seem to stir much controversy was the Minimum Information About a Microarray Experiment (MIAME) standard, a list of must-have items in every description of a microarray experiment that MGED approved last year and published in the December 2001 issue of Nature Genetics.
Still, many researchers said they were not making their databases “MIAME-compliant,” which was the official goal of the MIAME standard. John Quackenbush, the free-spirited TIGR researcher who led the normalization working group, indicated that he was not trying to make his microarray database hew rigidly to the rigors of MIAME. Jennifer Weller, who develops the Virginia Bioinformatics Institute’s array database, joked, “I am occasionally compatible, but I am not compliant.” Weller noted that the VBI database will include other information besides arrays, and that although the array descriptions will include most of the MIAME features, there are some aspects of MIAME that she did not agree with and therefore would not implement. (At last year’s MGED meeting held at Stanford, MIAME was dubbed by some the “maximum information about a microarray experiment” because of its exhaustive list of items to be included in the experiment.)
This lack of total agreement, however, was in keeping with the ‘open-source,’ grassroots spirit of MGED, said Terry Gaasterland, the Rockefeller University computational biologist who chaired the meeting. MGED’s strength comes in “being open about the fact that the subject is a moving target and the problems keep changing form,” she said.
Unlike other meetings, where bioinformatics experts and biologists can often speak over one another’s heads in the jargon of their respective specialties, this meeting was marked by collaboration between the two groups, accompanied by patience on the part of the bioinformatics and statistics experts in explaining their numerical alchemy to the biologists. This collaboration came out of what Gaasterland sees as the two groups’ shared “overarching vision: figuring out ways to use the data for microarray experiments to decode the systematic workings of the cell.”
Dear Old MAGE
One of the most painstaking and perhaps most useful sessions of the conference was one in which members of an MGED working group laid out the fine details of MAGE-ML that they have hammered out over the past year. Spellman, the working group’s leader, presented a series of 14 diagrams showing scores of interconnecting boxes that make up each “package” in the language. Each package covers a specific class of information, such as array design, array fabrication, experiment, and higher level analysis. These categories rest on the MIAME standard.
“The goal is to be able to communicate microarray data from one group of people to another group within the same file format,” said Spellman.
Spellman and other members of the working group then reviewed exactly what type of data goes in each box, such as the description of the array and its features, the sample, the treatments performed on samples, the experimental parameters, or the hardware used in the experiment.
MAGE-ML is designed to allow a researcher to trace a result on a sample back to the original source and see the sequence of treatments performed and the precise experimental conditions. Ontology plays a prominent role in MAGE (which may explain why it is a source of debate), as development of controlled vocabularies for different types of experiments and samples can allow different data files to be compared more reliably.
“We would like people to submit ontologies,” said working group member Michael Miller of Rosetta BioSoftware, “so we can compare data and say, ‘this tissue is from the same part of the rat as [this other] one.’”
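The traceability idea can be illustrated with a minimal Python sketch. It assumes each material simply records the material it was derived from and the treatment that produced it. The class, field, and function names here are hypothetical illustrations, not elements of the MAGE-ML format, and the sample names are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Material:
    """A biological material (hypothetical; not a MAGE-ML class)."""
    name: str
    # Parent material this one was derived from; None for the original source.
    derived_from: Optional["Material"] = None
    # Treatment that produced this material from its parent, if any.
    treatment: Optional[str] = None


def trace_to_source(material: Material) -> List[str]:
    """Follow parent links and return the history, original source first."""
    steps = []
    current = material
    while current is not None:
        if current.treatment is None:
            steps.append(current.name)
        else:
            steps.append(f"{current.name} (via {current.treatment})")
        current = current.derived_from
    steps.reverse()
    return steps


# Invented example chain: source tissue -> RNA extract -> labeled extract.
source = Material("rat liver")
extract = Material("RNA extract", derived_from=source, treatment="RNA extraction")
labeled = Material("labeled extract", derived_from=extract, treatment="Cy3 labeling")
history = trace_to_source(labeled)
```

Walking the parent links from the labeled extract yields the source tissue followed by each treatment step in order, which is roughly the kind of question a shared format would let one lab ask of another lab's data.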
MAGE-ML also gives researchers a standard format to submit results of clustering analysis, and thus enables users to search for clustering results of a given set of genes, or similar clusters — a feature that goes even further than the MIAME standard.
Given that MIAME and MAGE-ML are voluntary, however, some attendees openly wondered how they were going to convince the biologists they work with to enter all the required data into a MAGE-ML-based database. Spellman suggested that bioinformaticists build a set of forms, or use ArrayExpress, which a group from the European Bioinformatics Institute unveiled later at the meeting.
ArrayExpress in Beta
ArrayExpress is a “front-end” interface that includes standard simple web forms for submitting MIAME-compliant data. “The data input happens in MAGE-ML, and the data output in HTML,” explained Jaak Vilo, of the EBI’s microarray informatics team. The EBI made ArrayExpress publicly available for download February 12th, but is still fine-tuning it, according to Alvis Brazma, head of the EBI’s microarray informatics working group.
Once the system is out of this “beta” phase, the group will be taking submissions from researchers around the world to fill the database with microarray expression information, make it publicly accessible, and curate it. The curators will flag experiments that are MIAME-compliant, said Vilo.
Currently, the database includes human and yeast data from EMBL and S. pombe data from the Sanger Center. But efforts are underway to add TIGR’s array data; information on Affymetrix chip design (including oligonucleotide probes, as mentioned last week in BioArray News); mouse array data from HGMP; other data from Sanger; and mosquito array data from EMBL.
Meanwhile, other public data repositories are on their way toward becoming MIAME-compliant (or supportive, depending on your semantic proclivities). National Center for Biotechnology Information bioinformaticist Alex Lash discussed NCBI’s progress on its Gene Expression Omnibus (GEO) data repository (www.ncbi.nih.gov/geo), which was started in late 2000 and now includes 1,084 microarray samples. As soon as MAGE-ML is sufficiently refined, NCBI is planning to integrate this file exchange format into the database as well.
Despite the meeting’s decidedly non-commercial roots, Jason Goncalves of Iobion Informatics gave a presentation on the Gene Traffic database, which, unlike ArrayExpress, is designed to be a local repository of microarray information, but which also comports with MIAME in the setup of its data forms. “If your local database is MIAME-compliant, you can easily export to a MIAME-compliant external database,” said Goncalves.
Currently, however, the microarray community is still uploading the concepts of MIAME and MAGE-ML into its mental hard drive. The MGED IV meeting is likely to have accelerated these efforts and, in the meantime, given those who don’t already hit the pillow every night dreaming about bioinformatics much to think about. “As a non-bioinformaticist, this conference was very informative,” said presenter Steffan Hopf of Harvard Medical School. “If I were on a microarray, I would be very up-regulated now.”