Laboratories around the world are churning out gigabytes of notoriously messy microarray data, so why is a bioinformatics team at the University of Pittsburgh adding fake gene expression data to the mix? Because the last few years have produced an explosion of methods for analyzing microarray gene expression data, “but no standardization or metric by which those methods can be compared,” said James Lyons-Weiler, an assistant professor at the Center for Oncology Informatics at the University of Pittsburgh Cancer Institute.
Lyons-Weiler said his group supports a large number of biologists who want to analyze their microarray data, but find themselves stymied by the diversity of approaches available to them. “Every month there are another few papers that come out describing new methods,” he said, which makes it difficult for researchers to determine what the best approach may be for analyzing their particular data set. In response, he developed the Gene Expression Data Simulator (http://bioinformatics.upmc.edu/GE2/index.html), an online tool that spits out artificial gene expression data. Unlike data from genuine microarray experiments, where the underlying biology, experimental variability, and other factors that affect the output are unknown to the researcher, “anyone who uses [the data simulator] knows everything about the sources of variability and controls them,” Lyons-Weiler said.
The simulator is useful for biologists who want to compare several available methods to determine which is most appropriate for their experiment, or even to simulate a microarray experiment before running it so they can adjust their experimental parameters, Lyons-Weiler said. In addition, bioinformatics developers working on new gene expression analysis tools can use the simulator as “a common frame of reference” to benchmark their methods against alternative techniques.
Using the simulator, researchers can adjust many of the factors known to affect microarray data, including the number of differentially expressed genes, their intensity, the level of background noise, and other variables. Because the true values of those factors are known, the data set then serves as a yardstick against which to measure the effectiveness of different methods for normalization, differential gene analysis, and classification.
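To make the yardstick idea concrete, here is a minimal sketch in Python of how simulated data with a known answer can be used to score competing methods. It is not the Pittsburgh simulator or its model: the function names (simulate_expression, recall_at_k), the two-group design, the Gaussian noise, and the baseline intensity range are all assumptions made for illustration. The sketch plants a known set of shifted genes, ranks genes with two simple scores, and checks how many planted genes each score recovers.

```python
import random
import statistics


def simulate_expression(n_genes=200, n_de=20, shift=2.0, noise_sd=1.0,
                        n_per_group=5, seed=0):
    """Return (control, treated, de_genes) for a toy two-group experiment.

    control and treated are lists of per-gene sample lists (log-scale values).
    de_genes is the set of gene indices that were truly shifted.
    All parameters are hypothetical knobs, not the real simulator's options.
    """
    rng = random.Random(seed)
    de_genes = set(rng.sample(range(n_genes), n_de))
    control, treated = [], []
    for g in range(n_genes):
        baseline = rng.uniform(6.0, 12.0)  # assumed log2 intensity range
        effect = shift if g in de_genes else 0.0
        control.append([rng.gauss(baseline, noise_sd)
                        for _ in range(n_per_group)])
        treated.append([rng.gauss(baseline + effect, noise_sd)
                        for _ in range(n_per_group)])
    return control, treated, de_genes


def mean_difference(a, b):
    """Difference in group means (treated minus control)."""
    return statistics.mean(b) - statistics.mean(a)


def welch_t(a, b):
    """Welch's t statistic; assumes at least one sample has nonzero variance."""
    se = (statistics.variance(a) / len(a)
          + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(b) - statistics.mean(a)) / se


def recall_at_k(score, control, treated, de_genes):
    """Fraction of true DE genes among the top k genes by |score|,
    where k is the number of planted DE genes."""
    ranked = sorted(range(len(control)),
                    key=lambda g: abs(score(control[g], treated[g])),
                    reverse=True)
    top = set(ranked[:len(de_genes)])
    return len(top & de_genes) / len(de_genes)


if __name__ == "__main__":
    control, treated, truth = simulate_expression()
    for name, score in [("mean difference", mean_difference),
                        ("Welch t", welch_t)]:
        print(name, recall_at_k(score, control, treated, truth))
```

Rerunning the comparison while varying noise_sd, shift, or n_per_group shows how each score degrades as conditions get harder, which is the kind of side-by-side comparison that real data, with its unknown truth, cannot provide.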
Lyons-Weiler noted that the simulated data provides only “coarse-grained” features of real microarray experiments. Finer-scale sources of variability like hybridization kinetics of DNA are not available, he said, “but we’re confident that we have the beginnings of a framework in which we can map out neighborhoods in the method space.” In other words, he said, “We’re not trying to find the number-one best method for normalization or classification … but we’re trying to separate the wheat from the chaff.”
In the future, Lyons-Weiler said he plans to add definitions to the simulator’s selection terms to aid biologists in sorting out parameter choices, which can be as puzzling as, for example, “nonlinear, intensity-related heteroscedasticity.” For now, he said, even researchers unfamiliar with the finer points of microarray analysis can use the simulator for a hands-on course in how the choice of analytical methods affects their results. A few runs of the simulator are “the equivalent to thousands of years of running microarray experiments,” Lyons-Weiler said.
The University of Pittsburgh team also hosts an online gene expression analysis tool (http://bioinformatics.upmc.edu/GE2/GEDA.html) that offers a range of methods for normalization, selection of differentially expressed genes, and classification. Admitting that he’s “guilty of developing a few new methods” of his own, Lyons-Weiler said the analysis tool offers J5 and PPST (permutation percentile separability test), two new techniques he developed for finding differentially expressed genes that he was able to refine by test-driving on the simulator.
Lyons-Weiler said that usage of the gene expression analysis tool and the data simulator has grown steadily since they were made available online last November. The site averages around 590 hits per day and has around 1,400 regular visitors.
The University of Pittsburgh team hasn’t published any papers on the simulator yet, but Lyons-Weiler said he expects to publish several papers on his new analytical methods over the next few months. Source code for the analysis tool and the simulator is available upon request. “I’ll know the project is a success when people begin requesting source code and setting up mirrors,” he said. “I’d like to see this system added to and built upon.”