CAMBRIDGE, Mass.--As pharmaceutical companies generate reams of DNA data using microarray chips, demand for software that can analyze gene expression has become more urgent. A team of researchers from the Whitehead Institute at MIT's Center for Genome Research, and colleagues from Dana-Farber Cancer Institute and Dartmouth Medical School, spent the past 18 months developing one such tool. The project was supported by Affymetrix, Bristol-Myers Squibb, Millennium Pharmaceuticals, and the US National Institutes of Health. The resulting Genecluster software is the first in a generation of tools that the institute said it intends to design for microarray data analysis. The team reported its research in the March 16 issue of the Proceedings of the National Academy of Sciences.
Todd Golub, a research scientist at Whitehead and coauthor of the report, told BioInform that, for the pharmaceutical industry to make sense of data generated by microarray chips, "It's likely there's not going to be a single solution." Golub said a suite of bioinformatics approaches will be necessary to understand gene expression data effectively.
Genecluster, he said, is just one offering of several to come that will help speed analysis of the enormous amount of data emerging from genomic research projects worldwide.
Whitehead's new software employs a self-organizing map--an algorithm widely used for mining financial data or other particularly large or messy datasets--to group genes with similar expression patterns within minutes. Golub explained, "We want to be able to have a method that does not require the user to have some previous biological understanding of what the data are going to look like."
That's because employing a method that demands no preconceptions can reveal the most striking patterns of data, whether they were expected or not, he contended. "What we currently understand about gene function and regulatory networks is very limited," he said, adding, "Every day we learn more about unexpected functions of particular genes--genes with newly discovered functions that are different from what they were originally thought to be."
Genecluster works by grouping together different genes that behave similarly in an experiment. Explained Golub, "Genes that are tracking together, moving up and down together during a biological experiment, are grouped together."
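The report does not describe Genecluster's internals, so the following is only a generic sketch of how a self-organizing map can group genes whose expression rises and falls together. It is not Genecluster's code. The function names, the grid size, the learning-rate schedule, and the sample data are all assumptions made for illustration.

```python
import math
import random


def normalize(profile):
    """Rescale one gene's profile to mean 0 and variance 1, so that the
    shape of the curve, not its absolute level, drives the grouping."""
    mean = sum(profile) / len(profile)
    centered = [x - mean for x in profile]
    sd = math.sqrt(sum(c * c for c in centered) / len(centered))
    return [c / sd for c in centered] if sd > 0 else centered


def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_node(nodes, profile):
    """Index of the map node whose reference profile is closest."""
    return min(range(len(nodes)), key=lambda i: sq_dist(nodes[i], profile))


def train_som(profiles, rows, cols, iterations=2000, seed=0):
    """Train a rows x cols map (hypothetical schedule: learning rate and
    neighborhood radius both shrink linearly over the run)."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Start each node at a randomly chosen gene profile.
    nodes = [list(rng.choice(profiles)) for _ in grid]
    start_radius = max(rows, cols) / 2
    for t in range(iterations):
        remaining = 1 - t / iterations
        rate = 0.5 * remaining
        radius = start_radius * remaining
        sample = rng.choice(profiles)
        winner = grid[nearest_node(nodes, sample)]
        # Pull the winning node, and any grid neighbors within the
        # current radius, toward the sampled profile.
        for i, pos in enumerate(grid):
            if math.hypot(pos[0] - winner[0], pos[1] - winner[1]) <= radius:
                nodes[i] = [w + rate * (x - w) for w, x in zip(nodes[i], sample)]
    return nodes


def group_genes(expression, rows=1, cols=2):
    """Map each gene name to the node it lands on after training."""
    names = list(expression)
    profiles = [normalize(expression[name]) for name in names]
    nodes = train_som(profiles, rows, cols)
    groups = {}
    for name, profile in zip(names, profiles):
        groups.setdefault(nearest_node(nodes, profile), []).append(name)
    return groups


# Made-up expression levels at four time points (illustrative only).
expression = {
    "up1": [1, 2, 3, 4], "up2": [10, 20, 30, 41], "up3": [2, 4, 6, 9],
    "down1": [4, 3, 2, 1], "down2": [40, 31, 20, 10], "down3": [9, 6, 4, 2],
}
print(group_genes(expression))
```

Normalizing each profile first is a design choice in this sketch: it lets a weakly expressed gene and a strongly expressed gene land in the same group when their curves have the same shape. Nothing in the procedure asks the user what patterns to expect, which is the property Golub emphasized.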
The expectation is that genes that are grouped together in such a way are likely to have some biological relationship. "The underlying hypothesis is that if we understand these genetic networks better, we'll understand how they result in causing disease," Golub said. He contended that treating different genes in the human genome not as 100,000 distinct entities, but as groups that function together to perform a specific task, such as turning a normal cell into a tumor cell, is an important first step to understanding biological processes.
Golub said that once all genes in the human genome are identified, the main task will be to find patterns in gene expression data. "Strategies that look for patterns you expect by asking a specific question such as, is this pattern there, yes or no, are already possible with current technology and microarray data," Golub observed, "but what's challenging is to ask, are there genetic patterns in biological data that I never would have thought of looking for." Finding those, he asserted, will be the most exciting application of microarray chips.
Golub said Genecluster can run on any computer platform and is compatible with any brand of microarray chip. "Our data are based on Affymetrix arrays, but the software is not unique to that," he explained.
Tools to come
Researchers at the institute are already at work developing additional tools. "One of the goals going forward is to expand the types of analysis tools that would be useful in interpreting these kinds of complex data," Golub said. Future tools could include methods for using microarray data for classification purposes, such as identifying which of many thousands of genes in the genome are most useful for classifying different types of tumors.
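Golub did not say how such classification tools would choose genes. As one hedged illustration, and not a description of the institute's method, genes could be ranked by how cleanly their expression separates two sample classes. The scoring formula, the function names, and the data below are assumptions for the sketch.

```python
from statistics import mean, stdev


def separation_score(values_a, values_b):
    """Difference of class means divided by the summed class spreads.
    A large absolute value means the gene separates the classes well.
    Each class needs at least two samples for stdev to be defined."""
    spread = stdev(values_a) + stdev(values_b)
    # Small constant (an assumption) avoids division by zero.
    return (mean(values_a) - mean(values_b)) / (spread + 1e-9)


def rank_genes(expression, labels, class_a, class_b):
    """Return gene names ordered from most to least class-separating."""
    scores = {}
    for gene, values in expression.items():
        in_a = [v for v, lab in zip(values, labels) if lab == class_a]
        in_b = [v for v, lab in zip(values, labels) if lab == class_b]
        scores[gene] = abs(separation_score(in_a, in_b))
    return sorted(scores, key=scores.get, reverse=True)


# Made-up data: six samples, three per hypothetical tumor type.
labels = ["typeA", "typeA", "typeA", "typeB", "typeB", "typeB"]
expression = {
    "gene1": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],  # differs sharply by type
    "gene2": [3.0, 1.0, 4.0, 2.0, 3.5, 1.5],  # overlaps across types
}
print(rank_genes(expression, labels, "typeA", "typeB"))
```

The genes at the top of such a ranking would be the natural candidates to feed into a classifier, though a real tool would also need to guard against genes that score well by chance among thousands tested.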
The Whitehead team also continues to work on developing "the best way to visualize these types of data, because if you have thousands of genes and many different samples it's not clear how to best visualize that," Golub added.
Beyond the difficulty that noisy data pose for anyone building a microarray analysis tool, simply merging the scientific cultures of computer scientists and biologists in one operation has been a challenge, Golub said. To address that, Whitehead has established a new research group that "brings together people skilled in informatics, biology, and genetics," he said.
Distribution terms for Genecluster remain to be finalized by the institute.