An oft-cited challenge in analyzing microarray experiments is the so-called “extreme dimensionality” of the data that arises when tens of thousands of variables are interrogated across relatively few samples.
Researchers have developed an arsenal of clustering and classification methods to address this issue, and countless papers have been published on the subject with the goal of improving methods to correlate groups of samples with particular phenotypes of interest.
One of the latest entries into the field is actually not new at all, but a modification of a little-known algorithm called FOREL (short for Formal Element) that was developed in Russia in the 1970s.
Andrey Ptitsyn, a researcher at Colorado State University’s Bioinformatics Center, recently used a version of FOREL to reanalyze several publicly available microarray data sets and gain a bit more insight into the data.
In particular, he reanalyzed data from a well-known 2003 study published in Nature Genetics that applied Gene Set Enrichment Analysis (GSEA) to skeletal muscle gene expression data in order to identify groups of genes that distinguished diabetic patients from non-diabetic patients.
The FOREL-based approach — an unsupervised clustering method — revealed that the data set was actually a bit more complex than originally thought. Rather than dividing neatly into two discrete clusters of diabetic and non-diabetic patients, the data fell into six clusters — one cluster that contained all the “metabolically sound” individuals, and the remaining five clusters placed along a continuum of disease progression.
Ptitsyn described the resulting cluster pattern as a “comet tail,” with any individual’s distance from the core cluster of “healthy” individuals coinciding with the occurrence of diabetes.
“I think it reflects the reality very much, because we have discovered the path that everyone goes on when they step on this metabolic syndrome progression,” Ptitsyn said. “We don’t know the speed that everybody goes … but we seem to have uncovered the path, and it’s already an interesting result — position along this line is very indicative of the progression toward metabolic syndrome.”
Ptitsyn noted that rather than place these samples along a continuum, the original GSEA approach artificially “sliced” them into two classes — proof that an unsupervised approach like FOREL may be better at determining “the natural consistency of the data.”
Ptitsyn, who came to know FOREL’s developer while studying at Novosibirsk State University in Russia, described it as “a brilliant heuristic idea terribly ahead of time, and thus impractical for any real-life application for decades.”
Ptitsyn said that FOREL was designed to work well in extreme dimensionality, but there was little demand for that capability at the time. In addition, the algorithm was very computationally demanding and therefore impractical for most applications.
“By the time microarrays began to fill the databases, FOREL was practically forgotten,” Ptitsyn said.
Nevertheless, he said that he kept the approach “in the toolbox” and decided to give it a try after seeing a presentation in 2003 that discussed the GSEA data set.
“One thing I was always taught to check is whether what you think is naturally a group is truly a group,” he said. “What you see as a single class of diabetics — is it really one class? Maybe there are two different molecular mechanisms leading to the same diagnosis that may or may not be distinguishable, or maybe there are three.”
Ptitsyn said that he decided to try the unsupervised clustering approach “to see whether the perceived group of diabetics is really homogeneous or whether it has different sub-classes” — a suspicion that ultimately turned out to be true.
Ptitsyn and his colleagues published a paper on their findings in BMC Genomics earlier this year.
More recently, Ptitsyn released a freely available implementation of the FOREL-based algorithm through his website.
“You don’t need to read obscure papers in Russian to get it,” he said.
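For readers curious about the basic idea, the sketch below illustrates FOREL as it is commonly described in textbooks: a sphere of fixed radius is placed on a data point, moved repeatedly to the centroid of the points it contains until its membership stops changing, and the captured points are then set aside as one cluster. This is not Ptitsyn's modified version or his released implementation. The function names, the `radius` and `max_iter` parameters, and the toy data are assumptions made purely for illustration.

```python
import math


def centroid(vectors):
    """Return the coordinate-wise mean of a non-empty list of points."""
    dim = len(vectors[0])
    return tuple(sum(v[d] for v in vectors) / len(vectors) for d in range(dim))


def forel(points, radius, max_iter=100):
    """Group points with a basic FOREL-style procedure (illustrative only).

    Returns a list of clusters, each a list of indices into `points`.
    """
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        # Start the sphere on the first point not yet assigned to a cluster.
        center = points[remaining[0]]
        members = []
        for _ in range(max_iter):
            inside = [i for i in remaining
                      if math.dist(points[i], center) <= radius]
            # Stop once the sphere captures the same points twice in a row.
            if not inside or inside == members:
                break
            members = inside
            # Move the sphere to the centroid of the points it contains.
            center = centroid([points[i] for i in members])
        if not members:
            # Safety net (e.g. a negative radius): keep the seed point alone.
            members = [remaining[0]]
        clusters.append(members)
        taken = set(members)
        remaining = [i for i in remaining if i not in taken]
    return clusters


# Toy data: two well-separated groups of three points each.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(forel(samples, radius=2.0))
```

In this sketch the number of clusters is not fixed in advance. It follows from the chosen radius, with a smaller radius producing more, tighter clusters. Each pass rescans all unassigned points, which hints at why the method was considered computationally demanding when it was first proposed.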