Little-Known Algorithm May Help Resolve Dimensionality Challenge of Array Analysis

An oft-cited challenge in analyzing microarray experiments is the so-called “extreme dimensionality” of the data that arises when tens of thousands of variables are interrogated across relatively few samples.
Researchers have developed an arsenal of clustering and classification methods to address this issue, and countless papers have been published on the subject with the goal of improving methods to correlate groups of samples with particular phenotypes of interest.
One of the latest entries into the field is actually not new at all, but a modification of a little-known algorithm called FOREL (short for Formal Element) that was developed in Russia in the 1970s.
Andrey Ptitsyn, a researcher at Colorado State University’s Bioinformatics Center, recently used a version of FOREL to reanalyze several publicly available microarray data sets and gain a bit more insight into the data.
In particular, he reanalyzed data from a well-known 2003 Nature Genetics study that applied Gene Set Enrichment Analysis to skeletal muscle gene expression data to identify groups of genes that distinguished diabetic patients from non-diabetic patients.
The FOREL-based approach — an unsupervised clustering method — revealed that the data set was actually a bit more complex than originally thought. Rather than dividing neatly into two discrete clusters of diabetic and non-diabetic patients, the data fell into six clusters — one cluster that contained all the “metabolically sound” individuals, and the remaining five clusters placed along a continuum of disease progression.
Ptitsyn described the resulting cluster pattern as a “comet tail,” with any individual’s distance from the core cluster of “healthy” individuals coinciding with the occurrence of diabetes.
“I think it reflects the reality very much, because we have discovered the path that everyone goes on when they step on this metabolic syndrome progression,” Ptitsyn said. “We don’t know the speed that everybody goes … but we seem to have uncovered the path, and it’s already an interesting result — position along this line is very indicative of the progression toward metabolic syndrome.”
Ptitsyn noted that rather than place these samples along a continuum, the original GSEA approach artificially “sliced” them into two classes — proof that an unsupervised approach like FOREL may be better at determining “the natural consistency of the data.”
Ptitsyn, who came to know FOREL’s developer while studying at Novosibirsk State University in Russia, described it as “a brilliant heuristic idea terribly ahead of time, and thus impractical for any real-life application for decades.”
Ptitsyn said that FOREL was designed to work well in extreme dimensionality, but there was little demand for that capability at the time. In addition, the algorithm was very computationally demanding and therefore impractical for most applications.
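For readers unfamiliar with the method, the sketch below illustrates the classic FOREL procedure as it is commonly described: place a hypersphere of fixed radius on an unassigned sample, move its center to the centroid of the samples it contains until membership stops changing, declare those samples a cluster, and repeat on whatever is left. This is not Ptitsyn's modified version, whose details are not given here. The function name, the Euclidean distance measure, and the `radius` and `max_iter` parameters are assumptions made for illustration.

```python
import math

def forel(points, radius, max_iter=100):
    """Illustrative sketch of classic FOREL (not the modified version
    discussed in the article). Returns clusters as lists of point indices."""
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        seed = remaining[0]
        center = points[seed]          # start the sphere on an unassigned point
        members = []
        for _ in range(max_iter):
            # Points currently inside the hypersphere around the center.
            inside = [i for i in remaining
                      if math.dist(points[i], center) <= radius]
            if not inside:             # guard against floating-point edge cases
                inside = [seed]
            if inside == members:      # membership is stable: converged
                break
            members = inside
            # Move the center to the centroid of the enclosed points.
            dim = len(center)
            center = tuple(sum(points[i][d] for i in members) / len(members)
                           for d in range(dim))
        clusters.append(members)
        taken = set(members)
        remaining = [i for i in remaining if i not in taken]
    return clusters

# Toy example: two well-separated groups of 2-D points.
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(forel(samples, radius=2.0))
```

Note that the number of clusters is not specified in advance; it falls out of the chosen radius. Every pass recomputes distances from the moving center to all unassigned samples, which gives a rough sense of why the approach would have been costly on the hardware of the 1970s.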

“By the time microarrays began to fill the databases, FOREL was practically forgotten,” Ptitsyn said.
Nevertheless, he said that he kept the approach “in the toolbox” and decided to give it a try after seeing a presentation in 2003 that discussed the GSEA data set.
“One thing I was always taught to check is whether what you think is naturally a group is truly a group,” he said. “What you see as a single class of diabetics — is it really one class? Maybe there are two different molecular mechanisms leading to the same diagnosis that may or may not be distinguishable, or maybe there are three.”
Ptitsyn said that he decided to try the unsupervised clustering approach “to see whether the perceived group of diabetics is really homogeneous or whether it has different sub-classes” — a suspicion that ultimately turned out to be true. 
Ptitsyn and his colleagues published a paper on their findings in BMC Genomics earlier this year.  
More recently, Ptitsyn released a freely available implementation of the FOREL-based algorithm through his website.  
“You don’t need to read obscure papers in Russian to get it,” he said.
