Simon Says: Study Your Stats


By Aaron J. Sender

Richard Simon, a 30-year NIH veteran, didn’t like what he was seeing. Here was a revolution, an opportunity to burst open the study of genes and their relation to disease like never before. Yet many of the gene expression papers populating journals lacked a proper dose of statistical rigor. So Simon, trained as a computer scientist and mathematician, developed BRB ArrayTools, a freely available gene expression analysis package packed with statistical wallop.

“It was out of frustration that so much of the microarray analyses that were being done and published were inadequate and not using good statistical principles,” Simon says. For example, biologists often go cluster crazy. “There is a common misconception that cluster analysis is the method to use for all problems,” says Simon. “Well, clustering algorithms will always give you clusters. So there is a tremendous risk of people publishing papers erroneously saying this is not one disease, but two.”
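Simon's point that clustering algorithms always return clusters can be illustrated with a small sketch. The code below is not part of BRB ArrayTools; it is a plain k-means written for illustration, run on made-up data (60 hypothetical samples, 5 hypothetical genes) drawn from a single uniform distribution, so there is no real subgroup to find.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each point joins its nearest center.
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters


# Hypothetical data: pure noise, one population, no true subgroups.
data_rng = random.Random(42)
noise = [tuple(data_rng.random() for _ in range(5)) for _ in range(60)]

clusters = kmeans(noise, k=2)
sizes = [len(c) for c in clusters]
print("cluster sizes:", sizes)
```

Asked for two groups, the algorithm splits the noise into two groups. The output alone says nothing about whether "one disease" is really two.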

Ideally, every project team would include an experienced statistician. But since such statisticians are in short supply, “we’ve been trying to put that statistical knowledge, experience, and expertise in a package to help biologists analyze their own data,” says Simon.

To hide the complexity from the biologist, but keep the statistics powerful, he built BRB ArrayTools as a Microsoft Excel add-in.

“There are a lot of statisticians developing interesting analysis tools. But for the most part they tend to require knowing some complicated statistical language, like R or S+,” says Simon. “We wanted it to be easy to use by biologists.”

Using BRB ArrayTools is as easy as dropping data into an Excel spreadsheet. Pull-down menus allow the user to tap into the sophisticated statistical tools concealed in the back end.

Simon also wanted to provide interactive guidance for the correct use of statistical tools. “Excel represented a way which we could build dialogue boxes to interact with the user,” he says. For example, a dialogue box reminds users to identify multiple samples from the same patients, a common oversight.
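The kind of check that such a reminder prompts could look roughly like the sketch below. It is an assumption about the idea, not BRB ArrayTools' actual code; the function name, the sample labels, and the patient IDs are all hypothetical.

```python
from collections import defaultdict


def find_repeated_patients(samples):
    """Return {patient_id: [sample_ids]} for patients with more than one sample.

    `samples` is a list of (sample_id, patient_id) pairs.
    """
    by_patient = defaultdict(list)
    for sample_id, patient_id in samples:
        by_patient[patient_id].append(sample_id)
    return {p: ids for p, ids in by_patient.items() if len(ids) > 1}


# Hypothetical experiment annotation: four arrays, three patients.
samples = [
    ("array_01", "P1"),
    ("array_02", "P2"),
    ("array_03", "P1"),  # second sample from patient P1
    ("array_04", "P3"),
]

repeats = find_repeated_patients(samples)
print(repeats)
```

Arrays flagged this way are not independent observations, so an analysis that treats every array as a separate patient would overstate its sample size.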

“There are a lot of nuances in using statistical methods that could be used wrong,” says Simon. “So we try to remind the user what the assumptions are and give them the tools for doing it right.”

Excel is also portable, platform agnostic, and already on most desktops. “We did not want to be tied to anybody’s particular database structure,” says Simon. “The Excel spreadsheet format represents a universal flat file kind of format.” In response to perceived limitations on the amount of data the program can handle, Simon says, “We can easily analyze up to 250 microarrays with up to about 35,000 genes on each.”

Simon, who heads NCI’s Biometric Research Branch, is now working with Affymetrix to make sure that BRB ArrayTools is compatible with the chip leader’s format. “I expect we will be licensing very shortly their application programming interface that will allow users to import their Affy data automatically,” he says. “But there won’t be any charge for the user. We will never do anything that requires the user to license things from some other source.”

The next version of BRB ArrayTools, due late this summer, will also allow Affymetrix human GeneChip users to analyze the two-chip set as one. “We’re building a virtual chip that will utilize both the A chip and the B chip data,” says Simon.

His lab is constantly evaluating new algorithms and tools for upcoming versions, deciding “which methods really are useful and which are just trendy,” Simon says. “We don’t depend on the customer’s money. So to some extent our point of view is that customers don’t know what they need, except biologically they know best about what their problem is,” he says. “So we don’t have to buy into a fad.” For example, Simon left neural network approaches out. He says they’re inappropriate for microarray data.

“We’re trying to make it as painless as possible,” Simon says. “But microarray data analysis is not easy.”

