Iconix Pharmaceuticals' flagship technology is a toxicogenomics database called DrugMatrix that includes gene expression data for rats treated with 630 different drugs and toxic chemicals. The company markets the database itself, as well as services that analyze customer data, but it has set its sights on eventually expanding its business model to include diagnostic development.
In order to do so, Iconix needed to find a way to reduce the vast amount of information in DrugMatrix down to small sets of genes that will provide a quick — and accurate — measurement of biological response to chemical treatments. These short gene lists, which Iconix calls "drug signatures," are expected to be "more practical" for diagnostic developers than complete expression patterns, according to Georges Natsoulis, senior director of advanced technology at Iconix.
For questions involving biological response to drugs or chemicals, "what you'd like to get is an answer that does not involve a pattern of 10,000 genes," Natsoulis said, "because if you have a pattern of 10,000 genes, then you need to measure 10,000 genes in your next sample to find out if it has or doesn't have the properties" that you're looking for.
But striking the right balance between a shorter gene list and a high-performance classification method took a bit of work. Natsoulis said that his team was not only looking to reduce the number of genes needed to identify a drug response, but was also looking for genes that had a linear relationship to each other so that their weighted expression values could be easily interpreted by a biologist.
Opting for a linear classifier required a trade-off, Natsoulis said. "If you use a nonlinear classifier, it would be slightly better at classifying, but then it wouldn't be interpretable — so we were striving for some sort of balance where we'd get the best classifier possible, but make it short and interpretable."
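To make the interpretability argument concrete, a linear signature can be read as a weighted sum over a handful of genes. The sketch below is illustrative only: the gene names, weights, and bias are hypothetical and are not taken from DrugMatrix or the Genome Research paper.

```python
# Hypothetical three-gene signature: names, weights, and bias are
# made-up illustrative values, not real DrugMatrix data.
SIGNATURE = {"gene_a": 1.8, "gene_b": -0.9, "gene_c": 0.4}
BIAS = -0.5


def contributions(expression, weights=SIGNATURE):
    """Per-gene terms of the score; genes outside the signature are ignored."""
    return {gene: w * expression.get(gene, 0.0) for gene, w in weights.items()}


def signature_score(expression, weights=SIGNATURE, bias=BIAS):
    """Bias plus the weighted sum of expression values for the signature genes."""
    return bias + sum(contributions(expression, weights).values())


def classify(expression):
    """A positive score calls the sample a member of the class."""
    return signature_score(expression) > 0


# gene_z is measured but not part of the signature, so it has no effect.
sample = {"gene_a": 1.0, "gene_b": 0.5, "gene_c": 2.0, "gene_z": 7.0}
print(contributions(sample))
print(signature_score(sample), classify(sample))
```

Because each gene contributes one additive term, a biologist can see which genes push a sample toward or away from the class. A nonlinear model offers no such per-gene breakdown.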
In a study published in the May issue of Genome Research, Natsoulis and colleagues from Iconix, SPSS, and the University of California, Berkeley, evaluated six different linear (t-rank, sparse linear programming, sparse logistic regression) and nonlinear (decision tree, Gaussian kernel support vector machine, and neural net) supervised classification algorithms to determine the best approach for meeting that goal.
According to the Genome Research paper, the t-rank algorithm was the worst performer, resulting in signatures that were "readily interpretable, [but] mediocre classifiers of the groups and only successful on the groups with many, large and distinct gene expression changes." At the upper end of the performance scale, Gaussian kernel SVMs and neural nets proved to be the best classifiers, but didn't offer the interpretability that Iconix required.
"Neural nets are good classifiers, but you can't interpret the results, you can't reduce it to practice — so what good are they?" Natsoulis said. He added that decision trees were "quite simple to represent," and were "in a sense interpretable," but didn't work well for cross-validation, which is used to check whether an algorithm performs as well on the test set as it does on the training set. The decision tree worked well in the training set, he said, "but not at all on the test data."
Iconix determined that the sparse linear programming and sparse logistic regression algorithms offered the best mix of performance and interpretability. The company developed a modified form of these methods for its own internal use in deriving drug signatures, Natsoulis said.
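The gap between training and test performance that Natsoulis describes for decision trees is what cross-validation is designed to expose. The following is a minimal sketch of k-fold cross-validation on synthetic data. The "memorizer" model is a deliberately overfit stand-in chosen for illustration; it is not a decision tree and not one of the algorithms from the paper.

```python
import random


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    idx = list(range(n))
    for fold in range(k):
        test = idx[fold::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        yield train, test


def accuracy(predict, X, y, idx):
    """Fraction of the samples in idx that predict() labels correctly."""
    return sum(predict(X[i]) == y[i] for i in idx) / len(idx)


def cross_validate(fit, X, y, k=5):
    """Return mean training accuracy and mean held-out accuracy over k folds."""
    train_scores, test_scores = [], []
    for train, test in k_fold_indices(len(X), k):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        train_scores.append(accuracy(predict, X, y, train))
        test_scores.append(accuracy(predict, X, y, test))
    return sum(train_scores) / k, sum(test_scores) / k


def fit_memorizer(X_train, y_train):
    """Overfit stand-in: predict the label of the nearest training sample."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in X_train]
        return y_train[dists.index(min(dists))]
    return predict


# Synthetic data: the labels are pure noise, so nothing real can be learned.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(40)]
y = [rng.randint(0, 1) for _ in range(40)]

train_acc, test_acc = cross_validate(fit_memorizer, X, y, k=5)
print("train accuracy:", train_acc, "held-out accuracy:", test_acc)
```

The memorizer scores perfectly on the samples it was trained on, while its held-out accuracy reflects only chance agreement with the noise labels. That is the pattern of working "in the training set" but "not at all on the test data."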
The company has identified several hundred drug signatures so far. "We don't claim we found the shortest possible signature," Natsoulis noted, but the company is confident that it has found a way to derive signatures that are both "short and offer almost maximum performance."
One advantage of the company's method, Natsoulis said, is that it can further reduce a gene set "without a loss in performance" by using the genes from an initial signature computation as input for a second round of calculations. In an example described in the paper, a first round of analysis reduced a gene expression pattern with more than 9,000 genes down to 29 genes. The second round further trimmed the signature down to a lean 7 genes.
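The two-round idea can be sketched with an L1-penalized (sparse) logistic regression, in which the penalty drives most weights to exactly zero. This is not Iconix's modified method, which the article does not describe. The solver (proximal gradient descent), the synthetic data, the penalty values, and the choice of a stronger penalty in the second round are all assumptions made for illustration.

```python
import math
import random


def soft_threshold(v, t):
    """Shrink v toward zero by t; values inside [-t, t] become exactly zero."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp()
    return 1.0 / (1.0 + math.exp(-z))


def fit_sparse_logistic(X, y, lam, step=0.1, iters=200):
    """L1-penalized logistic regression via proximal gradient descent."""
    n = len(X)
    p = len(X[0]) if X else 0
    w, b = [0.0] * p, 0.0
    for _ in range(iters):
        err = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - yi
               for row, yi in zip(X, y)]
        grad = [sum(e * row[j] for e, row in zip(err, X)) / n for j in range(p)]
        # Gradient step, then soft-thresholding, which zeroes small weights.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
        b -= step * sum(err) / n
    return w, b


def select(X, y, genes, lam):
    """Fit once and keep only the genes whose weight is nonzero."""
    w, _ = fit_sparse_logistic(X, y, lam)
    return [g for g, wj in zip(genes, w) if wj != 0.0]


def restrict(X, genes, keep):
    """Reduce the expression matrix to the columns named in keep."""
    cols = [genes.index(g) for g in keep]
    return [[row[c] for c in cols] for row in X]


# Synthetic data: 50 hypothetical genes, of which only g0 and g1 set the label.
rng = random.Random(1)
genes = [f"g{j}" for j in range(50)]
X = [[rng.gauss(0, 1) for _ in genes] for _ in range(60)]
y = [1 if row[0] - row[1] > 0 else 0 for row in X]

round1 = select(X, y, genes, lam=0.05)
# Second round sees only the round-one genes (stronger penalty is an assumption).
round2 = select(restrict(X, genes, round1), y, round1, lam=0.15)
print(len(genes), "->", len(round1), "->", len(round2))
```

By construction, the second-round signature can only contain genes that survived the first round. Whether it is shorter in practice depends on the data and on how the second round is tuned.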
Natsoulis said that this feature should prove especially valuable as Iconix pursues its ambition to move toward diagnostics. "It's a huge reduction in the search space," he said, noting that for a 10,000-gene set, there are 100 million possible two-gene combinations that could be explored using a brute-force approach. A three-gene signature space would be just "too large," he said. "That's why we need to use these algorithms."
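The scale of the brute-force search is easy to check. In the short calculation below, the 100 million figure corresponds to counting ordered gene pairs (roughly 10,000 times 10,000). If order is ignored, the count is about half that. Either way, the jump to three-gene combinations is what makes exhaustive search impractical.

```python
from math import comb

n_genes = 10_000

ordered_pairs = n_genes * (n_genes - 1)  # (gene A, gene B) differs from (B, A)
unordered_pairs = comb(n_genes, 2)       # order ignored
unordered_triples = comb(n_genes, 3)

print(f"ordered pairs:     {ordered_pairs:,}")      # 99,990,000
print(f"unordered pairs:   {unordered_pairs:,}")    # 49,995,000
print(f"unordered triples: {unordered_triples:,}")  # 166,616,670,000
```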
As far as the company's diagnostic plans go, "we don't envision creating an assay platform," Natsoulis said. Iconix is more likely to build intellectual property around the drug signatures it identifies in its database, which it could then develop in partnership with a diagnostic shop.
Since the paper was published, Natsoulis said, Iconix has found that quite a few of the drug signatures it has identified "overlap," meaning that they contain many of the same genes. This presents an opportunity, he said, for the creation of devices that could distinguish many different end points using only a small set of genes.
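One way to picture the overlap is to treat each signature as a set of genes and measure how much the sets share. The signatures and gene names below are hypothetical placeholders, and the Jaccard index is only one possible overlap measure; the article does not say how Iconix quantifies overlap.

```python
from itertools import combinations

# Hypothetical signatures for three end points (made-up gene names).
signatures = {
    "endpoint_a": {"g1", "g2", "g3", "g4"},
    "endpoint_b": {"g2", "g3", "g5"},
    "endpoint_c": {"g3", "g4", "g6"},
}


def jaccard(a, b):
    """Shared genes divided by all genes appearing in either signature."""
    return len(a & b) / len(a | b)


for (name_a, sig_a), (name_b, sig_b) in combinations(signatures.items(), 2):
    print(name_a, name_b, sorted(sig_a & sig_b), round(jaccard(sig_a, sig_b), 2))

# A single panel covering every end point needs only the union of the genes.
panel = set().union(*signatures.values())
separate_total = sum(len(s) for s in signatures.values())
print(len(panel), "genes in a shared panel versus", separate_total, "measured separately")
```

The more the signatures overlap, the smaller the shared panel is compared with measuring each signature on its own.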
He stressed, however, that this is still early-stage research. While it may prove attractive to diagnostic developers in the future, "we're not there yet," he said.
-- Bernadette Toner