Armed with a $750,000 Phase II SBIR from the National Institute of General Medical Sciences, statistical software firm Insightful plans to develop an easy-to-use version of a classification method that shows promise for biomedical data analysis, but that has been inaccessible to most researchers.
In a departure for the firm, Insightful has set out to produce an open-source version of the method that will run in both its commercial S-Plus statistical software and the open-source R programming language.
The method, called least angle regression, or LARS, was originally proposed by Stanford University’s Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani in a paper published in the Annals of Statistics in 2004.
It was a “remarkable article,” according to Tim Hesterberg, senior research scientist at Insightful and principal investigator on the NIH grant, “because it tied together a number of procedures that had been in the literature and provided a very efficient way to do the necessary calculations.”
LARS is well suited for variable selection applications, Hesterberg said, particularly in cases like microarray analysis, proteomics, and safety analysis, where the number of variables to be analyzed far exceeds the number of observations.
While there are a few academic implementations of LARS, such as one called lars developed by Hastie and Efron and another called glmpath developed by Hastie and Mee Young Park, these packages are “limited in scope and robustness,” according to Insightful.
As a result, the approach has not yet gained wide acceptance among most biological researchers.
Hesterberg is aiming to change that, however. “We intend to produce something that is easier to use, with a more consistent interface,” he said. In addition, the project will improve upon currently available LARS packages “by producing software that is more numerically accurate, and supports additional models,” he said. Specifically, the original LARS software was limited to linear regression, but Hesterberg and his colleagues aim to extend it to logistic regression, survival models, and other nonlinear regression models.
Michael O’Connell, director of life science solutions for Insightful, said that logistic regression is key for biomarker analysis, clinical studies, and many other biomedical applications in which there is a yes/no response, “particularly in the safety analysis area, where you either have an adverse event or you don’t and you’re trying to predict [outcomes].”
Hesterberg said that Insightful also plans to extend the approach to support factor predictor variables, rather than only continuous ones, so that models can include “categorical predictors.”
O’Connell explained that the combination of logistic regression and categorical variables in LARS should make it a very powerful analysis tool.
He cited the case in which a researcher is analyzing treatment and control microarrays, “and you’re trying to find if you have certain markers that have classifications on them as categories. For example, they might be different pathways — the presence or absence of genes in different pathways,” he said. “You’d have the categorical variables as part of the predictor space and you could predict into a binary outcome like presence/absence in treatment versus control.”
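For readers who want to see what O’Connell’s example looks like as data, the following is a minimal Python sketch, not Insightful’s code. It assumes NumPy, uses made-up pathway labels and expression values, and the helper names dummy_code and logistic_probability are hypothetical. It shows a factor being expanded into 0/1 indicator columns alongside a continuous predictor, and the logistic function that turns a linear score into a probability for a yes/no outcome.

```python
import numpy as np

def dummy_code(levels):
    """Expand a factor into 0/1 indicator columns.

    The first level (in sorted order) is dropped and acts as the reference.
    """
    categories = sorted(set(levels))
    columns = [[1.0 if v == cat else 0.0 for v in levels] for cat in categories[1:]]
    return categories, np.array(columns).T

def logistic_probability(X, intercept, coefs):
    """Probability of the 'yes' outcome under a logistic regression model."""
    return 1.0 / (1.0 + np.exp(-(intercept + X @ coefs)))

# Hypothetical data: one categorical and one continuous predictor per sample.
pathway = ["apoptosis", "cell_cycle", "apoptosis", "signaling", "cell_cycle"]
expression = np.array([0.2, 1.5, -0.3, 0.8, 1.1])

categories, indicators = dummy_code(pathway)
design = np.column_stack([expression, indicators])  # mixed predictor matrix

# Illustrative coefficients only; a real analysis would estimate them from data.
probabilities = logistic_probability(design, intercept=-0.5,
                                     coefs=np.array([1.0, 0.7, -0.4]))
print(design)
print(probabilities)
```

In a design matrix like this, the indicator columns that come from one factor belong together, which is one reason categorical predictors need special handling in a variable selection method.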
O’Connell added that the original LARS software packages were limited to continuous variables and standard regression, “and in reality, when you’re working with biomarkers and when you’re working in safety data, clinical data, that’s not really the norm. You’re always confronted with binary outcomes and logistic regression, you’re confronted with survival and time-to-event data as responses, and then you typically have a mixture of continuous and categorical predictor variables.”
He added that Hesterberg’s work “is very important in taking this great idea and this great algorithm and actually making it applicable to the types of data that we get in clinical and discovery.”
The company is initially targeting biomarker discovery, safety analysis, and proteomic analysis for LARS. Hesterberg said that his team is working with another NIH-funded research group at Insightful that is developing a biomarker-discovery package to include LARS in that software [BioInform 06-23-06].
A Slow Learner
LARS departs from other classification methods like stepwise regression by using an idea called “slow learning,” Hesterberg said.
“Stepwise regression picks the single best variable and then jumps to the least squares solution with that variable, so it’s an all-or-nothing approach,” he said. “What LARS does is pick the first variable, but makes small changes, moves in the direction of the least squares solution with that variable, but then stops partway as soon as another variable becomes as good as the first.”
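The contrast Hesterberg describes can be sketched in a few lines of Python. This is an illustration of the first step only, under stated assumptions: it uses NumPy, synthetic data, and hypothetical function names (standardize, first_lars_step), and it is not the S+GLARS or lars implementation. It picks the variable most correlated with the response, then computes how far to move toward that variable’s least squares fit before a second variable becomes equally correlated with the residual.

```python
import numpy as np

def standardize(X, y):
    """Center y; center each column of X and scale it to unit length."""
    Xc = X - X.mean(axis=0)
    return Xc / np.linalg.norm(Xc, axis=0), y - y.mean()

def first_lars_step(X, y):
    """Return (first variable, LARS step, full least squares step, next variable).

    X and y are assumed to be standardized already.
    """
    c = X.T @ y                    # correlation of each variable with the response
    j = int(np.argmax(np.abs(c)))  # the single best variable, as in stepwise
    C = abs(c[j])
    u = np.sign(c[j]) * X[:, j]    # unit direction toward the least squares fit on j
    a = X.T @ u                    # how fast each correlation changes along u

    # Stepwise regression would move the full distance C along u.
    # LARS stops at the smallest positive step gamma at which another
    # variable k ties with j, i.e. |c[k] - gamma * a[k]| == C - gamma.
    gamma, entering = C, None
    for k in range(X.shape[1]):
        if k == j:
            continue
        for num, den in ((C - c[k], 1 - a[k]), (C + c[k], 1 + a[k])):
            if den > 1e-12 and 1e-12 < num / den < gamma:
                gamma, entering = num / den, k
    return j, gamma, C, entering

# Synthetic example: 20 observations, 50 candidate variables.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(20, 50))
y_raw = 2.0 * X_raw[:, 3] - 1.5 * X_raw[:, 7] + rng.normal(scale=0.5, size=20)

X, y = standardize(X_raw, y_raw)
first, gamma, full_step, entering = first_lars_step(X, y)
residual = y - gamma * np.sign(X[:, first] @ y) * X[:, first]

print("first variable:", first, "next variable:", entering)
print("fraction of the full least squares step taken:", gamma / full_step)
```

The printed fraction is below 1, which is the “stops partway” behavior in the quote. A full LARS fit would continue from here, moving in a direction shared by both variables until a third ties, and so on.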
The advantage of this approach in practical terms, O’Connell said, is that in areas like microarray or biomarker analysis where there are many potential predictor variables, “you’re allowing each of those [variables] to come in and play a role in the establishment of the predictive model for the clinical outcome of interest.”
As a result, O’Connell said, “We’re giving the predictive set a real chance to show itself, rather than taking just the greedy all-or-nothing response of the usual regression approaches.”
Hesterberg said that LARS also uses a variation of a method called ridge regression to improve its predictions.
Going Open Source
Insightful has already released a prototype library called S+GLARS (generalized least angle regression) under the GPL 2.0 open-source license. The library runs in both S-Plus and R, and the company is seeking collaborators to contribute code.
While the company has previously collaborated with open source projects, like Bioconductor, “this is the first time that we have started a research project with the idea that the outcome of it would be an open-source package,” Hesterberg said.
O’Connell said that the open source license is in line with the company’s next release of S-Plus, version 8, which will be launched in the first half of 2007. S-Plus 8 will include a “new packaging system … that is very much aligned with the packaging system in R that will enable a lot more convenient cross-talk between R and S-Plus,” he said.
From a product-development perspective, opening the project up to the open source community “lets us be a whole lot more agile,” Hesterberg said. “We can create something much faster, make use of work that other people have done, improve on it and let them improve on our work. We will create good, high-quality software much faster than if we try to do it all ourselves.”
Insightful’s business model for the open-source software includes some consulting, as well as the possibility of implementing some of the core algorithms in lower level languages that would be available only through S-Plus.
In addition, O’Connell pointed out that the firm can include specific implementations of the algorithm in its specialized software, such as its Safety Miner package, “so the core algorithm will be open source, but the application to signal detection will be something that’s available through our commercial product.”
Further information on Insightful’s LARS project is available here.
Insightful’s S+GLARS library is available here.