Following on the creation of a formal diagnostics division earlier this year, Ciphergen Biosystems last week released an upgrade of its Biomarker Patterns Software (BPS) product. Company officials described the software as a key part of Ciphergen’s strategy for ushering its SELDI ProteinChip platform into the clinic.
William Rich, CEO of Ciphergen, told BioInform that the company considers its analysis software to be just as crucial as its flagship ProteinChip platform in its bid to bring proteomics-based diagnostic tools to market. In order to make accurate — and regulated — biomarker-based diagnostics a reality, Rich said, “the pattern generator has to be highly reproducible and has to generate a broad spectrum of proteins, and our SELDI technology does that. Secondly,” he added, “the ability to analyze and find the subsets of patterns that are truly predictive — which is the software part of the game — is equally important.”
The software upgrade comes at a crucial juncture for proteomics-based diagnostics: Since the publication of an ovarian cancer study by Correlogic, the FDA, and the NCI in the Lancet in 2002 [BioInform 02-25-02], interest in the field has exploded, and a number of companies have developed their own bioinformatics methods for classifying patients based on the protein patterns in blood samples. More recently, however, a number of critics have raised doubts about the statistical rigor and reproducibility of these methods, and in February, when the FDA sent a letter to Correlogic questioning the regulatory status of its software-based diagnostic tool [BioInform 03-01-04], the debate only gathered steam.
Nevertheless, hopes are high in the proteomics community that diagnostic tests based on multiple biomarkers are not only possible, but imminent, prompting companies like Ciphergen to improve upon existing methods. “Ciphergen is very optimistic that we’re going to have tests very soon built around robust algorithms, multivariate algorithms, that are really much, much more powerful than today’s diagnostic tests,” Rich said. With single-marker protein tests for cancer and other diseases being ineffective at best, Rich said that multi-marker diagnostics with high predictive accuracy are “the Holy Grail that everyone is shooting for.”
First Peaks, then Patterns
The latest version of Ciphergen’s proteomics pattern detection software, BPS 5.0, addresses some of the criticisms of previous approaches, which relied on genetic algorithms to sort large sets of mass spectrometry data into cancerous or non-cancerous classes. One fault with these methods, Rich said, is that “they don’t look at the protein patterns themselves; what they look at are thousands of data points along the baseline.” Ciphergen’s approach first determines “discrete peaks” for proteins that are differentially expressed, with “significant” p-values, between samples, and then runs that subset of data points through its multivariate analysis method. In addition, the new version of the software contains a new supervised learning algorithm called TreeNet that “scores” each of the variables in the dataset to help increase the predictive accuracy, Rich said.
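The two-stage idea Rich describes, reducing each spectrum to discrete peaks and keeping only those that differ significantly between classes before any multivariate modeling, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Ciphergen's method: the article does not say which statistical test BPS uses, so a permutation test stands in here, and the function names, the 0.01 cutoff, and the synthetic intensities are all hypothetical.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for a difference in mean peak intensity."""
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign class membership at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value above zero.
    return (hits + 1) / (n_permutations + 1)


def select_peaks(spectra, labels, alpha=0.01):
    """Return indices of peaks whose intensity differs between the two classes.

    spectra: one list of peak intensities per sample (peaks already detected).
    labels:  1 for disease, 0 for control (hypothetical coding).
    """
    selected = []
    for j in range(len(spectra[0])):
        disease = [row[j] for row, y in zip(spectra, labels) if y == 1]
        control = [row[j] for row, y in zip(spectra, labels) if y == 0]
        if permutation_p_value(disease, control, seed=j) < alpha:
            selected.append(j)
    return selected


# Synthetic data (an assumption for illustration): 20 samples, 5 peaks,
# with peak 2 elevated in the disease group.
rng = random.Random(42)
labels = [1] * 10 + [0] * 10
spectra = []
for y in labels:
    row = [rng.gauss(10.0, 1.0) for _ in range(5)]
    if y == 1:
        row[2] += 10.0
    spectra.append(row)

selected = select_peaks(spectra, labels)
print("peaks passed to the multivariate step:", selected)
```

Only the columns listed in `selected` would then be handed to a classifier, which is what keeps the model from fitting thousands of raw baseline data points.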
Acknowledging the “raging debate” surrounding bioinformatics methods for proteomics pattern detection, Rich said that BPS addresses the issue of “preanalytical bias,” which he deemed “one of the most under-recognized problems out there in this field.” One of the limitations of genetic algorithms for proteomics data sets, Rich said, is overfitting due to “not enough samples and too many data points,” a typical situation for proteomics experiments, which may have 100 samples or fewer, with tens of thousands of data points associated with them. While prior approaches claim sensitivity and specificity in the range of 98 to 99 percent, Rich cautioned that this is likely a symptom of overfitting. “The problem is that genetic algorithms, and neural networks in general, need huge amounts of data, huge amounts of samples, in order to prevent overfitting,” he said.
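The “not enough samples and too many data points” problem is easy to reproduce with data that contains no signal at all. The sketch below is an illustration under stated assumptions, not any vendor's algorithm: it generates 40 samples with 2,000 random features and exhaustively searches for the single-threshold rule that best separates the training labels. The sizes, the names `best_stump` and `accuracy`, and the use of Gaussian noise are all hypothetical choices.

```python
import random


def best_stump(X, y):
    """Search every (feature, threshold, polarity) for the best training accuracy."""
    n = len(y)
    total_ones = sum(y)
    best = (0.0, None, None, None)  # (accuracy, feature, threshold, polarity)
    for j in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][j])
        ones_left = 0
        for k in range(1, n):
            ones_left += y[order[k - 1]]
            zeros_left = k - ones_left
            ones_right = total_ones - ones_left
            # Polarity +1 predicts class 1 above the threshold, class 0 below.
            correct = zeros_left + ones_right
            threshold = (X[order[k - 1]][j] + X[order[k]][j]) / 2
            for polarity, c in ((1, correct), (-1, n - correct)):
                if c / n > best[0]:
                    best = (c / n, j, threshold, polarity)
    return best


def accuracy(rule, X, y):
    """Fraction of samples the rule classifies correctly."""
    _, j, threshold, polarity = rule
    hits = 0
    for row, label in zip(X, y):
        above = row[j] > threshold
        predicted = int(above) if polarity == 1 else int(not above)
        hits += predicted == label
    return hits / len(y)


rng = random.Random(0)
n_features = 2000

# Training set: 40 samples, pure noise, labels unrelated to the features.
X_train = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(40)]
y_train = [0, 1] * 20

# Fresh samples drawn the same way, standing in for "unknown data".
X_test = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(500)]
y_test = [rng.randint(0, 1) for _ in range(500)]

rule = best_stump(X_train, y_train)
train_acc = accuracy(rule, X_train, y_train)
test_acc = accuracy(rule, X_test, y_test)
print(f"training accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

Because the search gets thousands of chances to find a feature that happens to line up with 40 labels, the training accuracy comes out well above chance while the held-out accuracy stays near 50 percent. That gap is the overfitting symptom Rich describes.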
Kenna Mawk, Ciphergen’s software product manager, explained that TreeNet — a classification and regression tree algorithm developed by Ciphergen’s partner, Salford Systems — addresses the overfitting problem. TreeNet “randomly picks variables to build each successive tree and then learns from that experience, so you really reduce the danger of overfitting, and what you get down to is a set of variables that are the most important,” she said. The previous version of BPS, built upon Salford Systems’ CART decision tree algorithm, might create a model with 100 percent sensitivity and specificity, Mawk said, but those results could drop drastically when run on unknown data. The addition of TreeNet may reduce the sensitivity and specificity of the model, she said, “but it comes up with results that are much more in line with the training model.”
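Mawk's description, in which each successive tree is built from randomly picked variables and learns from the trees before it, resembles the general family of stochastic boosting methods. The sketch below illustrates that family only. TreeNet is proprietary and its internals are not described in this article, so nothing here should be read as Salford Systems' implementation. The one-split trees, squared-error residual fitting, learning rate, feature-sampling scheme, and importance score are all assumptions chosen for brevity.

```python
import random


def fit_stump(X, residuals, features):
    """Best least-squares one-split tree among the given candidate features."""
    n = len(residuals)
    total = sum(residuals)
    best = None  # (feature, threshold, left value, right value, gain)
    for j in features:
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += residuals[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # cannot split between tied values
            right_sum = total - left_sum
            # Reduction in squared error versus predicting the overall mean.
            gain = left_sum**2 / k + right_sum**2 / (n - k) - total**2 / n
            if best is None or gain > best[4]:
                best = (j, (lo + hi) / 2, left_sum / k, right_sum / (n - k), gain)
    return best


def boost(X, y, n_rounds=50, learning_rate=0.1, n_sampled=10, seed=0):
    """Fit small trees in sequence, each on the residuals of the model so far."""
    rng = random.Random(seed)
    n, p = len(y), len(X[0])
    base = sum(y) / n
    pred = [base] * n
    stumps, importance = [], [0.0] * p
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        features = rng.sample(range(p), min(n_sampled, p))  # random variable subset
        stump = fit_stump(X, residuals, features)
        if stump is None:
            continue
        j, threshold, left, right, gain = stump
        stumps.append(stump)
        importance[j] += gain  # per-variable score: accumulated error reduction
        for i in range(n):
            pred[i] += learning_rate * (left if X[i][j] <= threshold else right)
    return (base, stumps, learning_rate), importance


def predict(model, row):
    """Classify one sample by thresholding the boosted score at 0.5."""
    base, stumps, learning_rate = model
    score = base
    for j, threshold, left, right, _ in stumps:
        score += learning_rate * (left if row[j] <= threshold else right)
    return 1 if score >= 0.5 else 0


def make_data(rng, n, n_features=30, informative=3):
    """Synthetic samples: one informative feature, the rest pure noise."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[informative] += 4.0 * label
        X.append(row)
        y.append(label)
    return X, y


rng = random.Random(1)
X_train, y_train = make_data(rng, 60)
X_test, y_test = make_data(rng, 200)

model, importance = boost(X_train, y_train)
top_feature = max(range(len(importance)), key=importance.__getitem__)
held_out_accuracy = sum(predict(model, r) == t for r, t in zip(X_test, y_test)) / len(y_test)
print("highest-scoring variable:", top_feature)
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

The small learning rate means no single tree can memorize the training set, and the accumulated `importance` list plays the role of the per-variable “scores” mentioned earlier. Comparing training and held-out accuracy, as in the last lines, is the check Mawk alludes to when she says results should stay “in line with the training model.”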
Salford Systems has granted Ciphergen exclusive rights to market its data-mining technology for proteomics applications, Rich said. The company was recently granted a patent on the use of its biomarker pattern discovery software for all types of mass spec data [BioInform 02-02-04], and Rich said it has filed several other patent applications for its software.
With the launch of its diagnostics division in January, Ciphergen signaled its commitment to establishing the ProteinChip Biomarker System, which includes the BPS software, as a staple in the clinical research realm. The company admits, however, that it faces some substantial hurdles in reaching that goal. Currently, BPS software users are in the “low double-digit percentages” of the company’s total customers, Rich said, “but in the future, we see it becoming a very standard product that we sell.” The trick, he added, “is moving the world from a single-marker mentality to a multi-marker mentality.”
Ultimately, Rich said, “We have to demonstrate that a multi-marker assay on SELDI can do what the entire diagnostic industry — and I would say the entire clinical research community — has failed to do in the last 15 years.” Improved software may help the company achieve that goal, but observers point out that there are other obstacles that Ciphergen and other proteomics-based diagnostics hopefuls must overcome. “The bigger problem is for spectra to look reproducible across labs and across time,” said Keith Baggerly, a biostatistician at the University of Texas MD Anderson Cancer Center and the co-author of a recent paper in Bioinformatics [2004 20: 777-785] that critiqued current pattern detection methods for mass spectra data. The CART algorithm that underlies BPS “is a reasonable way of looking at patterns,” Baggerly said, “but other methods would work just as well.”
And Ciphergen isn’t the only proteomics platform vendor targeting the clinical market. Last fall, Bruker Daltonics introduced its ClinProt set of tools for mass spectrometry-based biomarker analysis, which comes with the company’s ClinProTools software package. But Bruker is taking a slightly different approach in its software offering. Mark Flocco, business development manager for clinical proteomics and biomarker discovery at Bruker, said that the company opted to provide multiple classification algorithms in the software — some developed by Bruker and some available in the public domain. This choice permits users to compare the results of different analysis methods and select the patterns that appear to be the most biologically relevant. Flocco said that, like Ciphergen, Bruker has tried to address the overfitting problem with its software by making overfitting “not as easy to do.”
Flocco pointed out that software is likely to “make a difference” in the speed with which proteomics-based approaches are used in clinical settings, but added that this timeline may also depend on experimental variability, platform limitations, sample preparation, and other variables, including an important one that often gets overlooked: “the handling of the software in the individual’s hand.”