A team of researchers at Duke University has offered evidence that “data integration” and “clinical genomics” are more than just buzzwords. In a recent study, the group created clinicogenomic models that combined gene expression microarray data with clinical information to predict breast cancer recurrence with much higher accuracy than either data type achieved alone.
Joseph Nevins, head of the department of molecular genetics and microbiology at Duke University Medical Center and co-author on the paper, said the team’s goal was to increase the precision with which doctors predict cancer outcomes. “If 100 breast cancer patients were to walk through the door over the next week or two, one is really dealing with 100 different diseases,” Nevins said. But despite the level of heterogeneity among cancer patients, most methods for determining the likelihood of recurrence place patients into one of two groups: high-risk or low-risk. This is true for methods based on clinical data alone, and even many microarray-based methods tend to follow a similar pattern, Nevins said.
“The reality is that within a group of individuals that has a prognosis of a 50 percent chance of recurrence, there might be individuals that range anywhere from 30 percent to 70 percent. But they’re all lumped into that 50 percent prognosis because of the lack of refinement in the data that can point toward the actual risk for an individual patient,” he said. This broad range of uncertainty has real implications in clinical settings, Nevins noted, when the difference between a 30 percent chance of recurrence and a 70 percent chance of recurrence could determine whether a patient undergoes chemotherapy or not.
Nevins and his colleagues used data from 158 breast cancer patients at the Koo Foundation Sun Yat-Sen Cancer Center in Taipei, Taiwan. For each patient, around 200 clinical data points described tumor size, estrogen receptor status, age, menopause status, lymph node status, and other factors. In essence, Nevins said, “We’re treating the gene expression data as just additional clinical data points” in a statistical model. The team used “metagenes,” which represent subsets of co-expressed genes, in order to reduce the dimensionality of the expression data set while still representing its heterogeneity. These metagenes were fed along with the clinical data points into a statistical framework based on Bayesian classification trees.
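As a rough illustration of that idea, and not the Duke team's actual method, the sketch below collapses each cluster of co-expressed genes into a single metagene score and appends it to a patient's clinical record. The averaging step, the function names, the field names, and the toy values are all assumptions made for illustration; the article does not say how the metagenes were computed, and the Bayesian classification tree itself is not shown.

```python
from statistics import mean


def metagene_scores(expression, clusters):
    """Collapse gene-level expression into one score per cluster.

    expression: dict of gene name -> list of values, one per patient
    clusters:   dict of metagene name -> list of gene names in that cluster
    Averaging is an assumed stand-in for whatever summary the study used.
    """
    scores = {}
    for name, genes in clusters.items():
        # zip(*) regroups the per-gene lists into per-patient tuples
        per_patient = zip(*(expression[g] for g in genes))
        scores[name] = [mean(values) for values in per_patient]
    return scores


def combine(clinical, metagenes):
    """Build one feature row per patient: clinical fields plus metagene scores."""
    rows = []
    for i, record in enumerate(clinical):
        row = dict(record)  # copy so the clinical record is left untouched
        for name, values in metagenes.items():
            row[name] = values[i]
        rows.append(row)
    return rows


# Hypothetical toy data for three patients
expression = {
    "gene_a": [2.0, 4.0, 6.0],
    "gene_b": [4.0, 6.0, 8.0],
    "gene_c": [1.0, 1.0, 3.0],
}
clusters = {"metagene_1": ["gene_a", "gene_b"], "metagene_2": ["gene_c"]}
clinical = [
    {"tumor_size_cm": 1.5, "er_positive": 1, "positive_nodes": 0},
    {"tumor_size_cm": 2.8, "er_positive": 0, "positive_nodes": 3},
    {"tumor_size_cm": 4.1, "er_positive": 1, "positive_nodes": 7},
]

rows = combine(clinical, metagene_scores(expression, clusters))
for row in rows:
    print(row)
```

Each row now treats the metagene scores as "just additional clinical data points," and could be handed to any classifier that accepts mixed features.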
In a recent PNAS paper, the authors wrote that the clinicogenomic model “statistically dominates” models built with genomic data alone, with a difference in approximate log-model likelihoods of more than 7. The clinicogenomic model outperformed the clinical predictors alone by more than 27 units on the log-likelihood scale, “indicating the latter to be of no interest at all relative to the clinicogenomic model,” the authors wrote.
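To give a sense of scale for those figures, a gap on the log-likelihood scale can be converted into a ratio of likelihoods by exponentiating it. The short sketch below does that arithmetic, assuming the reported values are natural logarithms (the article does not state the base); the function name is hypothetical. Because the gaps are reported as "more than" 7 and 27, the results are lower bounds.

```python
import math


def likelihood_ratio(log_likelihood_gap):
    """Convert a gap in log-likelihood into a likelihood ratio.

    Assumes natural logarithms, which is an assumption, not a stated fact.
    """
    return math.exp(log_likelihood_gap)


vs_genomic = likelihood_ratio(7)    # clinicogenomic vs. genomic-only model
vs_clinical = likelihood_ratio(27)  # clinicogenomic vs. clinical-only model

print(f"Gap of 7 units:  ratio of about {vs_genomic:,.0f}")
print(f"Gap of 27 units: ratio of about {vs_clinical:.1e}")
```

Under that assumption, a gap of 7 units corresponds to a likelihood ratio of roughly a thousand, and a gap of 27 units to a ratio in the hundreds of billions, which helps explain the authors' blunt dismissal of the clinical-only model.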
Most microarray studies, Nevins said, “are really set up to say, ‘Here’s a gene-expression-based predictor that does better than the clinical.’ And in many cases, that is true, but I think it’s important to not set these studies up as one or the other — there’s no reason to do that if both forms of data contribute information. The idea is to just build a model that combines the two, and build the best model that you can irrespective of what the data is.”
Nevins said that this work is significant in proving that combined predictors are possible, but stressed that the framework is “still more or less a research tool” and “not something that we would commercially develop.” Nevins cautioned that commercial firms that are moving into this area may be acting prematurely. “I don’t think that the studies that have led to those [commercial approaches] have been fully validated, and we’re trying to do our best to make sure that the predictive models being generated have really been validated in a variety of settings to be certain how well they perform — not just in one study, but in several different studies.” Even when the approach does move into the clinic, Nevins said it’s likely that it would initially be used as an “adjunct” method rather than the sole source of assessment.
The model was designed to accommodate any form of information, so the Duke team is currently studying protein expression data for the same breast cancer samples to determine whether proteomics data is “synergistic” with the approach, Nevins said. The team is also applying the method to ovarian cancer and other cancer types.