Breast Cancer Survival Prediction Bolstered by Genome-Wide Expression, Methylation Data


NEW YORK (GenomeWeb) – A recent study from researchers at Michigan State University and elsewhere showed that analyzing whole-genome expression and genome-wide methylation data from primary tumor samples provides a better indication of patient survival in breast cancer cases compared to the standard approach of evaluating clinical information such as cancer stage and molecular subtype.

According to a paper published last week in the Genetics Society of America's Genetics journal, the authors sought to explore whether whole-genome data could improve breast cancer survival prediction when used alone or in combination with data that clinicians use to assess patient outcomes such as age and ethnicity, tumor size, ancestor cell type, and cancer stage.

Existing tests such as Genomic Health's OncotypeDX panel analyze the expression of subsets of cancer-linked genes to assess patients' risk of recurrence. But rather than pre-select which handful of genes might best predict patient survival, "we used data from all the genes present in the cancers — approximately 17,000 in our study — and let our computational model select the informative ones," Ana Vazquez, a professor at Michigan State University and lead researcher on the study, explained in a statement. 

She further told GenomeWeb that she and her colleagues focused on breast cancer specifically because of the large inter-tumor variability observed in patients. "There is a great variability in tumors in [terms of] which tumors are aggressive and lead to metastasis or relapse and death of the patient and which tumors are actually not aggressive," she explained. Currently, about 80 percent of breast cancer patients are treated with aggressive chemotherapy, radiotherapy, or other adjuvant therapies post-surgery since physicians cannot predict which tumors will recur or metastasize. But breast cancers recur or metastasize in only about 40 percent of patients.

In fact, a separate study that assessed breast cancer patients five years post-treatment found more non-cancer related deaths in the patient population than cancer-related ones, according to Vazquez. This means that most patients who are being treated or have been treated with adjuvant therapies post-surgery, which carry unpleasant side effects such as infertility and heart damage, do not really need them. "We want to be able to predict as accurately as possible which patients will benefit from adjuvant therapies and which ones will not," she said.

For the study, Vazquez and her colleagues analyzed primary breast cancer samples from 285 patients whose samples were collected as part of the Cancer Genome Atlas project with a minimum follow-up time of three years. Omics datasets used for the study included gene expression profiles sequenced on an Illumina HiSeq 2000, copy number variants generated on Affymetrix genome-wide SNP arrays, and methylation data from the Illumina Infinium HumanMethylation450 Beadchip. They also gathered clinical information on the patients' tumors including histologic type, subtype classification, and cancer stage. They then used these datasets to build and train computational models that could predict patient outcomes.

Specifically, they developed a statistical framework called the Bayesian Generalized Additive Model (BGAM) that they used to predict the probability of survival after a breast cancer diagnosis and treatment. The models created using this framework can be implemented using the R-based Bayesian Generalized Linear Regression package. The researchers then conducted a series of studies to assess the predictive accuracy of their models when the input was whole-genome expression data, methylation data, or clinical data alone, or some combination of these inputs.

To train and test their risk assessment models, the researchers divided their dataset into two random groups and used one group to build and adjust their predictive models and the second group to test the accuracy of the model's performance. To assess the accuracy of their models' predictions, they repeated this process several hundred times for each comparison study, each time dividing the omic and clinical datasets in new ways. They then scored the results from the models by comparing them to an average value and by looking at how often a given model performed better than another on the dataset in question.
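The repeated random-split evaluation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' BGAM code: the data are synthetic, the 50/50 split proportion is assumed (the study's exact proportions are not given here), the feature counts are hypothetical, and a simple difference-of-class-means scorer stands in for the actual model. It reports the two summaries mentioned above: the average area under the curve (AUC) for each input set, and how often one input set beat the other across splits.

```python
import random
from statistics import mean


def auc(scores, labels):
    """Chance that a random positive case outscores a random negative one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroid_scorer(X, y):
    """Toy stand-in model (NOT BGAM): project onto the difference of class means."""
    n_feat = len(X[0])

    def centroid(label):
        rows = [x for x, lab in zip(X, y) if lab == label]
        return [mean(r[j] for r in rows) for j in range(n_feat)]

    w = [a - b for a, b in zip(centroid(1), centroid(0))]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


def make_synthetic(n, rng):
    """Hypothetical data: column 0 plays 'clinical', columns 1-5 play 'omics'."""
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        X.append([rng.gauss(0.5 * label, 1.0) for _ in range(6)])
        y.append(label)
    return X, y


def repeated_split_comparison(X, y, cols_a, cols_b, n_repeats, rng):
    """Return (mean AUC for A, mean AUC for B, fraction of splits where B beats A)."""
    aucs_a, aucs_b = [], []
    idx = list(range(len(y)))
    for _ in range(n_repeats):
        rng.shuffle(idx)  # a fresh random division on every repeat
        half = len(idx) // 2
        train, test = idx[:half], idx[half:]
        # Skip the rare split in which either half lacks one of the classes.
        if len({y[i] for i in train}) < 2 or len({y[i] for i in test}) < 2:
            continue
        y_train = [y[i] for i in train]
        y_test = [y[i] for i in test]
        for cols, out in ((cols_a, aucs_a), (cols_b, aucs_b)):
            model = fit_centroid_scorer([[X[i][j] for j in cols] for i in train], y_train)
            scores = [model([X[i][j] for j in cols]) for i in test]
            out.append(auc(scores, y_test))
    frac_b_better = sum(b > a for a, b in zip(aucs_a, aucs_b)) / len(aucs_a)
    return mean(aucs_a), mean(aucs_b), frac_b_better


rng = random.Random(0)
X, y = make_synthetic(200, rng)
clin_auc, combined_auc, frac = repeated_split_comparison(X, y, [0], list(range(6)), 50, rng)
print(f"clinical only: {clin_auc:.3f}  clinical + omics: {combined_auc:.3f}  "
      f"combined better in {frac:.0%} of splits")
```

Because both input sets are scored on the same held-out group in each repeat, the comparison is paired, which is what makes the "how often did one model beat the other" summary meaningful.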

According to their results, when the researchers incorporated whole-genome expression data from tumor samples into their models along with clinical information, their survival predictions improved by as little as two points to as much as seven points in area under the curve values over assessing clinical information alone.

Specifically, they found that whole-genome gene expression data is a better predictor of survival than any single source of information currently used by doctors, including cancer stage and molecular subtype. They also found that combining whole-genome expression data with clinical data resulted in better predictions than all of the clinical predictors combined. In addition, predictions based on whole-genome expression data outperformed predictions based on genes in the Oncotype Dx panel in the subset of patients that met the criteria for testing with the panel.

They saw the same effect when they repeated their analysis using genome-wide methylation data. According to the paper, methylation data alone also proved more predictive than all the standard clinical information. The predictions improved further when the researchers combined methylation data with clinical data. However, they note in the paper that additional studies are needed to assess whether the association between methylation and survival is due to carcinogenic factors that affect methylation pattern and breast cancer progression at the same time or due to mediation, meaning that the effects of carcinogenic factors are mediated by methylation.

The researchers also found that combining all three kinds of data (clinical, whole-genome expression, and methylation) resulted in the most accurate predictions of patient survival. Furthermore, the models' prediction accuracy improved when the researchers separated the samples by breast cancer subtype. For the results reported in this study, the researchers did not segregate the samples in that way because that would have reduced the sample size, but they do report that in one test where they focused on a specific breast cancer subtype, "we detected gains in prediction accuracy that were considerably larger than when all subtypes were considered jointly."

"Overall, we can conclude that the predictions keep improving as you add omics data" with the greatest effect seen by considering genome-wide omics data, Vazquez said. However, this does not negate the value of tests such as OncotypeDX or Agendia's MammaPrint, which focus on a subset of genes. "Pre-selecting a few genes is better than not using any genes at all," she said. But "we can see that the improvement [in prediction accuracy] is better than using whole-genome. All the genes that are expressing in the tumor are important [and] we believe that we are losing variability by pre-selecting a few genes."

The study also showed that not all types of genomic information are equally valuable for predictive models. For example, when the researchers combined clinical data with microRNA information, there was no change in the predictive accuracy of the models, they wrote, although they speculated that the absence of an association between miRNA and survival was due to small sample size. They also tried combining clinical data with copy number variant information and found that this resulted in some improvement in the predictive accuracy of the models, but not as much as with gene expression or methylation data.

The method is promising but there is still more work to be done before it can be used in the clinic. For instance, the researchers will need to develop and validate the models using data from thousands of patients rather than hundreds. Tests like OncotypeDX, for example, have been validated multiple times in several studies, Vazquez pointed out. Even though this study demonstrates improvement in survival prediction with genome-wide omics data, one study will not be enough to convince pathology labs to include this more comprehensive data in their tests. "When you think of the variability that tumors have, [in any given cancer subtype] you may have thousands of different tumors," she said. "So the more data that you have [in your model] the more likely [it is] … that one of these tumors will be similar to yours. But if your training data is small, you may not find any tumor similar to yours."

Also, for the computational models to work with new patient data, they need to be trained not only with data from a much larger population but also one that has been observed over a much longer period of time. One of the problems the researchers had with the TCGA data is that follow-up data on participating patients has not been collected for very long post-treatment, Vazquez said. For the models to be even more accurate, "you need 10 to 15 years [of] follow-up data [on patients]," she told GenomeWeb. The TCGA data collected so far covers only two to three years of the contributing patients' journey so it's still "very young."

Furthermore, pathology labs need to adopt standard testing platforms. That's because technological differences between platforms can introduce noise when a computational model trained on data from one platform is applied to data from another. As such, once the models have been trained, the same testing protocol and platform used to generate the test data has to be used for new patient data, Vazquez said. These labs also need in-house bioinformatics teams who can process the data and run the models, she added.

In addition to testing the data on a larger pool of subjects, Vazquez and her colleagues are also exploring ways to incorporate other kinds of data into the models, such as treatment regimens, and assessing their effects on outcome predictions. They also plan to expand their analysis to include other kinds of cancer beyond breast cancer using data from TCGA and other cancer projects. In total, they hope to have over 1,000 patients for this second round of testing, but if they can get more, that will be better, Vazquez said. They will also evaluate which stage of cancer would most benefit from this kind of analysis as part of their next steps.

The project is supported by multiple grants including several National Institutes of Health grants, a National Science Foundation grant, and an institutional research grant from the American Cancer Society.