Analysis of TCGA Samples Finds Molecular Data Adds Little Predictive Power to Clinical Models


NEW YORK (GenomeWeb) – Researchers with the National Cancer Institute's Cancer Genome Atlas project this week published an analysis of the clinical utility of molecular data – including proteomic data – collected through the initiative.

Detailed in a paper in Nature Biotechnology, the analysis found that molecular data added little predictive value to conventional clinical factors like gender, age, and tumor stage and grade. However, as Han Liang, a researcher at MD Anderson Cancer Center and author on the study, told ProteoMonitor, the analysis is not intended as a final word on the TCGA data, but rather an initial pass aimed at investigating how traditional clinical measures can be made more useful with the incorporation of molecular information.

And while the study findings showed little gain from the use of molecular data in general, proteomics fared somewhat better: the one molecular model that predicted patient survival as effectively as the standard clinical variable model was built on protein expression data in lung squamous cell carcinoma (LUSC).

Liang and his colleagues looked at TCGA molecular data including copy number variation, DNA methylation, mRNA expression, microRNA expression, and protein expression in four tumor types: kidney renal clear cell carcinoma (KIRC), glioblastoma multiforme (GBM), ovarian serous cystadenocarcinoma (OV), and the aforementioned LUSC.

For each of these cancer types, the researchers put together a core set of TCGA samples, all of which had information available on overall patient survival time, clinical variables, and at least four out of the five types of molecular data collected by the study.

Taking these core sets, they then used Monte Carlo cross-validation to determine the power of the various molecular data types or clinical variables in predicting patient survival. Randomly dividing the sample sets into training and test sets 100 times, the researchers built their predictive models using both the Cox and random survival forest methods.
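As a rough illustration of the resampling step described above, the following Python sketch generates 100 random training/test partitions of a sample set. It is not the researchers' code: the function name `monte_carlo_splits`, the 80/20 split ratio, and the fixed seed are assumptions made for the example, since the article does not state how the samples were divided. The sample count of 121 corresponds to the LUSC core set.

```python
import random


def monte_carlo_splits(n_samples, n_repeats=100, train_fraction=0.8, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits.

    train_fraction and seed are illustrative assumptions, not values
    reported for the study.
    """
    rng = random.Random(seed)
    n_train = int(round(n_samples * train_fraction))
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = indices[:]      # copy so each repeat starts from the full set
        rng.shuffle(shuffled)      # a fresh random ordering for this repeat
        yield shuffled[:n_train], shuffled[n_train:]


# Example: 100 random partitions of a 121-sample set
splits = list(monte_carlo_splits(n_samples=121))
```

In a full analysis, a survival model would be fit on each training partition and scored on the matching test partition, and the scores would then be summarized across all repeats.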

In all four cancer types, the models using only clinical variables showed significant predictive power, Liang said. On the other hand, the molecular models showed statistically significant predictive power in nine of 18 cases, and only in the case of the LUSC protein expression data did a molecular model have comparable predictive power to the clinical data-based models, giving a concordance index (C-index) of 0.632 compared to a C-index of 0.626 for the clinical model. (C-indexes range from 0.5, which indicates the same accuracy as a random guess, to 1, which indicates perfect accuracy.)
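For readers unfamiliar with the metric, the sketch below shows one common way a concordance index can be computed for censored survival data (Harrell's formulation). It is a simplified illustration rather than the study's implementation: the function name `concordance_index` is hypothetical, pairs with tied survival times are simply skipped, and tied risk scores are counted as half-concordant.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Fraction of comparable patient pairs ranked correctly by risk.

    times:       observed survival or follow-up times
    events:      1 if death was observed, 0 if the patient was censored
    risk_scores: model output, where higher means shorter predicted survival
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: ignore pairs with tied times
        # Let `a` be the patient with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the true order is unknown
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # higher risk assigned to the earlier death
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: four patients, all deaths observed
perfect = concordance_index([1, 2, 3, 4], [1, 1, 1, 1], [4, 3, 2, 1])
uninformative = concordance_index([1, 2, 3, 4], [1, 1, 1, 1], [1, 1, 1, 1])
```

Here `perfect` evaluates to 1.0 because every pair is ordered correctly, while `uninformative` evaluates to 0.5 because a constant risk score carries no ranking information.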

Furthermore, when Liang and his colleagues looked at whether the molecular models added predictive power to the clinical models, they found that in only three cases did molecular data increase predictive power – and, Liang noted, these improvements were relatively small.

In KIRC, the addition of mRNA data to the clinical model raised its predictive power by 4 percent. In OV, the addition of miRNA data raised the predictive power of the clinical model by 13.7 percent. And in LUSC, the addition of protein expression data raised the predictive power of the clinical model by 23.9 percent.

Examining the LUSC protein data, the researchers found that higher expression of pMEK1, pMAPK, and p56 was associated with shorter survival time, suggesting that high-risk LUSC patients have higher activation of the RAS/MEK/MAPK pathway and that this pathway might prove a therapeutic target for these patients.

The TCGA protein data was generated using reverse phase protein arrays (RPPA) by a team led by MD Anderson researcher Gordon Mills, who was also an author on the Nature Biotechnology paper. Mills is currently building the Cancer Proteome Atlas, an RPPA-based database of proteomic profiles in cancer samples that includes data on more than 500 cell lines and roughly 4,500 patient tumor samples, most of them from the TCGA project. In addition to the patient data currently included in the database, Mills and his colleagues have collected RPPA-based proteomic data on more than 70,000 patient samples, with primary areas of focus including leukemia as well as lung, head and neck, ovarian, endometrial, and breast cancer.

Liang said that in addition to Mills' RPPA data, he hopes to perform similar analyses using mass spec data, which, as an unbiased discovery approach, could enable researchers to look at many more features than the antibody-based RPPA technique, which is typically limited to measurements in the range of low hundreds of proteins.

Traditionally, mass spec has struggled in clinical tumor sample work as the small size of such samples limits the sensitivity and effectiveness of such analysis – though progress has been made on that front.

Another potential challenge presented by the mass spec approach is the number of proteins measured, Liang said, noting that in order to build effective models from this data it will likely be necessary to increase the sample sizes used.

"That is maybe a limitation, because if the number of [proteins] for feature selection is increased but the sample size is still limited, then the feature selection will not be effective," he said. Even in the case of the Nature Biotechnology study, "the sample size was relatively limited compared to the number of [molecular] features we profiled," he said.

In the study the researchers looked at core sets of 210 GBM samples, 243 KIRC samples, 379 OV samples, and 121 LUSC samples.

Despite the relatively poor performance of the molecular data, Liang said he expected that future analyses would generate more effective models.

"We see this as just a starting point," he said. "We just used a basic [bioinformatic] approach to do the analysis and then make [that analysis] available to the community."

Perhaps a more important goal than the analysis itself, Liang said, was to provide the larger community with a transparent look at the TCGA data and his team's initial effort in using it to build predictive models.

"We've provided not only the TCGA data, but also our own source code, so people can download the data and the code and they can also upload their [analysis] results and their code, so that their prediction results can be assessed," he said.