Baylor Researchers Develop Tool for Analyzing TCGA, CPTAC Datasets

NEW YORK (GenomeWeb) – Baylor College of Medicine researchers have developed a multi-omics database and analysis tool for exploring data from The Cancer Genome Atlas (TCGA) and Clinical Proteomic Tumor Analysis Consortium (CPTAC) initiatives.

Called LinkedOmics, the resource currently contains genomic, transcriptomic, proteomic, and clinical data for 11,158 patients spanning 32 different cancer types and comprises more than a billion data points. It also includes three analysis modules that allow researchers to explore associations between molecular and clinical attributes within and across cancer cohorts and to place these associations within the context of cellular pathways and networks.

Detailed in a paper published last month in Nucleic Acids Research, the resource is the first to integrate tools for association-based queries with the TCGA and CPTAC datasets in a web-based, user-friendly interface, said Bing Zhang, a BCM professor and senior author of the study.

He noted that to date, much of the informatics work around TCGA and CPTAC data has focused on processing the raw sequencing and mass spec data so that it can be accessed by general biologists, and on establishing tools for querying specific analytes of interest.

More recently, he said, he has received a number of inquiries from colleagues interested in looking at associations between different types of data in these datasets.

"For instance, they are interested in a phenotype like survival, and they want to know which genes and proteins are correlated with survival," Zhang said. "Or they are interested in a mutation, and they want to know what downstream proteomic changes may be associated with that mutation."

"Everyone wants to ask these types of questions, and we think these associative questions are the foundation of a lot of biological research," he added. "But for the TCGA and CPTAC datasets, there were no existing tools that could let biologists easily get these answers."

Researchers could, of course, download TCGA or CPTAC datasets of interest and use existing software to perform differential expression or pathway analyses, Zhang said, but this requires a level of expertise not available to every lab that might want to explore these datasets.

"I think the major advance here is that we bring the data and tools together in one place with a very user-friendly interface," he said.

One challenge to exploring the datasets is the multitude of datatypes included. For instance, the TCGA dataset includes mutation, copy number alteration, methylation, mRNA expression, miRNA expression, and reverse phase protein array data for the samples analyzed, along with clinical information, including overall survival time, tumor site, age, histological type, lymphatic invasion status, lymph node pathologic status, primary tumor pathologic spread, tumor stage, and vascular invasion status. A subset of TCGA samples were also analyzed as part of the CPTAC project, and for those samples, mass spec-based proteomic, phosphoproteomic, and glycoproteomic data is available.

The LinkedOmics resource allows researchers to analyze these data using three modules. The first, called LinkFinder, allows users to explore associations between a molecular or clinical measure and all other measures for a given cancer cohort. For instance, the authors noted, researchers might look at the relationship in breast cancer between ERBB2 amplification and protein phosphorylation levels.
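The kind of one-versus-all query described for LinkFinder can be illustrated with a minimal Python sketch. This is not the LinkedOmics implementation: the article does not say which statistics the tool uses, so Pearson correlation is assumed here, and the function names, site labels, and values are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant attribute carries no association signal
    return cov / math.sqrt(var_x * var_y)


def scan_associations(target, features):
    """Correlate one target attribute with every other attribute, one at a time.

    Returns (name, r) pairs sorted by absolute correlation, strongest first.
    """
    results = [(name, pearson(target, values)) for name, values in features.items()]
    return sorted(results, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data: five samples with made-up measurements.
erbb2_copy_number = [1.0, 2.0, 3.0, 4.0, 5.0]
phospho_levels = {
    "site_A": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with the target
    "site_B": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as the target rises
    "site_C": [1.0, 3.0, 2.0, 3.0, 1.0],   # no linear relationship
}

ranked = scan_associations(erbb2_copy_number, phospho_levels)
for name, r in ranked:
    print(f"{name}: r = {r:+.2f}")
```

Each attribute is tested on its own against the target, which is what makes this a univariate scan. A real analysis would also need a test suited to each data type (for example, a survival model for outcome data) and correction for multiple testing.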

The second module, called LinkCompare, allows for comparisons of associations identified in LinkFinder. Researchers can compare different associations identified within the same dataset, or the same associations across different datasets. For instance, the authors wrote, users might compare the proteins associated with KRAS mutations in colorectal cancer to those associated with BRAF mutations in the same disease. Or they might look at genes linked to survival in several different cancer types, or molecules linked to survival in both ovarian cancer copy number data and ovarian cancer proteomics data.
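A comparison of this kind can be sketched in a few lines of Python. The example below is an illustration only and does not reflect how LinkCompare works internally: the score dictionaries, the protein labels, and the cutoff of 0.5 are all hypothetical.

```python
def compare_associations(scores_a, scores_b, threshold):
    """Split features into those associated in both analyses or in only one.

    scores_a and scores_b map feature names to association scores
    (for example, correlation coefficients); a feature counts as
    associated when the absolute score reaches the threshold.
    """
    hits_a = {name for name, score in scores_a.items() if abs(score) >= threshold}
    hits_b = {name for name, score in scores_b.items() if abs(score) >= threshold}
    return {
        "shared": sorted(hits_a & hits_b),
        "only_a": sorted(hits_a - hits_b),
        "only_b": sorted(hits_b - hits_a),
    }


# Hypothetical scores for proteins associated with two different mutations.
kras_scores = {"P1": 0.82, "P2": -0.61, "P3": 0.10, "P4": 0.55}
braf_scores = {"P1": 0.74, "P2": 0.05, "P3": -0.70, "P4": 0.58}

comparison = compare_associations(kras_scores, braf_scores, threshold=0.5)
print(comparison)
```

The same function applies whether the two result sets come from different attributes in one cohort or from the same attribute in different cohorts, since it only needs feature names that match across the two inputs.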

The third module, LinkInterpreter, uses gene set and pathway analyses to place the associations identified in the previous two modules into a biological context. For this analysis, it uses functional data from the KEGG, Panther, Reactome, and WikiPathways databases, along with protein-protein interaction, transcription factor-target, miRNA-target, and kinase-target data.
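One common form of gene set analysis is an over-representation test, which asks whether a list of associated genes overlaps a pathway more than chance would predict. The sketch below uses a one-sided hypergeometric test as an assumed stand-in; the article does not state which enrichment methods LinkInterpreter applies, and the gene labels and set sizes are hypothetical.

```python
from math import comb


def enrichment_p_value(hits, gene_set, background_size):
    """One-sided hypergeometric p-value for over-representation.

    Gives the probability of seeing at least the observed overlap between
    `hits` and `gene_set` if the hits were drawn at random from a background
    of `background_size` genes. Both inputs are assumed to be subsets of
    that background.
    """
    hits = set(hits)
    gene_set = set(gene_set)
    overlap = len(hits & gene_set)
    n_hits = len(hits)
    n_set = len(gene_set)
    # Sum the probabilities of every overlap at least as large as observed.
    tail = sum(
        comb(n_set, i) * comb(background_size - n_set, n_hits - i)
        for i in range(overlap, min(n_hits, n_set) + 1)
    )
    return tail / comb(background_size, n_hits)


# Hypothetical example: 20 background genes, a 5-gene pathway,
# and 4 associated genes of which 3 fall in the pathway.
pathway = {"G1", "G2", "G3", "G4", "G5"}
associated = {"G1", "G2", "G3", "G17"}

p = enrichment_p_value(associated, pathway, background_size=20)
print(f"p = {p:.4f}")
```

In practice the test would be repeated for every pathway in a database such as KEGG or Reactome, and the resulting p-values adjusted for multiple testing.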

In the Nucleic Acids Research study, Zhang and his colleagues provided five case studies of the tool. They used it to examine the impact of RB1 mutations on mRNA expression in bladder cancer and the impact of HER2 amplification on protein phosphorylation in breast cancer; to identify a protein signature linked to poor outcomes in ovarian cancer; to identify a gene expression signature linked to survival across 12 different cancer types; and to link the marker APCDD1L, identified via the 12-cancer gene expression analysis, to tumor invasiveness and aggressiveness.

Zhang said that he and his colleagues have begun receiving requests from outside researchers who would like to add their multi-omic datasets to the resource, and that they have added two new datasets from BCM collaborators.

Zhang said he hopes outside researchers will eventually be able to upload their data independently, but he noted that quality control remains a challenge in this regard.

"We don't want to open this to everyone to upload their data yet because we want to make sure that at this stage, all the data is carefully annotated," he said. "We actually spent a lot of time even on the TCGA data, especially on the clinical part, where we had to clean it up and make it standardized."

Zhang and his colleagues are now particularly interested in adding datasets that contain drug sensitivity information, he said, such as sets generated from cell line experiments or patient-derived xenografts.

"One limitation of the TCGA data is that we don't really have a lot of treatment response type of information," he said. "But if we can get cell line or PDX data with treatment information, that will add a lot of value."

In addition to growing the database, the researchers are working to expand its analysis capabilities.

"Currently, the association studies are based on univariate analysis," Zhang said. "So of course we want to implement more sophisticated statistical tools to support multivariate analysis and maybe add more machine learning components."