NEW YORK (GenomeWeb) – Researchers from the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (CPTAC) have demonstrated that protein-level data may be more accurate than transcriptome data for predicting gene function.
Detailed in a study published last week in Molecular and Cellular Proteomics, the findings indicate that proteomic information should, when possible, be included as part of gene function analyses, said Bing Zhang, a researcher at Baylor College of Medicine and senior author of the paper. The results also suggest that, despite longstanding concerns, the quality of proteomics data is as good as that of transcriptomic data generated via methods like RNA sequencing.
As Zhang and his co-authors noted, a common approach to predicting gene function is to analyze the co-expression of various mRNAs under different conditions, the idea being that gene products with similar expression patterns are likely to be functionally similar.
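As a rough illustration of that co-expression idea, the sketch below assigns a function to an unannotated gene by letting its most strongly correlated annotated neighbors vote. It is a minimal toy under stated assumptions, not the method used in the MCP study: the gene names, expression values, annotations, and the `pearson` and `predict_function` helpers are all hypothetical, and Pearson correlation with a simple top-k vote stands in for whatever scoring the authors actually applied.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles.

    Assumes both profiles vary across samples (non-zero variance).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def predict_function(query, expression, annotations, k=2):
    """Predict a function for `query` from its k most co-expressed annotated genes.

    Each neighbor votes for its own annotation, weighted by its correlation
    with the query profile.
    """
    scored = sorted(
        (
            (pearson(expression[query], profile), gene)
            for gene, profile in expression.items()
            if gene != query and gene in annotations
        ),
        reverse=True,
    )
    votes = {}
    for r, gene in scored[:k]:
        label = annotations[gene]
        votes[label] = votes.get(label, 0.0) + r
    return max(votes, key=votes.get)


# Hypothetical abundance profiles across five samples (made-up numbers).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [1.1, 2.1, 2.9, 4.2, 5.1],
    "geneC": [2.0, 4.1, 6.0, 8.2, 9.9],
    "geneD": [5.0, 4.0, 3.1, 2.0, 1.0],
    "geneE": [4.8, 4.1, 3.0, 2.2, 0.9],
    "query": [0.9, 2.2, 3.1, 3.9, 5.2],
}

# Hypothetical functional annotations for the known genes.
annotations = {
    "geneA": "lipid biosynthesis",
    "geneB": "lipid biosynthesis",
    "geneC": "lipid biosynthesis",
    "geneD": "cell adhesion",
    "geneE": "cell adhesion",
}

print(predict_function("query", expression, annotations))
```

The same routine could be run on either an mRNA matrix or a protein matrix for the same samples, which is the kind of comparison the study describes.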
Researchers have traditionally analyzed mRNA for such work due to the relative ease of transcriptomic profiling and the large collections of transcriptomic data available. With the continuing improvement in mass spec-based proteomics technologies, though, it has become plausible to use protein-level data for such analyses.
Because proteins are the molecules that ultimately carry out biological functions, "intuitively one might think that protein data would be better" for gene function prediction work, Zhang said, but there have been lingering concerns regarding the quality of large-scale proteomics data.
"Though we can now generate genome-wide proteomics data, still people have the sense that proteomics data quality is not as good as RNA-seq data," he said. He added that this sense has also played into interpretations of the frequently observed discordance between mRNA and protein expression.
"Many papers have reported low correlation between mRNA and protein expression, but then the question is, is that simply because the proteomics data is not good enough?" Zhang said.
The MCP study indicates both that proteomics data is of high quality and that a significant portion of protein expression is regulated at the post-transcriptional level and so is not reflected by mRNA expression alone, he said.
The work used datasets from the CPTAC initiative, which has profiled the proteomes of breast, ovarian, and colorectal tumors that previously underwent genomic and transcriptomic analysis as part of the NCI's Cancer Genome Atlas project.
Efforts to explore whether protein or mRNA data would enable better gene function prediction have been limited by a lack of large-scale proteomic datasets with matching mRNA data, Zhang said.
"There is a lot of mRNA expression data, but the protein expression data has been very limited," he said. "Previously, the proteomics data platforms have not been as good [as current platforms], and sometimes you would only have a few hundred proteins to look at."
"The beauty of this study, I think, is that we have three different cancer types and much larger numbers of proteins, and all of them point to the same conclusion," he said.
The researchers' proteomics dataset consisted of 90 colorectal cancer samples covering 3,899 proteins, 174 ovarian cancer samples covering 3,327 proteins, and 77 breast cancer samples covering 6,281 proteins.
On the RNA-seq side, the dataset consisted of 264 colorectal cancer samples covering 20,501 genes, 541 ovarian cancer samples covering 17,814 genes, and 1,058 breast cancer samples covering 20,501 genes.
The total overlap between the proteomic and transcriptomic datasets was 87 samples and 3,764 genes for colorectal cancer, 174 samples and 2,988 genes for ovarian cancer, and 77 samples and 5,988 genes for breast cancer.
The researchers used Gene Ontology and KEGG pathway analysis to compare the ability of mRNA and protein co-expression data to predict gene function, finding that in 75 percent of Gene Ontology biological processes and 90 percent of KEGG pathways, the use of protein data improved predictions.
They also used a web tool called Gene2Net that they developed based on the colorectal, breast, and ovarian cancer datasets to identify new gene-function relationships, including establishing a role for HER2 in lipid biosynthesis processes in breast cancer and identifying AEBP1 as a marker of epithelial-to-mesenchymal transition.
The findings, Zhang said, further reinforce the importance of post-transcriptional regulation and the ability of proteomic data to account for such phenomena.
"The significant improvement [the researchers observed] indicates that the post-transcriptional-level regulation plays an important role in coordinating gene functions," he said. "So if you only look at the co-expression of the mRNA to predict the gene function, a lot of times you may get the wrong result. But if you use protein co-expression, which takes into account all the levels of regulation, you will get a better prediction."
The study also indicates that the level of improvement offered by proteomic data varies with the genes and processes being investigated. For instance, Zhang noted, while functional predictions were improved across all of the cancer types he and his colleagues looked at, the improvements were most significant in ovarian cancer.
"We think this is in part because ovarian cancer has a lot of [gene] copy number variation, and if you have a copy number variation, all of the [corresponding] mRNAs will tend to have higher expression levels and will then show high co-expression at the mRNA level," he said. However, as has been previously reported, buffering effects at the post-transcriptional level prevent this higher mRNA expression from being entirely translated to the protein level.
"Maybe the [stronger] improvement in ovarian cancer compared to breast cancer and colon cancer indicates a stronger buffering effect [at the protein level] to adjust for the strong genomic level dysregulation in ovarian cancer," Zhang said.
The predictions should continue to improve as proteome coverage increases, he added. "If we can do global measurements with even better coverage at the proteome level, it will work even better."