NEW YORK (GenomeWeb) — A study of RNA-seq data from several Cancer Genome Atlas cohorts has found that patterns of pseudogene expression correspond strongly with several molecularly or clinically defined cancer subtypes and, in some cases, might offer additional subtyping or prognostic prediction.
The results of the study, led by Han Liang, an MD Anderson Cancer Center professor of bioinformatics and computational biology, appeared this week in Nature Communications. The group found initially that pseudogene signatures corresponded closely with established molecular subtypes in several cancers, including uterine, breast, and kidney cancer. The researchers also found that in kidney cancer, pseudogene expression could offer the ability to further subdivide patients and more accurately predict their prognosis than current clinical measures or other molecular strategies.
According to Liang, investigations of pseudogene expression have previously been limited by small sample sizes. But with the TCGA database, which includes global RNA-seq data in addition to sequence data of protein coding genes for a number of different cancers, he and his group realized they had a new opportunity to look broadly at pseudogene expression and how it might correspond to established prognostic and clinical cancer subtypes.
Liang told PGx Reporter that his team became interested in the potential of pseudogene expression to define cancer subtypes and prognosis after reports of single pseudogenes playing a regulatory role in the expression of protein coding genes like PTEN and KRAS began to emerge several years ago.
Liang and his colleagues are focused, he said, on the role of the "dark matter" of the genome, rather than the widely studied protein-coding portion of the genome in disease. "There are 20,000 protein-coding genes, and probably equal numbers of pseudogenes, the vast majority of which are transcribed," he said.
"Ninety-nine percent of research is focused only on protein-coding genes," he said. "But [pseudogenes represent] an equally large population of transcribed molecules, and with individual studies suggesting that they could play a regulatory role in the transcription of protein coding genes through multiple mechanisms … that suggested we should look at this in a large scale."
"Before, TCGA RNA-seq data was only generated in small scale, so this gave us the first large-scale dataset to assess pseudogene subtype classification and prognostic value," he added.
Liang said that he and his coauthors began their project by developing a computational strategy to quantify the expression of pseudogenes in TCGA RNA-seq data by filtering out reads that were least likely to correspond to pseudogene sequences using two databases, the Yale pseudogene database and the GENCODE pseudogene resource.
Overall, the researchers analyzed expression levels of 9,925 pseudogenes in more than 378 billion RNA-seq reads from the TCGA data for 2,808 samples representing seven cancer types: breast cancer, glioblastoma multiforme, kidney cancer, squamous cell lung cancer, ovarian cancer, colorectal cancer, and uterine cancer.
The researchers then looked for differentially expressed pseudogenes associated with subtypes of these cancers, for example, serous versus endometrioid uterine cancers, or the breast cancer molecular subtypes HER2-enriched, basal-like, luminal A, luminal B, and normal-like.
According to the authors, numerous pseudogenes showed significant differential expression in these different cancer subtypes: 48 in uterine cancer, 138 in lung cancer, 71 in glioblastoma, and 547 in breast cancer.
Liang and his colleagues looked more closely at a subset of these samples, the uterine cancer cohort, to investigate the potential clinical utility of the differentially expressed pseudogenes they discovered.
Dividing the samples into a training and a testing set, the researchers found that a pseudogene expression profile could accurately classify the two histological subtypes, which have distinct pathological and clinical characteristics. The classification yielded an area under the receiver operating characteristic curve of around 0.9, varying slightly depending on which algorithm the group used to evaluate the data.
According to the authors, the performance of the pseudogene expression signature was comparable with that of mRNA expression, suggesting that either approach, or both, could effectively classify uterine cancer subtypes.
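For readers unfamiliar with the metric, the area under the receiver operating curve can be read as the probability that a randomly chosen sample of one subtype receives a higher classifier score than a randomly chosen sample of the other. The sketch below illustrates that definition only; it is not the authors' code, the function name `roc_auc` is hypothetical, and the labels and scores are made-up toy values rather than TCGA data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample scores higher; tied pairs count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy held-out set (illustrative values only):
# 1 = serous, 0 = endometrioid; scores are hypothetical classifier outputs.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

In this toy set, eight of the nine serous/endometrioid pairs are ranked correctly, so `toy_auc` is 8/9, or roughly 0.89. A value of 0.5 would correspond to chance-level ranking.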
The researchers reported that they also looked at the concordance of pseudogene expression with established molecular and clinical subtypes in the other cancers and found high concordance in many of them.
In breast cancer, for example, the team used consensus clustering to divide 837 breast cancer samples into four subtypes based on pseudogene expression. These four groups corresponded closely, though not exactly, with the established PAM50 molecular subtypes, as well as with ER/PR/HER2 status.
One subtype contained 70 of 139 basal-like samples, while another contained 382 of 390 total luminal A and luminal B samples. A third pseudogene-defined subset covered 50 of 67 HER2 samples.
According to the authors, the results suggest that pseudogene expression may represent a complementary and independent approach for defining and investigating different cancer molecular subtypes.
In a final analysis, the group also directly investigated the clinical potential of pseudogene expression in a third cancer type, kidney cancer. Interestingly, the results suggested that pseudogene expression might be able to distinguish prognostic kidney cancer subtypes that other molecular or clinical measures cannot.
Focusing on the 500 pseudogenes with the most variable expression in the kidney cancer group, the researchers classified 446 total samples into two distinct pseudogene-defined subtypes with significantly different prognoses: 75 versus 63 months on average.
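Selecting the features with the most variable expression is a common first step before clustering. A minimal sketch of that filtering step follows, assuming variance across samples as the measure of variability; the article does not say which measure the authors used. The function name `top_variable_features` and the pseudogene names and values are hypothetical.

```python
from statistics import pvariance


def top_variable_features(expression, k):
    """Return the k feature names with the highest variance across samples.

    `expression` maps a feature name to its list of per-sample values.
    """
    ranked = sorted(
        expression,
        key=lambda name: pvariance(expression[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy expression table (made-up values, three features by four samples).
toy_expression = {
    "pseudogene_A": [1.0, 1.1, 0.9, 1.0],   # nearly constant
    "pseudogene_B": [0.0, 5.0, 0.5, 6.0],   # highly variable
    "pseudogene_C": [2.0, 3.0, 2.0, 3.0],   # moderately variable
}
top_two = top_variable_features(toy_expression, 2)
```

Here `top_two` keeps `pseudogene_B` and `pseudogene_C` and drops the nearly constant `pseudogene_A`. In the study the analogous cutoff was 500 pseudogenes.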
According to the authors, patients in the higher-risk subset may benefit from earlier, more aggressive therapies than those in the lower-risk set.
The group then divided the cohort into four risk groups based on overall survival. According to the authors, while clinical factors and pseudogene expression could distinguish the highest and lowest risk groups, clinical variables alone and other molecular markers like mRNA and microRNA failed to separate the two middle-risk groups. Pseudogene expression, however, was able to separate the two middle-risk subsets.
The results seem to suggest a unique and added value for pseudogene expression in differentiating patient prognoses in this cancer.
Liang said he and his colleagues hope to validate these findings in an independent set of RNA-seq data. "We want to look for other independent large-scale data to see if the pattern we saw can be recaptured. If that can be confirmed we can seriously move to a more clinical setting like a clinical trial scale to see if we can develop a pseudogene prognosis model to help stratify and treat patients," Liang said.
The team also plans to continue to investigate additional TCGA cancer types, he added, as well as expand their efforts to look at RNA-seq data generated in house at MD Anderson.