BALTIMORE – Cornell University researchers and their collaborators have developed a Bayesian statistical model that can jointly impute cell type composition and cell type-specific gene expression from bulk RNA sequencing (RNA-seq) data using single-cell RNA sequencing (scRNA-seq) data as a reference.
In a study published in Nature Cancer last month, the researchers showcased the model’s robustness in deconvolving both malignant and nonmalignant cell types as well as estimating their gene expression profiles in heterogeneous cancer samples. Named Bayesian cell proportion reconstruction inferred using statistical marginalization, or BayesPrism, the model is tissue-agnostic and can be applied to non-cancer samples, as well.
According to Tinyi Chu, a postdoctoral researcher at Memorial Sloan Kettering Cancer Center and the paper’s lead author, the model was developed to address current hurdles in gleaning cell-specific biological insights from widely available bulk RNA-seq data. Chu initially developed the model as part of his graduate work in Charles Danko's lab at Cornell University.
Bulk RNA-seq, which relies on homogenizing a tissue sample and sequencing the RNA, is a common way to profile gene expression on the tissue level. However, it does not offer researchers the resolution to determine cell type composition or cell type-specific gene expression, according to Chu.
Meanwhile, scRNA-seq allows scientists to study a sample on a single-cell level, but the technology still has a high price tag and is hampered by batch effects and challenges in capturing different cell types, making it difficult to scale, Chu pointed out.
A handful of cell type deconvolution methods have previously been pioneered by other researchers to bridge bulk RNA-seq and scRNA-seq data, but Chu said these models are primarily regression-based. By constructing a reference expression matrix from a set of arbitrarily defined marker genes, these methods assume the scRNA-seq data are the true reference for the bulk data, overlooking the technical batch effect and biological variation between bulk and reference data, he added.
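The regression-based idea Chu describes can be sketched in a few lines. The example below is a minimal illustration, not the algorithm of any published tool: the reference matrix, the function names, and the choice of multiplicative updates as the non-negative least-squares solver are all assumptions made for this sketch.

```python
# Toy reference matrix: rows are genes, columns are cell types (hypothetical values).
REFERENCE = [
    [10.0, 1.0, 1.0],   # marker gene for cell type A
    [1.0, 10.0, 1.0],   # marker gene for cell type B
    [1.0, 1.0, 10.0],   # marker gene for cell type C
    [5.0, 5.0, 5.0],    # gene expressed equally in all three types
]


def mix(reference, fractions):
    """Simulate a bulk profile as the fraction-weighted sum of reference profiles."""
    return [sum(w * f for w, f in zip(row, fractions)) for row in reference]


def deconvolve(reference, bulk, n_iter=500):
    """Estimate non-negative cell type fractions by least squares.

    Uses multiplicative updates, which keep every fraction non-negative.
    """
    n_types = len(reference[0])
    theta = [1.0 / n_types] * n_types  # start from equal fractions
    # Precompute W^T b and W^T W, the two quantities each update needs.
    wtb = [sum(row[k] * b for row, b in zip(reference, bulk)) for k in range(n_types)]
    wtw = [[sum(row[j] * row[k] for row in reference) for k in range(n_types)]
           for j in range(n_types)]
    for _ in range(n_iter):
        denom = [sum(wtw[j][k] * theta[k] for k in range(n_types))
                 for j in range(n_types)]
        theta = [t * num / den for t, num, den in zip(theta, wtb, denom)]
    total = sum(theta)
    return [t / total for t in theta]  # normalize so fractions sum to 1


true_fractions = [0.5, 0.3, 0.2]
bulk_profile = mix(REFERENCE, true_fractions)
estimated = deconvolve(REFERENCE, bulk_profile)
print([round(f, 3) for f in estimated])
```

The sketch recovers the mixing fractions only because the bulk profile was built from the very same reference matrix. That is the assumption Chu criticizes: in real data, batch effects and biological variation mean the reference profiles do not match the bulk sample exactly.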
To solve that issue, Chu’s team developed a Bayesian approach that not only imputes cell type fractions but also jointly reconstructs cell type-specific gene expression within bulk RNA-seq data using scRNA-seq data as a reference. Because the model explicitly accounts for the differences in gene expression between the single-cell reference and the bulk data, the two estimates can correct each other, yielding a more robust deconvolution, Chu said.
The researchers benchmarked the model with pseudo-bulk RNA-seq data constructed by combining reads from scRNA-seq data, which provides a predefined ground truth. Specifically, they used single-cell data obtained from 28 glioblastoma patient samples using Smart-seq as a surrogate for bulk RNA-seq data, and microwell-based scRNA-seq data from eight patients as the reference to simulate both technical batch effects and biological variation. Comparing the inferences with the ground truth, the group found that BayesPrism significantly outperformed all existing methods in cell type deconvolution.
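The pseudo-bulk idea can be illustrated with a short sketch. The gene names, cell labels, and counts below are hypothetical, and defining the ground truth as the fraction of cells of each type is an assumption of this example; the study may define it differently.

```python
from collections import Counter

# Hypothetical single-cell data: (cell type label, read counts per gene).
GENES = ["GENE_A", "GENE_B", "GENE_C"]
CELLS = [
    ("malignant", [12, 0, 3]),
    ("malignant", [9, 1, 2]),
    ("T cell", [0, 7, 1]),
    ("macrophage", [1, 2, 8]),
]


def make_pseudo_bulk(cells):
    """Sum per-gene reads over all cells and record the known composition."""
    n_genes = len(cells[0][1])
    bulk = [sum(counts[g] for _, counts in cells) for g in range(n_genes)]
    type_counts = Counter(label for label, _ in cells)
    # Assumption: ground truth is the share of cells belonging to each type.
    fractions = {label: n / len(cells) for label, n in type_counts.items()}
    return bulk, fractions


pseudo_bulk, ground_truth = make_pseudo_bulk(CELLS)
print(dict(zip(GENES, pseudo_bulk)))
print(ground_truth)
```

Because every read in the pseudo-bulk sample comes from a labeled cell, the true composition is known in advance, so a deconvolution method's output can be scored directly against it.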
To validate the model in real-life samples, Chu’s team obtained bulk RNA-seq data from 12 whole-blood samples, with ground truth for cell type composition measured by flow cytometry. Using peripheral blood mononuclear cell scRNA-seq data as a reference, BayesPrism achieved more accurate cell type estimates in the bulk sample than other deconvolution methods, demonstrating the model’s robustness in a real-world setting.
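One way to quantify agreement between estimated and measured cell type fractions is sketched below. The article does not say which accuracy metrics the study used, so Pearson correlation and root-mean-square error are assumptions here, and all numbers are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rmse(xs, ys):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Hypothetical fractions of one cell type across four samples.
measured_by_flow = [0.10, 0.20, 0.30, 0.40]   # stand-in for flow cytometry
estimated_by_model = [0.12, 0.18, 0.33, 0.37]  # stand-in for deconvolution output

print(round(pearson(measured_by_flow, estimated_by_model), 3))
print(round(rmse(measured_by_flow, estimated_by_model), 4))
```

A correlation near 1 indicates that the estimates track the measurements across samples, while the error term captures how far off the absolute fractions are; the two can disagree, so reporting both is a reasonable design choice.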
Additionally, the study assessed the model’s performance in inferring gene expression profiles of different cell types. By benchmarking the model with the glioblastoma pseudo-bulk samples, the researchers observed that BayesPrism can accurately predict gene expression in heterogeneous cell types, including both malignant and nonmalignant cell types within the tumor microenvironment.
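To give a sense of what inferring cell type-specific expression from a bulk sample involves, the sketch below splits each gene's bulk reads across cell types in proportion to cell type fraction times reference expression. This is a generic mixture-model step written for illustration, not BayesPrism's actual procedure, which per the article estimates fractions and expression jointly; the function name and all values are hypothetical.

```python
def allocate_reads(bulk_counts, fractions, profiles):
    """Split each gene's bulk reads across cell types.

    bulk_counts: reads per gene in the bulk sample
    fractions:   assumed fraction of each cell type in the sample
    profiles:    per cell type, the relative expression of each gene
    Returns one row per gene with the expected reads from each cell type.
    """
    allocation = []
    for g, count in enumerate(bulk_counts):
        # Weight of each cell type for this gene: fraction * expression.
        weights = [f * profile[g] for f, profile in zip(fractions, profiles)]
        total = sum(weights)
        allocation.append(
            [count * w / total if total > 0 else 0.0 for w in weights]
        )
    return allocation


# Two cell types in equal proportion, two genes with opposite preferences.
example_fractions = [0.5, 0.5]
example_profiles = [[0.8, 0.2], [0.2, 0.8]]
example_bulk = [100, 50]

per_type_reads = allocate_reads(example_bulk, example_fractions, example_profiles)
print(per_type_reads)
```

In this toy case most reads for the first gene are attributed to the first cell type and most reads for the second gene to the second, which is the kind of per-cell-type expression estimate the benchmark compares against the pseudo-bulk ground truth.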
The researchers further applied the model to analyze the proportion of cell types in 1,142 samples from The Cancer Genome Atlas (TCGA) encompassing three tumor types: glioblastoma, head and neck squamous cell carcinoma, and skin cutaneous melanoma. By investigating the cell composition and cell-specific expression profile within the tumor microenvironment, where nonmalignant cells and cancer cells intermix, the study showed that nonmalignant cell types were also correlated with patient survival.
In particular, they found that CD8+ T cells had a strong correlation with survival in skin cutaneous melanoma. Similarly, the proportion of T cells was also associated with better clinical outcomes in head and neck squamous cell carcinoma, while, interestingly, both macrophage content and macrophage cell state appeared to play a role in clinical outcomes across different malignancies.
“I think it’s very exciting,” said Moray Campbell, a cancer biologist at Ohio State University. Although scRNA-seq technology is rapidly expanding, far more tumor samples have been sequenced at the bulk level, he said, and the method described in the study will allow researchers to leverage the vast bulk RNA-seq datasets that are already available to potentially gain additional biological insights.
While the study primarily tested the model on glioblastoma, head and neck squamous cell carcinoma, and skin cutaneous melanoma, Campbell said it will be interesting to see how the method performs in other major cancer types. He said he is interested in trying out the model in his own lab, which primarily focuses on the genomics and epigenetics of prostate and breast cancer, especially on tumor samples derived from genetically engineered mouse models.
However, given that the model is highly sophisticated and might require some computational aptitude, Campbell said it remains to be seen how widely it can be adopted by researchers.
Chu said that in addition to making the model open source, his group is working on building a website portal to allow people who are not familiar with the coding environment to use the model. Additionally, he said he has developed tutorial materials to coach people through running the model.
Because the model, like other deconvolution methods, relies on scRNA-seq data as a reference, its accuracy can be affected by the quality of that reference. An incomplete scRNA-seq reference may therefore leave the model unable to capture all cell types and cell transcription profiles in a heterogeneous sample type, Chu noted.
To overcome this potential pitfall, Chu said the best practice is to always match the sample type of the scRNA-seq reference to that of the bulk RNA-seq data. While his team has found that scRNA-seq data from four patients are sufficient to power the deconvolution model in glioblastoma, Chu emphasized that the more single-cell data are fed into the model, the more likely it is to capture additional cell types, including rare ones.
Chu believes that even though scRNA-seq is becoming more and more available, it will not supersede BayesPrism, as the model’s performance “will not saturate” but rather will continue to improve as more scRNA-seq data become available.
In addition, the rich information stored in existing databases such as TCGA, which also contains DNA sequencing data and immunohistochemistry images for each patient, means that deconvolution models such as BayesPrism will remain necessary for mining existing data. Despite the study's focus on cancer tissues, Chu said, the model is also suitable for other tissues, as long as matching scRNA-seq data are available as a reference.
Moving forward, Chu said his team is developing an improved version of BayesPrism with a streamlined user interface and better memory efficiency in order to make the model more scalable. While the new version is slated for release this month, he said the team also hopes to extend the model to deconvolve spatial transcriptomics data in the future.