NEW YORK – Researchers from the Ontario Institute for Cancer Research have developed an algorithm that can better classify cells from the tumor microenvironment and provide tissue of origin based on single-cell transcriptomics data.
In a talk at the annual meeting of the American Society of Human Genetics, held virtually last week, Ido Nofech-Mozes, a graduate student at the University of Toronto, described how his team developed a tool for the automated annotation of cancerous, immune, and stromal cells in single-cell RNA-seq cancer research.
Manual annotation of these cells is difficult because a "large degree of interpatient heterogeneity in cancer cells makes it hard to find consistent markers that can be used across cancer samples," he said. "Cancer cells tend to cluster by patient and not by cell type, even."
To build the algorithm, Nofech-Mozes and his colleagues used reference data from a paper published last year providing single-cell data on about 53,000 cancer cells from 198 cancer cell lines, representing 22 solid tumor types and 58 cancer subtypes. They also added data generated with CITE-seq (Cellular Indexing of Transcriptomes and Epitopes by Sequencing) from a paper published earlier this year; that dataset comprised about 210,000 blood cells and included cell markers matched with whole-transcriptome data. Finally, they included data on 10,000 stromal cells from four normal tissues in the Human Cell Atlas, including fibroblasts, smooth muscle cells, endothelial cells, oligodendrocytes, and enteric glial cells.
The algorithm itself used differentially expressed genes to select features and was trained using multiple hierarchically organized random forest models.
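As a rough illustration of what hierarchically organized models mean in practice, the sketch below routes each cell through a root model that picks a broad class and then through a child model that picks a finer label. It is not the team's implementation: a simple nearest-centroid classifier stands in for each random forest so the example runs with the Python standard library alone, and the three "marker" features, the toy expression values, and all class and function names are invented for illustration.

```python
from collections import defaultdict
from math import dist


class CentroidClassifier:
    """Nearest-centroid stand-in for a trained model (NOT a random forest)."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        # Centroid = per-feature mean of the training rows for each label.
        self.centroids = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict_one(self, row):
        # Return the label whose centroid is closest in Euclidean distance.
        return min(self.centroids, key=lambda lab: dist(row, self.centroids[lab]))


class HierarchicalClassifier:
    """Root model for broad classes, plus one child model per broad class."""

    def fit(self, X, broad, fine):
        self.root = CentroidClassifier().fit(X, broad)
        self.children = {}
        for cls in set(broad):
            idx = [i for i, b in enumerate(broad) if b == cls]
            # Each child only ever sees cells from its own broad class.
            self.children[cls] = CentroidClassifier().fit(
                [X[i] for i in idx], [fine[i] for i in idx]
            )
        return self

    def predict_one(self, row):
        cls = self.root.predict_one(row)
        return cls, self.children[cls].predict_one(row)


# Hypothetical expression values for three invented marker genes per cell.
X = [
    [9, 1, 1], [8, 0, 2],  # ovarian cancer cells
    [9, 1, 8], [8, 0, 9],  # lung cancer cells
    [1, 9, 1], [0, 8, 2],  # CD8+ T cells
    [1, 9, 8], [0, 8, 9],  # B cells
]
broad = ["cancer cell"] * 4 + ["blood cell"] * 4
fine = (["ovarian cancer cell"] * 2 + ["lung cancer cell"] * 2
        + ["CD8+ T cell"] * 2 + ["B cell"] * 2)

model = HierarchicalClassifier().fit(X, broad, fine)
print(model.predict_one([8, 1, 2]))
print(model.predict_one([1, 8, 9]))
```

One reason such a layout can help is that each child model only has to separate labels within a single broad class, so it faces a narrower problem than one flat classifier covering every label at once.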
"We were able to reduce a large amount of complexity in pan-cancer classification," Nofech-Mozes said. Cells are classified at several levels: the first level determines whether a cell is a cancer cell, blood cell, or stromal cell, while the highest, most granular level assigns labels such as "ovarian cancer cell" or "CD8+ T cell."
The new algorithm's fine-grained classifications improve on previous automatic cell annotation pipelines, which are often limited in their granularity. "Not being able to move past these broad classifications limits how we can use these results in clinical applications," Nofech-Mozes said.
The algorithm is the latest attempt to improve classification of single cells. The University of Toronto team compared their algorithm against three others, including SingleCellNet, a broad single-cell data tool developed at Johns Hopkins University; scmap-cell, a nearest-neighbor classification algorithm that projects a cell onto a reference dataset, developed by researchers at the Wellcome Sanger Institute; and Characterization of Cell Types Aided by Hierarchical Classification, or CHETAH, developed by researchers at the Princess Máxima Center for Pediatric Oncology in the Netherlands, which was designed for use with tumor samples.
The researchers validated their algorithm against a large, curated tumor-derived cell atlas, containing data on more than 1 million cells from 250 patients, representing 14 major cancer types. All cells had annotations from the original study authors. The team calculated F1 scores, which assess the quality of classification by combining precision and recall, and compared their algorithm with SingleCellNet, scmap-cell, and CHETAH.
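For readers unfamiliar with the metric, an F1 score is the harmonic mean of precision and recall, which works out to 2TP / (2TP + FP + FN) for a given class. The sketch below computes it per class from a list of reference annotations and a list of predicted labels. The labels are invented toy data, the function name is hypothetical, and the talk did not detail exactly how the team aggregated its scores, so this shows only the standard definition.

```python
def f1_per_class(true_labels, predicted_labels):
    """Return {class: F1}, where F1 = 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for cls in set(true_labels) | set(predicted_labels):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # By convention here, a class with no true or predicted cells scores 0.
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


# Toy example: six cells, one cancer cell mislabeled as a blood cell.
truth = ["cancer", "cancer", "cancer", "blood", "blood", "stromal"]
predicted = ["cancer", "cancer", "blood", "blood", "blood", "stromal"]

scores = f1_per_class(truth, predicted)
for cls in sorted(scores):
    print(cls, round(scores[cls], 2))
```

In this toy case the single mislabeled cell lowers the score for both classes involved: it counts as a false negative for cancer cells and as a false positive for blood cells.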
Their algorithm scored 0.93 out of 1.00 for cancer cells, greatly outperforming the other algorithms, which Nofech-Mozes said struggled to interpret the interpatient heterogeneity between cells in the testing and reference sets. The algorithm scored 0.99 for stromal cells and 0.89 for blood cells, again beating the other algorithms, though those tools also performed well on these cell types, he noted.
Using their tool, the team was able to reanalyze data on a lung cancer sample from a benchmarking study by the Broad Institute's Aviv Regev and Orit Rozenblatt-Rosen, published in 2020 in Nature Methods. The original analysis was limited to classifying cells as B cells, endothelial cells, epithelial cells, fibroblasts, macrophages, mast cells, or T cells, and it contained multiple epithelial cell clusters. The new tool resolved which epithelial cell clusters were cancerous and identified them as lung cancer cells. The Toronto team was even able to subtype T cells in the study and annotate certain rare dendritic cells, which the original authors had suggested were B cells.
In another example, the new tool was able to trace tissue of origin for metastatic cancer, identifying malignant cells in a liver sample as breast cancer cells.
The algorithm opens up pan-cancer studies at the single-cell level, Nofech-Mozes said, which can help show how tumors develop and highlight new transcriptional programs that could be targeted with novel therapies. The method could also help subtype circulating tumor cells, he suggested.