NEW YORK – An algorithm developed by researchers at MD Anderson Cancer Center can identify features, such as genes, that are key to detecting rare cells present in disease, but which are missed by standard single-cell data analysis — and it already has been used to develop a targeted assay to predict minimal residual disease in patients with acute myeloid leukemia.
The machine learning-based algorithm, called single-cell manifold-preserving feature selection or SCMER, purports to offer a more efficient way to analyze single cells. While methods such as UMAP, short for uniform manifold approximation and projection, and other clustering algorithms have provided an elegant and aesthetically pleasing way to reduce the complexity of single-cell gene expression data, they still rely on differential expression of thousands of genes.
"Some rare cell types, because there are so few, they don’t actually become a cluster by themselves" in UMAP plots, said Ken Chen, head of the lab at MD Anderson that developed SCMER. The new method not only provides a way to make sure rare cells don't get lost but also provides a manageable set of genes with which to identify them.
"There's actually a lot of redundancy in gene expression," he said. "Some of the genes, they're not telling us too much. We squeeze out those redundancies and return only the most unique genes."
SCMER whittles down the list of important genes used to type cells from thousands to tens. In one dataset of myeloma patients, Chen said they were able to get the number of genes required to identify the cells down to about 50 genes from approximately 3,000 genes.
Chen and colleagues described their algorithm in a paper published last month in Nature Computational Science. The team ran the algorithm on eight single-cell datasets, including AML, cancer cell lines, and peripheral blood mononuclear cells, and compared their results against other feature selection methods, including highly expressed genes, highly variable genes, Monocle, and CellSIUS.
"Basically, it tells you what is the smallest set of features that produce a UMAP plot similar to the original," Chen said, "and thus, this set of features are the most important ones to be included in a targeted assay."
Not only does SCMER do this analysis in a single step, it also can analyze thousands of cells in just minutes on an ordinary laptop, said Shaoheng Liang, a graduate student at Rice University, the algorithm's primary developer, and the first author of the paper. And using targeted single-cell sequencing, or other, cheaper testing modalities that aren't as high-plex as transcriptome-wide RNA-seq, could greatly reduce the cost of analyzing rare cells.
"I've already sent the paper to five people today to say, 'Let's take a look at it,'" Gordon Mills, a cancer researcher at Oregon Health & Science University, said last week. Mills has collaborated with Chen before but was not involved in the SCMER paper. "If you look at other algorithms out there, in terms of single-cell RNA analysis, none of them have the suite of features and concepts implemented in this particular algorithm."
The algorithm may not even be limited to working with single-cell data and could be used to analyze "bulk RNA expression, copy number aberration, and genetic and drug screening data in large cohort studies such as The Cancer Genome Atlas" and the Genotype-Tissue Expression database, the authors wrote.
In fact, SCMER was developed to help link cytometry by time-of-flight, or CyTOF, with single-cell DNA sequencing in a study of patients with AML.
Chen said development of SCMER began about a year ago. The researchers' motivation was practical. Colleagues at MD Anderson including postdoc Muharrem Muftuoglu were trying to design a new single-cell assay analyzing protein and DNA together to measure minimal residual disease in AML.
Muftuoglu was looking to link CyTOF and single-cell DNA sequencing data, as cells could be selected for both analyses with surface-marker antibodies. "CyTOF enables us to identify leukemia-specific signaling profiles, and single-cell DNA sequencing enables us to define different leukemia clones having distinct sets of mutations," he said.
Linking these readouts "can give us the opportunity to assess whether different leukemia clones, defined by the presence of different combinations of mutations, have differential enrichment of signaling pathways or unique proteomic profiles defined by CyTOF," he said. "Indeed, this algorithm extracted the surface features defining the subpopulations in the AML ecosystem."
But single-cell transcriptomics data were "naturally an even more exciting question, both biologically and computationally. Thus, we started testing both when the method was implemented," Liang said. The current state of the art in that field is Seurat, a software package that includes clustering algorithms and data visualization for single-cell analysis developed by Rahul Satija of New York University and the New York Genome Center.
Seurat uses a two-step process that clusters cells using UMAP, a dimensionality reduction algorithm, and then performs differential gene expression analysis between clusters, Chen said. SCMER, by contrast, replaces this process with a single step. "We just tell you what are the genes that differ," he said. "We don't even need to find out what are the clusters. We directly measure the diversity of the entire data set."
The algorithm turns a biological question — "What's the fewest number of genes needed to capture the cellular diversity of a sample?" — into a technical one. "We ask, 'Can we produce the same UMAP plot with the same number of genes or fewer?'" Chen said. "We have a metric which helps us measure how similar the new one is to the original."
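One way to make such a similarity metric concrete is to check how many of each cell's nearest neighbors survive when only a subset of genes is kept. The Python sketch below is illustrative only and is not taken from the SCMER paper or code. The function names `knn` and `neighbor_overlap`, the toy data, and the choice of neighbor overlap as the score are all assumptions made for this example.

```python
import math
import random


def knn(points, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Sort all other points by Euclidean distance (ties broken by index).
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def neighbor_overlap(cells, kept_genes, k=5):
    """Fraction of each cell's k nearest neighbors (using all genes) that are
    still among its k nearest neighbors when only kept_genes are used."""
    full = knn(cells, k)
    reduced = knn([[row[g] for g in kept_genes] for row in cells], k)
    shared = sum(len(a & b) for a, b in zip(full, reduced))
    return shared / (k * len(cells))


# Toy data (hypothetical): 30 cells x 6 genes. Genes 0 and 1 separate two
# groups of cells; genes 2 to 5 carry only low-level noise.
random.seed(0)
cells = []
for i in range(30):
    center = 5.0 * (i % 2)
    informative = [center + random.gauss(0, 0.5) for _ in range(2)]
    noise = [random.gauss(0, 0.05) for _ in range(4)]
    cells.append(informative + noise)

print("all genes:        ", neighbor_overlap(cells, list(range(6))))
print("informative genes:", neighbor_overlap(cells, [0, 1]))
print("noise genes:      ", neighbor_overlap(cells, [2, 3]))
```

A score of 1.0 means every cell keeps exactly the same neighbors. In this toy setup the two informative genes should score well above the two noise genes, which is the kind of comparison a manifold-preserving gene panel is meant to win.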
In an iterative approach, SCMER generates new plots on ever smaller sets of genes. Then, using a gradient descent method, the algorithm settles on a minimal set of genes. "This allows us to start from any random set of genes but very quickly converge into a locally optimized set," he said. Do this enough times, and eventually you get your answer.
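The general idea of using gradient descent to shrink a gene list can be sketched with a much simpler stand-in objective. The Python example below learns one non-negative weight per gene so that weighted pairwise distances between cells match the distances computed from all genes, while a penalty pushes uninformative weights to exactly zero. This is a toy sketch, not the objective or optimizer used in SCMER; the function names, the penalty `lam`, the step size, and the simulated data are all assumptions, and unlike the real problem this toy objective has a single optimum.

```python
import random
from itertools import combinations


def pairwise_sq_diffs(cells):
    """For every pair of cells, list the squared difference in each gene."""
    return [[(a - b) ** 2 for a, b in zip(ci, cj)]
            for ci, cj in combinations(cells, 2)]


def select_genes(cells, lam=0.05, step=0.2, iters=300):
    """Toy sparse gene weighting by projected gradient descent.

    Minimizes mean((full_distance - weighted_distance)^2) + lam * sum(weights)
    subject to weights >= 0. Genes whose weight ends at 0 are dropped.
    """
    diffs = pairwise_sq_diffs(cells)
    targets = [sum(row) for row in diffs]  # squared distances using all genes
    n_genes = len(cells[0])
    weights = [0.0] * n_genes
    for _ in range(iters):
        # Residuals are computed once per iteration, so all weights are
        # updated simultaneously from the same gradient.
        residuals = [t - sum(w * d for w, d in zip(weights, row))
                     for t, row in zip(targets, diffs)]
        for g in range(n_genes):
            grad = -2.0 * sum(r * row[g] for r, row in zip(residuals, diffs)) / len(diffs)
            # Gradient step including the penalty, then clamp at zero.
            weights[g] = max(0.0, weights[g] - step * (grad + lam))
    return weights


# Toy data (hypothetical): 20 cells x 6 genes. Genes 0 and 1 separate two
# groups of cells; genes 2 to 5 are low-level noise.
random.seed(0)
cells = []
for i in range(20):
    group = i % 2
    signal = [group + random.gauss(0, 0.1) for _ in range(2)]
    noise = [random.gauss(0, 0.03) for _ in range(4)]
    cells.append(signal + noise)

weights = select_genes(cells)
selected = [g for g, w in enumerate(weights) if w > 0]
print("weights: ", [round(w, 3) for w in weights])
print("selected:", selected)
```

The penalty strength plays the role of a dial on panel size: a larger `lam` zeroes out more genes, while a smaller one keeps more of them.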
"On a dataset with 10,000 cells and 2,000 candidate features, [SCMER] typically converges in 20 to 40 iterations, which takes five to 10 minutes" using an off-the-shelf Intel Core i7-8700 processor, the authors wrote. Using a GPU could easily cut that time in half.
Liang noted that the algorithm was designed to be scalable. In one experiment, they analyzed 40,000 cells, representing 10 patient samples. Twice as many patients would need approximately twice the memory, he said.
Chen said he has considered commercializing SCMER but hasn't taken action on that. "I probably should have considered it a little more," he said. "Maybe it's not too late yet."
Among the most important aspects of the algorithm is its recall, Mills said, providing an ability to pull out the same sets of genes from run to run. "In most of the classic algorithms, if you run it 20 times, you get 20 different sets of genes," all of which can predict a given phenotype, he said. SCMER, however, seems to focus on the same genes each time. "Those are much more likely to be useful as one moves into orthogonal assays," he said. "If the number of genes to analyze is truly tractable, we could move it into the clinic."
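Run-to-run consistency of this kind can be quantified with a simple set-overlap score. The Python sketch below uses the mean pairwise Jaccard index across runs; the choice of metric, the function names, and the placeholder gene labels are assumptions for illustration and do not come from the SCMER paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average Jaccard index over every pair of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Placeholder gene labels, not real results.
stable_runs = [["G1", "G2", "G3"], ["G1", "G2", "G3"], ["G1", "G2", "G4"]]
unstable_runs = [["G1", "G2", "G3"], ["G4", "G5", "G6"], ["G7", "G8", "G9"]]

print("stable:  ", mean_pairwise_jaccard(stable_runs))
print("unstable:", mean_pairwise_jaccard(unstable_runs))
```

A value near 1 means the method returns nearly the same panel each time, while a value near 0 corresponds to the "20 different sets of genes" situation Mills described.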
Chen, Liang, Mills, and Muftuoglu all highlighted the potential to develop assays using gene sets returned by SCMER. "Large studies could find a smaller set of genes so they can do things more efficiently," Liang said, adding that smaller gene sets could also allow the design of cheaper follow-up studies.
The ability to detect rare cells could provide new insights into cancers. "There are cells which express certain RNAs or protein during their course of development," Chen said. "For example, cancer cells can express genes before, or during, metastasis. But once they establish a colony, they turn them off."
SCMER could help find those cells in transition, which "are very difficult to identify with any previous algorithm," Mills said. "One of our beliefs is that transient states are unstable by definition and therefore much more therapeutically tractable than a cell in a viable stable state." Using SCMER in conjunction with drug or CRISPR screens could help find new ways to attack cancer cells.
Chen and Liang also suggested SCMER could be useful for researchers using MERFISH, short for multiplex error-robust fluorescence in situ hybridization, a spatial gene expression method developed by researchers at Harvard University and being commercialized by Vizgen.
"You just cannot use all the genes [in the transcriptome]. You're limited by the number of genes you can detect," Liang said. SCMER could help identify which genes are most informative for a particular study. Chen added that it could even be tuned to help design probes that best capture spatial heterogeneity, not just differences in gene expression.
Mills pointed to one limitation noted by the authors, which is that, so far, SCMER "does not provide an explicit mapping from one modality to another."
"We meant that SCMER does not tell which genes are the best predictor of a particular protein" in CITE-seq data, Liang said. "Thus, the result is not an explicit mapping between proteins and genes, but a set of genes that makes the data look like the whole protein profile, in terms of the manifold."
"I would love to see an analysis using the single-cell RNA technology, perhaps directly compared to, say, CyTOF analysis of immune cells, such as exhausted T cells. That would really say, here are the populations of cells that have multiple markers and that we know are functional," Mills said. "There are studies that do that, but not really on a single-cell, multiplexed marker basis to define some of these important cell types."
The datasets needed to perform such studies "really don't exist," he said. "But I think that's the next step."