NEW YORK – Single-cell omics data is mushrooming, but making sense of that information is still a challenge, according to Dana Pe'er, chair of computational and systems biology at Memorial Sloan Kettering Cancer Center's Sloan Kettering Institute in New York.
For example, many single-cell sequencing studies in cancer patients don't have large enough cohorts, Pe'er said during a webcast keynote lecture at the joint Intelligent Systems for Molecular Biology and European Conference on Computational Biology (ISMB/ECCB) conference in Lyon, France, on Monday.
"Right now, most of the data out there is badly designed in small cohorts," Pe'er said. "We need to overcome that, get bigger cohorts that are better designed, and then the discoveries will come," she added in response to an audience question.
Pe'er said that some machine-learning methods in computational biology are wrongly benchmarked on averages, or what she called "boring populations," where nothing is particularly abnormal, and so often miss rare, metastatic cells.
"When you develop computational biology, you can't take machine learning off the shelf," Pe'er advised. "You really have to think, am I capturing the rare cells? Many of the methods would treat these as outliers, and that's where all the biology is."
Pe'er, who accepted the International Society for Computational Biology's annual Innovator Award during the keynote session, provided an overview of the hundreds of pseudotime algorithms she has seen over the last 22 years. Recalling a presentation she saw at the 2001 ISMB conference, Pe'er said she has known for that long that the way to figure out molecular influences is to understand statistical dependencies.
"If we treat each individual cell as an example, then we can actually learn molecular networks," she said, noting that she was doing postdoctoral work in this area back in 2005. "The idea of the power of single cells and treating each individual cell as a sample, giving us enough samples to learn, for example, patient-specific disease networks, really changed my [career] trajectory."
Until mass cytometry came along in the late 2000s, there was no good way to analyze single-cell data, according to Pe'er. A paper she authored with Stanford University's Garry Nolan and others in 2011 was among the first to describe hematopoietic stem cell development as a continuum.
"Cell fate" is now seen as a continuous process in computational biology, Pe'er explained. She noted that hematopoietic stem cells are far less common than T cells. "If you use standard machine learning to sample cells randomly, you're going to get these biased samplings in these very, very dense and boring regions of your phenotypic manifolds," she said. "You're going to actually miss anything that's important."
That led computational biologists down the path of minimum-maximum sampling to find cells that are most different from others.
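The sampling idea can be illustrated with a short sketch. This is not code from Pe'er's lab. It is a minimal, generic version of greedy max-min sampling (also known as farthest-point sampling), which repeatedly picks the cell farthest from everything chosen so far. The function name `max_min_sample`, the two-dimensional toy coordinates, and the cluster sizes are all assumptions made up for illustration.

```python
import math
import random

def max_min_sample(cells, k, start=0):
    """Greedy max-min (farthest-point) sampling; returns k indices into `cells`."""
    if not 1 <= k <= len(cells):
        raise ValueError("k must be between 1 and the number of cells")
    chosen = [start]
    # Distance from each cell to its nearest already-chosen cell.
    nearest = [math.dist(c, cells[start]) for c in cells]
    while len(chosen) < k:
        # Next pick: the cell farthest from everything chosen so far.
        nxt = max(range(len(cells)), key=nearest.__getitem__)
        chosen.append(nxt)
        nearest = [min(d, math.dist(c, cells[nxt])) for d, c in zip(nearest, cells)]
    return chosen

# Toy data (made up): 1,000 cells in one dense cluster plus 3 rare, distant cells.
rng = random.Random(0)
dense = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(1000)]
rare = [(5.0, 5.0), (-5.0, 5.0), (0.0, -6.0)]
cells = dense + rare

picked = max_min_sample(cells, k=4)              # spread-out sample
random_pick = rng.sample(range(len(cells)), 4)   # uniform random sample
print("max-min picks:", picked)
print("random picks: ", random_pick)
```

In this toy setup, the three picks after the starting cell are the rare cells (indices 1000 to 1002), whereas a uniform random draw of four cells from the same data would usually contain none of them.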
Pe'er and former colleagues at Columbia University eventually developed Wanderlust, an algorithm that maps the likely development continuum of stem cells in individuals with pediatric leukemia. They described that software in a 2014 paper in Cell.
"This allowed us to order the cells with such accuracy that we could pick up a really rare population, three in 10,000 cells," Pe'er explained.
By 2019, she had worked with Manu Setty, who is now at Fred Hutchinson Cancer Center in Seattle, to create Palantir, an algorithm that evaluates plasticity to predict cell fates, which they described in a paper in Nature Biotechnology.
Algorithm development has accelerated since then.
Working with researchers at Cornell University and Rockefeller University, Pe'er and colleagues helped develop BayesPrism, a Bayesian statistical model that can jointly impute cell type composition and cell type-specific gene expression from bulk RNA sequencing data using single-cell RNA sequencing data as a reference.
CellRank, which Pe'er also had a hand in, came out last year to assist in single-cell fate mapping. Just last week, lead CellRank developer Fabian Theis and colleagues at the Institute of Computational Biology at Helmholtz Munich in Neuherberg, Germany, unveiled a method called CellRank 2 in a preprint posted to BioRxiv. This update enables the study of cellular fate with large-scale single-cell data.
Another advancement is in understanding nonlinear cell development, because linear analysis fails when there is "derailment," according to Pe'er. Acute myeloid leukemia is a disease of derailment, for example, she said.
Also missing have been methods to parse spatial single-cell data, she said. Doron Haviv, one of her students, has addressed the spatial aspect with an autoencoding technique called environmental variational inference, or ENVI. Haviv was scheduled to present a paper on this topic at ISMB/ECCB on Tuesday, based on a preprint unveiled in April.
"One of the fun things about single-cell data is, there are entirely new factors to be discovered," Pe'er said. "Everything we can explain with our known factors, there's still a lot of residual."