New CZI Virtual Cell Model Combines Language, Gene Expression Training

NEW YORK – Researchers at the Chan Zuckerberg Initiative earlier this week launched a new artificial intelligence-based virtual cell model that has been trained on textual descriptions of gene networks as well as single-cell transcriptomics data.

The model, dubbed scGenePT, combines approaches taken by two existing foundational virtual cell models: scGPT, developed by Bo Wang's group at the University of Toronto, and GenePT, from researchers at Stanford University. Ana-Maria Istrate, senior research scientist at CZI, led the team that developed scGenePT, and the researchers posted a preprint of their work to bioRxiv in October.

The CZI researchers began by training a model on single-cell gene expression data in the manner of scGPT, which can serve as a basis for predicting cell type annotations and help normalize data. They then added text-based data in the form of National Center for Biotechnology Information (NCBI) gene card and UniProt protein summaries, an approach also taken by GenePT, along with Gene Ontology gene function annotations obtained through UniProt.
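For readers who want a sense of what fusing these modalities could look like, the sketch below shows one plausible way a frozen, text-derived gene embedding might be added to a learned gene-token embedding before a transformer layer. This is not CZI's implementation; the names, dimensions, and additive fusion choice are illustrative assumptions.

```python
# Minimal sketch (not the scGenePT code) of fusing a text-derived gene embedding
# with a learned gene-token embedding. All names and dimensions are assumptions.
import torch
import torch.nn as nn

n_genes, d_model = 1000, 256

# Learned gene-token embeddings, as in scGPT-style expression models.
gene_token_emb = nn.Embedding(n_genes, d_model)

# Precomputed text embeddings for each gene (e.g., from gene summaries run
# through a language model), here stand-in random values, projected to d_model.
text_emb = torch.randn(n_genes, 768)
text_proj = nn.Linear(768, d_model)

def embed_genes(gene_ids: torch.Tensor) -> torch.Tensor:
    """Combine expression-model and language-derived gene representations."""
    learned = gene_token_emb(gene_ids)        # (batch, seq, d_model)
    textual = text_proj(text_emb[gene_ids])   # (batch, seq, d_model)
    return learned + textual                  # simple additive fusion (one plausible choice)

# Example: a batch of 2 cells, each represented by 16 gene tokens.
gene_ids = torch.randint(0, n_genes, (2, 16))
print(embed_genes(gene_ids).shape)            # torch.Size([2, 16, 256])
```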

"A lot of foundation models use one modality," Istrate said, namely gene expression counts from single-cell RNA-seq. "But there's a whole other realm of info you have about genes, published in the research literature. The question we had was, 'Can you use that?' We found that, yes, it's possible … incorporating this prior knowledge might help us improve performance," she said, suggesting that the ceiling for performance on particular tasks could be higher than previously thought.

ScGenePT joins a growing list of so-called "foundational" AI models trained on large amounts of biological data that can then be used to generate predictions. Tools like scGPT and Geneformer have been trained on millions of single-cell gene expression profiles. When fed new data, they can use that training to perform various tasks, such as annotating cell types or simulating the effects of gene knockouts on transcriptome-wide expression.
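As a rough illustration of the cell type annotation task, the sketch below treats placeholder vectors as cell embeddings produced by a pretrained single-cell model and fits a simple classifier on labeled reference cells. The random data, label counts, and classifier choice are assumptions for demonstration only.

```python
# Illustrative sketch: using embeddings from a pretrained single-cell model
# (random stand-ins here) to annotate cell types with a simple classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

ref_embeddings = rng.normal(size=(500, 64))    # 500 labeled reference cells
ref_labels = rng.integers(0, 3, size=500)      # 3 hypothetical cell types
query_embeddings = rng.normal(size=(100, 64))  # 100 unannotated cells

clf = LogisticRegression(max_iter=1000).fit(ref_embeddings, ref_labels)
predicted_types = clf.predict(query_embeddings)
print(predicted_types[:10])
```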

Using text to predict cellular gene expression patterns has been tried before, notably by GenePT, which draws on concepts similar to those behind ChatGPT. ScGenePT's algorithm incorporates this text-derived information into the gene expression-based model as prior knowledge, said Christina Theodoris, a researcher at the Gladstone Institutes who developed the Geneformer AI model, which is similar to scGPT. "This allows the model to start from a baseline that is informed by prior research on gene functions."

For in silico perturbation experiments, Istrate's team found that text alone was not as powerful as single-cell gene expression data alone, but that including it helped AI models outperform other models that had "hard-coded" biological knowledge. They benchmarked scGenePT against GEARS (graph-enhanced gene activation and repression simulator) from Jure Leskovec's lab at Stanford, a deep-learning model that predicts the effects of gene perturbations based on gene regulatory network graphs.
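Benchmarks of this kind typically compare predicted post-perturbation expression profiles against held-out measurements. The sketch below illustrates one commonly reported metric, mean squared error restricted to the top differentially expressed genes, using random placeholder data rather than anything from the scGenePT or GEARS evaluations.

```python
# Sketch of a common perturbation-prediction metric: MSE of predicted
# post-perturbation expression on the top differentially expressed genes.
# All values below are random placeholders.
import numpy as np

rng = np.random.default_rng(1)
n_genes = 2000
control_mean = rng.normal(size=n_genes)
true_perturbed = control_mean + rng.normal(scale=0.5, size=n_genes)
predicted = true_perturbed + rng.normal(scale=0.2, size=n_genes)

# Top-20 differentially expressed genes by absolute change from control.
top_de = np.argsort(-np.abs(true_perturbed - control_mean))[:20]
mse_top_de = np.mean((predicted[top_de] - true_perturbed[top_de]) ** 2)
print(f"MSE on top-20 DE genes: {mse_top_de:.4f}")
```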

Specifically, language helps most in cases where the model has to predict the effect of perturbing two genes at once, neither of which was seen during training.
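The sketch below shows how such an evaluation split could be constructed, holding out only two-gene combinations in which neither gene appears in any training perturbation. The gene names and combinations are invented for illustration.

```python
# Sketch of the hardest evaluation setting described above: keep only test
# combinations in which neither gene was perturbed during training.
train_perturbations = [("KLF1",), ("GATA1",), ("KLF1", "GATA1"), ("TP53",)]
candidate_test_combos = [("MYC", "JUN"), ("KLF1", "MYC"), ("FOS", "JUN")]

seen_genes = {g for pert in train_perturbations for g in pert}

# "Two-gene, zero-seen" combos: neither gene appears in any training perturbation.
unseen_unseen = [combo for combo in candidate_test_combos
                 if not any(g in seen_genes for g in combo)]
print(unseen_unseen)   # [('MYC', 'JUN'), ('FOS', 'JUN')]
```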

ScGenePT is available for researchers to use through CZI's Virtual Cell platform, launched earlier this week. The platform includes AI cell models developed by Istrate and other CZI researchers, as well as other leading models such as scGPT. "Researchers can use the initial scGenePT and other models for biological tasks, such as predicting protein localization, annotating cell types, and integrating multiple batches of data," CZI said in a statement.

CZI also issued a request for proposals to build new foundational AI models using its graphics processing unit cluster, which Istrate used to build and train scGenePT.

With language proving to be a useful complement to gene expression data in building better cell models, Istrate suggested that other data types could also boost performance. "We haven't done experiments with this, but you could include protein information for protein-coding genes" or even add imaging data, she said. "If you can get a representation of a gene from a specific modality, whether images or protein, you can think about incorporating it," she said.