AI Models Are Transforming Genomics Research, and Virtual Cells Are Just the Beginning


NEW YORK – Alexander Bick was initially skeptical that artificial intelligence algorithms would work in his rare disease research, but he felt he had little choice but to try them out.

One of the diseases he studies is RUNX1 familial platelet disorder, a blood condition observed in just a couple hundred people in the US. Pathogenic mutations in the transcription factor RUNX1 disrupt hematopoietic stem cell differentiation, resulting in a higher risk of blood cancers, among other symptoms such as prolonged bleeding. His lab at Vanderbilt University Medical Center and its collaborators had recently turned to single-cell studies of the disease, looking for differentially expressed genes that might suggest drug targets.

However, not only were there very few patients to draw samples from, but those samples could only be obtained by a painful bone marrow biopsy. "When you use [differential gene expression] for a small number of samples, the results are just not reliable," Bick said. "The differences you see are not what you're interested in."

Enter: Geneformer. At a November 2023 meeting hosted by the Chan Zuckerberg Initiative, which was funding his work on RUNX1, Bick saw a presentation by Christina Theodoris, a researcher at the Gladstone Institutes who had created one of a new breed of AI tools that were fed massive amounts of biological data, in the same way ChatGPT has "read" all of the internet. Geneformer had been trained on every public human single-cell gene expression dataset that Theodoris had access to in 2021 — from about 30 million cells. CZI then further trained, or "fine-tuned," the AI model with a subset of data from its CZ CellxGene Census, a package of tens of millions of single-cell transcriptomes.

Among other capabilities, Geneformer can distinguish between cell states, including healthy and diseased. It can also simulate the effect of up- or downregulating, or even knocking out, a particular gene and predict whether that change makes a diseased cell look more like a healthy one, or vice versa. In a May 2023 Nature paper introducing the model, Theodoris showed how in silico experiments with the tool identified hundreds of genes whose loss was predicted to cause a shift from healthy to diseased states in cardiomyocytes.
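
Conceptually, such an in silico knockout removes a gene's token from a cell's rank-ordered input, re-embeds the cell, and asks whether the perturbed embedding moves toward the healthy or the diseased region of the model's latent space. The sketch below illustrates that logic only; the encoder, token IDs, and state centroids are random placeholders rather than the actual Geneformer model or its gene dictionary.

```python
import torch

# Stand-in for a pretrained single-cell transformer such as Geneformer:
# a random embedding table mean-pooled over a cell's gene tokens.
class ToyCellEncoder(torch.nn.Module):
    def __init__(self, vocab_size=20_000, dim=256):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)

    def forward(self, gene_token_ids):
        return self.embed(gene_token_ids).mean(dim=0)  # one vector per cell

encoder = ToyCellEncoder()

# A cell represented as its genes ranked by expression, encoded as token IDs
# (hypothetical IDs; a real run would use the model's own gene-token dictionary).
cell_tokens = torch.tensor([101, 57, 2042, 7, 911, 3333, 15])
knockout_token = 2042  # the gene to delete in silico

# Reference embeddings for healthy and diseased states (random here; in practice,
# mean embeddings of cells annotated as healthy or diseased).
healthy_centroid = torch.randn(256)
diseased_centroid = torch.randn(256)

def state_score(tokens):
    """Similarity to the healthy state minus similarity to the diseased state."""
    emb = encoder(tokens)
    return (torch.cosine_similarity(emb, healthy_centroid, dim=0)
            - torch.cosine_similarity(emb, diseased_centroid, dim=0))

baseline = state_score(cell_tokens)
perturbed = state_score(cell_tokens[cell_tokens != knockout_token])  # delete the gene's token

print(f"Shift toward healthy state after knockout: {(perturbed - baseline).item():+.3f}")
```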

Bick and his collaborators considered what Geneformer could do to help them study hematopoietic stem cells and decided to try it out. "The field of single-cell computational methods is moving and developing so rapidly that I am just generally skeptical of all new methods until we try them in our hands," he said. "So, it was a general sense of 'How much could this tool actually help me execute my science?'"

A year later, the researchers are able to run in silico genome-wide perturbation studies in about 24 hours with the help of graphics processing units (GPUs), hardware that accelerates the use of the AI model. "To do that in hematopoietic stem cells, which are not easy to come by, would be hundreds of thousands of dollars and many, many, many months," Bick said. Using CZI's fine-tuned Geneformer model, on the other hand, was free, and not just because he is a CZI grantee.

Not only is Bick's team getting lists of genes to target in wet lab experiments, "we're finding things that are different from if we'd just looked at differential gene expression," he said, and the predictions are looking pretty good.

"We're starting to see really exciting results," he said. "The success rate is not 100 percent, not even 50 percent, but for every five genes, one seems to be working in an experimental system," he said. Collaborators are taking existing drugs that target those genes and finding that they can change the state of an in vitro hematopoietic cell. "These are genes we wouldn't have thought to test without these models," he said.

'Get on the train'

Simulated perturbation experiments are just one of the many uses of Geneformer — it can also annotate cell types and predict genes central to a gene network — and Bick is just one of more than 30,000 users who have downloaded the model. Geneformer itself is just one of many new "foundational" AI models trained on heaps of single-cell gene expression data, and researchers are training other foundational models across genomics.

The AI era has definitively arrived in the life sciences, and some scientists are hoping it will herald a grand unifying vision of cellular and molecular biology, perhaps a squishier version of what particle physicists enjoy with their Standard Model. In a preprint posted to arXiv last month, a host of high-profile genomics researchers, including Head of Genentech Research Aviv Regev, CZI Head of Science Stephen Quake, and the University of Washington's Jay Shendure, joined by AI tool pioneers such as Theodoris, outlined their vision for how AI could help build "virtual cells" that could generate "universal representations of biological entities across scales … facilitating interpretable in silico experiments to predict and understand their behavior using virtual instruments."

"I think of this like learning to speak 'computation with a biology accent,' or 'biology with a computational accent,'" Regev told GenomeWeb. "This trend is already naturally occurring in early-career researchers, and as the traditional boundaries between fields continue to erode, we will have more creativity and discovery." 

The use of AI will even change how biological science is conducted, from hypothesis generation to data analysis. "Bottom line: Over the next decade, we'll see biology change from being 90 percent experimental and 10 percent computational to 80 percent computational and 20 percent experimental," Quake said.

There are potential drawbacks, though. "We may have to forgo our ability to build fully mechanistic models," the arXiv preprint authors wrote, noting that such models have been "one of the hallmarks of scientific discovery in biology."

"There are so many things for which a cell avatar and cell oracle can be useful and impactful, even if it lacks in other ways," Regev said. "Just like running genetic screens with cells and animals does not give a direct mechanism but tells us a lot about biology, a well-performing virtual cell can teach us a lot, and then for other purposes, we can use other approaches."

To fully realize the promise of AI, even more data on cellular behavior of all types — epigenetic, functional, interactional — are needed, experts say. Whether those data can be taken from researcher-directed, hypothesis-driven studies, or if they need to be purpose-generated to optimize their utility for AI models, isn't clear. And, as in other fields, AI threatens to take over some of the mid-level work assigned to researchers in training. However, the potential rewards may be irresistible.

"Woe to those who ignore it," said Garry Nolan, a researcher at Stanford University who has used AI tools in his own lab. He has cofounded a startup, Cellformatica, that uses ChatGPT-like AI to generate hypotheses based on uploaded data, including outlining the experiments one might need to test them — until now, the task of human scientists.

"It's inevitable. I don't know what else to say, except get on the train before you're left at the station," he said. "And it's moving so fast. Every other week, I feel like the work we've done has been enabled with another tool."

Other AI-based cell models that Bick and his team could have used include scGPT from Bo Wang's lab at the University of Toronto; scBERT, a model from researchers at China's Tencent AI Lab and Shanghai Jiao Tong University, which takes the same fundamental approach as Google's bidirectional encoder representations from transformers (BERT) model; single-cell Variational Inference (scVI), a model developed by Nir Yosef's lab at the University of California, Berkeley; and Universal Cell Embeddings (UCE), developed by researchers in Jure Leskovec's and Quake's labs at Stanford in collaboration with CZI.

The emergence of these models represents the "biological data revolution and AI model revolution coming together," Quake said. "It couldn't have been done very long ago." It also means that what's happening with virtual cells isn't much different from what's going on elsewhere in the world. "It's very much riding on the coattails of the AI revolution," he said, such as text analysis and image generation.

The transformer AI architecture that large language models (LLMs) like ChatGPT are based on was introduced in 2017 by researchers at Google, leading directly to BERT, OpenAI's GPT-4, and other "foundation" AI models, which are trained on a broad set of data and can respond to a wide range of queries. Moreover, they can be refined with additional data to take on more narrow tasks.

These have enabled ChatGPT and other "generative" AI tools that have driven news headlines over the past couple of years and provided punchlines and wacky illustrations for scientific conference presentations. Geneformer is also based on a transformer architecture and is a generative AI; however, instead of a list of questions to ask at a panel discussion, for example, it might produce a gene expression profile for a cell without a key gene.

Broadly speaking, LLMs are well suited for genomics, said George Vacek, global head of genomics alliances at Nvidia, whose GPUs are often used to make training and use of AI models faster. "DNA is the language of life, with nucleotides encoding information, so LLMs can use an analogous approach for studying biological problems," he said.

LLMs, foundational models, and generative AI all fall under deep learning, a subset of machine learning distinct from the methods behind classifier models such as random forests, which are currently applied in diagnostics and other clinical fields.

"We've seen amazing success in proteins with LLMs," Quake said. "There's been seminal, huge impact on protein design and understanding structure. It has raised our expectations that hopefully we can do something in the world of cells." Earlier this month, two Google DeepMind researchers won shares of the Nobel Prize for chemistry for their work on AlphaFold, an AI algorithm that predicts protein structure based on amino acid sequence.

DNABERT is another LLM for genomics, trained on the human reference genome sequence. "It really understands genetic sequence," Vacek said, adding that it's helpful for tasks such as identifying functional variants. As with virtual cell models, there are many flavors of DNA LLMs, including GROVER, from Anna Poetsch's lab at the Dresden University of Technology in Germany, and regLM, developed by Genentech. Applications include designing sequences, such as promoters and enhancers, and predicting the fitness of a variant.

As key as transformer models and GPU acceleration have been to developing foundation AI models, they're not powerful unless they have enormous amounts of data to train on. Geneformer is a special case as it was trained "from scratch" on data from 30 million single cells. "That was all the publicly available data we could identify at that time," Theodoris said. More recently, she has retrained the model on approximately 100 million cells. With a broad training base, others can now use smaller amounts of their own data to fine-tune the model for specific applications.

The fine-tuning can be done multiple times. The specific tool Bick used was CZI's version of Geneformer, which was further trained on CZ CellxGene data via the Census, a collaboration with bioinformatics firm TileDB that packages the data in a format that is easier to feed into AI models. Bick then fine-tuned the CZI Geneformer model again with gene expression data from 10,000 cells that his team had analyzed. CZI has trained other AI models with the 70 million-cell Census, including scGPT, UCE, and scVI, and provides access to these tools for free to interested researchers as part of its philanthropic mission.
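
For researchers attempting the same kind of fine-tuning, the general recipe follows standard Hugging Face practice: load the pretrained checkpoint with a fresh classification head and train briefly on labeled, tokenized cells. This is a minimal sketch, assuming the cells have already been converted to Geneformer-style gene token IDs and padded to equal length; the toy token IDs and labels are illustrative, and the checkpoint name and repo layout should be verified on the Hugging Face hub.

```python
from datasets import Dataset
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Toy stand-in for cells already tokenized into Geneformer-style gene token IDs,
# padded to equal length; labels: 0 = healthy, 1 = diseased.
tokenized_cells = Dataset.from_dict({
    "input_ids": [[101, 57, 2042, 7, 0, 0], [88, 3333, 15, 911, 0, 0]],
    "attention_mask": [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 0, 0]],
    "labels": [0, 1],
})

# Load the pretrained checkpoint and attach a newly initialized two-class head.
model = BertForSequenceClassification.from_pretrained(
    "ctheodoris/Geneformer",  # public checkpoint name; verify the current repo layout
    num_labels=2,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="geneformer_finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized_cells,
)
trainer.train()
```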

Some models, such as UCE, are designed to work without fine-tuning, an approach called zero-shot learning. "You can embed whatever data you want," Quake said. That means it can handle cell types from organisms that it has never seen before — say, octopus — and still perform reasonably well. "Hopefully, that sets the standard for other people making models going forward," Quake said.

Once a foundational AI model has been trained with the relevant data, it needs a task to perform. In silico perturbation experiments are one powerful use case for virtual cell models, but these models can take on several other useful tasks.

Cell typing is a major strength of Geneformer, UCE, and scVI, according to Ambrose Carr, director of product management for data at CZI. "It's helpful to say, 'this is a lymphocyte' versus 'this is a fibroblast,'" he said. Predictions "are usually not perfect," Carr said, "but a reasonable prediction of what kind of cells and biology you're seeing in your sample is really helpful and expedites the process of understanding what your data are saying."
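
One common way to turn such embeddings into cell type calls, whichever foundation model produced them, is nearest-neighbor label transfer from an annotated reference. This is a generic sketch rather than any one tool's built-in classifier, and the embedding arrays are random stand-ins for vectors a pretrained model would produce.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Random stand-ins for 256-dimensional cell embeddings from a pretrained model:
# an annotated reference atlas and an unannotated query dataset.
reference_embeddings = rng.normal(size=(5_000, 256))
reference_labels = rng.choice(["lymphocyte", "fibroblast", "cardiomyocyte"], size=5_000)
query_embeddings = rng.normal(size=(1_000, 256))

# Transfer labels to the query cells by majority vote among nearest reference cells.
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(reference_embeddings, reference_labels)
predicted_types = knn.predict(query_embeddings)

# Per-cell confidence: the fraction of neighbors agreeing with the assigned label.
confidence = knn.predict_proba(query_embeddings).max(axis=1)
print(predicted_types[:5], confidence[:5])
```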

Data normalization, such as eliminating batch effects, and multimodal integration of data are two more uses. Broadly speaking, "simulation is one of the great strengths" of AI models, Nvidia's Vacek said, not just of virtual cell models.

"Generative AI does a much better job of simulating the true complexity of the human genome properly" than previous approaches, he said, especially regarding structural variants, which are harder to simulate than SNPs and indels. "Conversely, it would be better at calling structural variants, as well," he said.

In addition to free access through CZI, the AI tools mentioned in this article can be downloaded directly from GitHub and run on a laptop computer, though that might be an excruciatingly slow process. To grease the wheels, some companies have already begun commercializing them, from startups to public companies like Nvidia and Ginkgo Bioworks.

Capitalizing on commercial opportunities

Given generative AI's ability to create novel protein sequences and perform in silico perturbation experiments, drug discovery is a fertile area for companies to apply these tools, and several companies are releasing tools they've created for public use.

Nvidia offers BioNeMo, an AI platform for building and training models for drug discovery, including 3D protein structure prediction, de novo protein and small molecule design, and molecular docking, among other applications, including genomics via the Geneformer and DNABERT models. Numerous companies in the drug discovery, sequencing, and infectious disease fields are using BioNeMo to build and use generative AI models, Vacek said.

Startups are all over this space, including UK-based Shift Bioscience, which raised $16 million in seed funding this month, and Phenomic AI, which has developed a modified version of scVI to look for unique targets expressed in cancer tissue versus normal tissue. Last month, Phenomic released a free version of its tool that contains its data from normal tissues but not its cancer sample data, which it considers proprietary. Like other similar models, Phenomic AI's tool can do cell typing and data normalization, said Sam Cooper, the firm's cofounder and chief technology officer.

"You can train a machine-learning model to go from English to French without having any paired data. So, you can train a model just to read English and just to read French, and it'll figure out a rough translation approach that works surprisingly well," he said. Phenomic AI uses that approach for getting rid of technical batch effects between single-cell datasets generated by different assays, namely 10x Genomics' Chromium assays and plate-based assays from InDrop, another droplet-based method that is mostly defunct.

"The most exciting thing is, we think we can take the same approach to mapping spatial RNA and bulk RNA, as well," he said. "We can create a unified model of different sorts of RNA expression technologies."

This translation approach could help with multimodal data integration, such as single-cell ATAC-seq and methylation data. Current methods typically bridge the datasets with single-cell gene expression, often from co-assays. "It's not as good as having massive amounts of paired data, but there's not that much paired data in biology compared to the amount of unpaired data," Cooper said. "And the differences between the technologies and modalities are way smaller than they are between English and French."

In late September, Ginkgo Bioworks began selling access, through an application programming interface, to AA-0, a protein LLM built in collaboration with Google. The model is built on Ginkgo's proprietary data on protein structures and interactions. It's one of several AI models in development at Ginkgo and part of a broader strategy of offering its proprietary technologies to customers. The firm is also selling data for others to train AI models on.

To start, Ginkgo is offering two uses of AA-0. The first allows customers to "mask" a particular section of the input, say, a variable region in an antibody, and the model will fill in what's missing. The second is an "embedding calculation," an intermediate step in protein classification that determines, for example, whether a protein is a kinase or how many proteins in a dataset are kinases.
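
Ginkgo has not published further detail on AA-0's interface, but the masking idea itself is standard masked-language-model inference, sketched below with a small public protein language model (ESM-2) as a stand-in: mask one residue position and let the model rank likely amino acids for it. The sequence is a toy fragment.

```python
from transformers import pipeline

# Small public protein language model used as a stand-in for a commercial protein LLM;
# the fill-mask pipeline predicts what belongs at the masked residue position.
fill = pipeline("fill-mask", model="facebook/esm2_t6_8M_UR50D")

# Toy protein fragment with a single residue masked out.
sequence = "MKT" + fill.tokenizer.mask_token + "VLQADGSTRE"

for prediction in fill(sequence, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))
```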

"For a protein with around 500 amino acids, users should be able to get predictions on 2,000 sequences for roughly 20 cents," said Ankit Gupta, general manager of Ginkgo AI, adding that there will also be a free tier of access to the model.

For interested researchers, the barrier to entering the brave new world of AI isn't high. "Researchers comfortable running computational tools will not find [Geneformer] so different," Bick said. "Two graduate students can pick it up over the course of a week."

Other tools also have good documentation, making them relatively easy to pick up, said Neda Mehdiabadi, a rare disease researcher at Australia's Murdoch Children's Research Institute, who has tried out both Geneformer and scFoundation, a model developed by researchers at China's Tsinghua University. "I could understand both of them," she said. "I didn't need to have direct input from the authors. The only reason I decided to go with Geneformer was the recent update to make the model larger."

But is bigger necessarily better? Benchmarking foundational AI models against each other is an emerging challenge, as is comparing them to methods already in use — including human intuition.

"It's really important to have benchmarking on biologically meaningful tasks, as well as a diverse panel of those tasks, to confirm that the model has learned generalizable knowledge and to ensure that we consider how the ground truth was established," Theodoris said. "Because in some cases, it might not be very clear."

Though LLMs are prone to hallucinating in ways that could be detrimental to science, like making up citations, models like Geneformer don't suffer from this in the same way because of the specificity of the data they were trained on. Moreover, by using them as hypothesis generators rather than the final word, it becomes a question of good or bad hypotheses rather than real versus imagined results.

"A hallucination could just be a bad prediction. It's something we're still learning about," Bick said, noting that his team is trying to do more systematic benchmarking of Geneformer and its predictions. "Some of the questions we're trying to answer are, 'What is our comparator group?' and 'What's our null hypothesis?' Is it a random set of genes pulled out of a hat? Is it some researcher saying, 'Here are six genes I think are cool?'"

Theodoris further suggested "there could be a lot of questions that we're able to answer with simpler approaches, where we don't necessarily need these larger models. We really want to understand where they are able to push our knowledge and make the predictions that the other approaches are not able to."

But the way things are going, it may be hard to imagine that AI won't work its way into every aspect of science.

Cleaning up a mess

Over the course of more than three decades as a scientist, Nolan has generated heaps of data. As a cofounder of Akoya Biosciences, IonPath, and Scale Biosciences, he has helped others create heaps more. Lately, he has become frustrated, feeling he has helped drown the field in more data than researchers could hope to analyze.

But with a new startup founded last year, called Cellformatica, he's hoping to "clean up the mess my lab helped to create." The firm uses an LLM trained on 38 million PubMed abstracts, 6 million full-text articles, and 17 structured biological datasets to generate novel research ideas when provided with data — a mass spectrometry signature, a list of target genes — and a context, such as head and neck cancer. "What we've got behind the scenes is a Ph.D.-level scientist doing six months of work for you in an hour," Nolan claimed.

"It gives you hypotheses, many of which you could come up with yourself, but why should you?" he asked. In addition, it will outline the validation experiments needed to test the hypotheses. One can even tell it not to exceed a certain cost with experiments or to exclude results from a particular lab in its analysis.

"It isn't creative by nature," Nolan said, but it does have a huge advantage over a human — it can analyze the entirety of the scientific literature at blazing fast speeds. "It basically does the hard part of the legwork of going into the literature and finding answers and summarizing them for you in ways that you wouldn't have thought of doing before," he said.

Cellformatica also has a module that looks for connections between genes or cellular processes in the context of a particular disease, going as far as building "causality maps" that can show how a cancer progresses.

Nolan was able to use Cellformatica to create hypotheses for which immune cell events were associated with a response to immune checkpoint blockade in an analysis of the tumor microenvironment in Merkel cell carcinoma. It also provided a list of targets that could be used to test the hypotheses, some of which could be drugged.

Still, Nolan considers Cellformatica to be "relatively primitive" in comparison to what's possible, and a new development in AI models could have untold consequences for biology, he said.

Last month, OpenAI released a new model that promises the ability to perform multistep reasoning, something ChatGPT doesn't do. In a blog post, the firm said the new "o1" model outperformed a Ph.D.-level human on a benchmarking battery of physics, biology, and chemistry problems, the first AI model to do so. "These results do not imply that o1 is more capable than a Ph.D. in all respects — only that the model is more proficient in solving some problems that a Ph.D. would be expected to solve," the firm said, noting that it can even be used "to annotate cell sequencing data." OpenAI did not respond to requests for comment.

These so-called "chain-of-thought" models could help researchers determine the questions they need to ask, and such a model would take the Cellformatica approach one step further, Nolan said. "We can probably expect it to be more incisive in how it answers. If we wanted to provide details on how to do an experiment, it would reason through the questions better."

"It really behooves us to take advantage of whatever technology might be available," he said. "If using large language models allow us to better understand what cancer means, faster, we'd be fools not to take advantage of it, if it's sitting right there on offer."