New Stanford AI Tool Predicts Gene Expression Profiles From Pathology Slide Images

NEW YORK – Researchers from Stanford University and their collaborators have developed an AI model that can predict gene expression profiles for cancer samples using whole-slide images.

Described in a Nature Communications paper last month, the algorithm, dubbed Slide-based Expression Quantification using Linearized Attention (SEQUOIA), could help with clinical decision-making and risk stratification in oncology, though its real-world clinical utility remains to be determined.

"The whole idea of this model is that we use a routinely collected image, which is this H&E image, to predict the RNA sequencing," said Olivier Gevaert, a professor of biomedical data science at Stanford and the corresponding author of the paper.

To build SEQUOIA, Gevaert's team utilized a linearized transformer model to capture contextualized whole-slide image features. They also leveraged UNI, a foundation model developed by Harvard University researchers that was pretrained using more than 100 million images from over 100,000 diagnostic hematoxylin and eosin-stained whole-slide images across 20 major tissue types.

To train and validate the AI algorithm, Gevaert said the team used whole-slide images and matched bulk RNA-seq gene expression data from 7,584 tumor samples of 16 different cancer types in the Cancer Genome Atlas (TCGA). Given the heterogeneity of cancer, they developed and validated the model for each tumor type.

Overall, the authors noted that SEQUOIA was able to "accurately predict the expression levels of many genes." By comparing the predicted gene expression profile with the ground truth obtained by RNA-seq, they concluded that 15,344 out of 20,820 genes were "significantly well predicted" across the 16 cancer types.
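As a rough illustration of what comparing predicted expression with RNA-seq ground truth can look like, the sketch below computes a per-gene Pearson correlation across samples and keeps the genes that clear a cutoff. The article does not spell out the criterion behind "significantly well predicted," so the correlation metric, the `min_r` threshold, the function names, and the toy gene names and values are all assumptions made for illustration, not SEQUOIA's actual evaluation code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def well_predicted_genes(predicted, measured, min_r=0.5):
    """Return the sorted genes whose predicted-vs-measured correlation is at least min_r.

    predicted, measured: dict mapping gene -> list of per-sample values,
    with samples in the same order in both dicts.
    min_r is a hypothetical cutoff, not a value taken from the paper.
    """
    return sorted(
        gene
        for gene in predicted
        if gene in measured and pearson(predicted[gene], measured[gene]) >= min_r
    )


# Toy data (invented): GENE_A tracks the measurements, GENE_B does not.
predicted = {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [2.0, 1.0, 2.0, 1.0]}
measured = {"GENE_A": [1.1, 1.9, 3.2, 3.9], "GENE_B": [1.0, 2.0, 3.0, 4.0]}

print(well_predicted_genes(predicted, measured))
```

A real evaluation would also test each gene for statistical significance and run on held-out samples for each cancer type; this sketch shows only the per-gene comparison step.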

They also observed that the number of well-predicted genes was positively correlated with the number of training samples available for each cancer type. For instance, the researchers identified the highest number of well-predicted genes, 18,878, in breast cancer, which had the largest number of training samples. For thyroid carcinoma, they identified 18,758 well-predicted genes, and 17,623 genes for kidney cancer.

Meanwhile, the lowest number of well-predicted genes, just 9,535, was for pancreatic adenocarcinoma, which had the lowest number of training samples of just over 200.

The researchers further tested the model's generalizability by benchmarking it on independent cohorts from the Clinical Proteomic Tumor Analysis Consortium (CPTAC), using samples from seven cancer types, including breast, lung, kidney, brain, colon, and pancreas. They also validated the model with a lung adenocarcinoma cohort from Tempus.

The researchers found the well-predicted genes were associated with key pathways implicated in cancer progression, including those for regulating cell cycles, inflammation, angiogenesis, and hypoxia response. Additionally, SEQUOIA effectively captured cell-type markers, including those for endothelial cells, CD4 T cells, M2 macrophages, and B cells.

The Stanford team further explored SEQUOIA's potential clinical utility for predicting clinical outcomes, focusing on breast cancer, where they developed a 272-gene signature that aims to predict the risk of breast cancer recurrence.

Overall, the study showed that the gene expression profile predicted by SEQUOIA can be harnessed for risk stratification: patients assigned high-risk scores appeared to have shorter recurrence-free survival than those with low-risk scores.
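To make the idea of signature-based risk stratification concrete, here is a minimal sketch that turns predicted expression values into a per-patient score and splits patients into high- and low-risk groups. The article does not say how the 272-gene signature is scored or where the cutoff sits, so the weighted sum, the median split, the function names, and the toy patients, genes, and weights are all hypothetical.

```python
from statistics import median


def risk_scores(expression, weights):
    """Score each patient as a weighted sum over the signature genes.

    expression: dict mapping patient -> dict of gene -> predicted expression.
    weights: dict mapping signature gene -> weight (hypothetical values).
    Genes missing for a patient contribute 0.0.
    """
    return {
        patient: sum(w * genes.get(gene, 0.0) for gene, w in weights.items())
        for patient, genes in expression.items()
    }


def stratify(scores):
    """Label patients 'high' if their score exceeds the cohort median, else 'low'."""
    cutoff = median(scores.values())
    return {p: ("high" if s > cutoff else "low") for p, s in scores.items()}


# Toy cohort (invented values) with a two-gene stand-in for a larger signature.
weights = {"GENE_A": 1.0, "GENE_B": -0.5}
expression = {
    "P1": {"GENE_A": 2.0, "GENE_B": 1.0},
    "P2": {"GENE_A": 0.5, "GENE_B": 2.0},
    "P3": {"GENE_A": 3.0, "GENE_B": 0.0},
    "P4": {"GENE_A": 1.0, "GENE_B": 1.0},
}

scores = risk_scores(expression, weights)
groups = stratify(scores)
print(groups)
```

In a study setting, the two groups would then be compared on recurrence-free survival, for example with Kaplan-Meier curves, which this sketch leaves out.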

Although SEQUOIA was developed using bulk RNA-seq data, Gevaert said, the group sought to tap the model for predicting spatial gene expression at the regional level, given that the algorithm was trained using smaller tiles derived from the whole-slide images.

To achieve that, they developed a spatial prediction algorithm that can help infer region-level gene expression patterns within tumor tissues, validating the results with two spatial datasets from independent cohorts of glioblastoma and breast adenocarcinoma patient samples. 

SEQUOIA is publicly available on GitHub, Gevaert noted, where his team has deposited the code for data preprocessing, model training, and evaluation. Researchers can download the model and run it locally for their own experiments, he added.

SEQUOIA is very cheap to run, he noted, with the only overhead being the computing cost. With a dedicated GPU, the model can generate results within minutes or maybe even seconds.

Still, Gevaert noted that AI models such as SEQUOIA will likely not replace RNA sequencing but rather be complementary to molecular biology-based technologies. 

With the rise of AI-driven digital pathology technologies, SEQUOIA is not the only model tapping the vast databases of stained oncology slide images to inform cancer risk stratification and clinical decisions. Some of these algorithms have already been developed into commercially available tests.

New York-based Ataraxis AI, for instance, in November launched the Ataraxis Breast test, which was developed using the Kestrel pan-cancer foundation AI model for breast cancer prognostication using imaging slides.

Gevaert said his team plans to first use SEQUOIA as a research tool to generate gene expression profiles with the available imaging data. They also plan to further develop and validate the model’s spatial transcriptomics analysis accuracy and capability.

While Gevaert said he is not familiar with Ataraxis AI and its technology, his team is also interested in potentially establishing a company to commercialize SEQUOIA into a clinical assay. To that end, they have been working with pathologists at Stanford to brainstorm suitable clinical applications for SEQUOIA and discuss possible FDA submissions down the road.

"We are going to first figure out what the most low-hanging fruit clinical use cases [for SEQUOIA] are," he said. "That is the first step toward commercial translation."