NEW YORK – Investigators from healthcare tech firm Owkin are making strides in the development of computational methods that can improve and expand the biomarker information gleaned from histological slide images, which could allow much more complex and comprehensive analyses to be performed within a well-established and relatively inexpensive process.
Although numerous academic groups and companies are exploring the implementation of machine learning or AI to improve and better standardize histopathology, Owkin has taken this a step further than most, showing in a recent study that its approach can not only improve morphologic analysis or visual biomarker detection, but can predict or recapitulate a cancer's gene expression levels without the need to actually sequence the tumor's RNA.
Reporting their results in Nature Communications earlier this month, company researchers described what they call HE2RNA, a deep learning model that they showed can accurately predict RNA-seq expression of tumors based solely on digitized histopathology images.
The name HE2RNA comes from the tool's ability to glean gene expression from hematoxylin-eosin (H&E)-stained biopsy slides.
Elodie Pronier, an Owkin translational research scientist and coauthor of the new study, said that she and her colleagues have a lot of work to do to establish and validate the most appropriate use cases for their method. But in general, the team believes that the ability to measure gene expression without having to sequence tumor samples could be incredibly useful.
Various gene expression signatures have already found a place in precision oncology, cancer research, and target discovery for personalized drug development.
But transcriptomic sequencing remains relatively expensive compared to the more basic laboratory procedures involved in tissue imaging, Pronier said. Being able to glean transcriptomic information as part of a routine pathology process could save significant money and time and accelerate clinical translation and implementation.
According to Pronier, the ability to use machine learning to connect information at the genomic, cellular, and tissue levels could also be transformational in terms of discerning new biomarker signatures or combining existing markers in ways that make them more powerfully predictive.
"When people do RNA-seq … you can define genes that are associated with survival [or other outcomes], and that's how we've been doing targeted therapies for the last decade," Pronier added. At the same time there may be lost opportunities with genomics and pathology remaining siloed from one another. "I really think that being able to combine these two, you're going to find something new and something different that was not found by either," she said.
The creation of histology slide images from tumor biopsies is a routine step in the process of diagnosis and therapeutic decision-making for cancer patients.
Jeffrey Chuang, an associate professor at the Jackson Laboratory for Genomic Medicine, who was not involved in the study but whose work is also focused on applying deep learning methods to the analysis of histopathology images, said in an email that despite the ubiquity of H&E slide analysis in oncology, it has been unclear just how much information these images might contain. As a result, "development of new machine learning approaches to analyze such data is a scorching hot field," he said.
Numerous efforts have emerged, seeking to improve upon both the diagnostic interpretation of tissue histology and the quantification of specific biomarkers by pathologists.
For example, researchers from Bristol-Myers Squibb and PathAI recently reported that they were able to create an artificial intelligence-powered algorithm that could accurately score PD-L1 expression of tumor and immune cells from immunohistochemically stained slide images. As drugmakers continue to advance cancer immunotherapies for which PD-L1 expression predicts response, they are hoping to be able to rely on more scalable and reproducible methods to assess the biomarker.
Other efforts have shown that even absent specific markers like PD-L1, computer learning can find ways to use morphological and visual patterns in slide images to predict prognosis or drug response.
Owkin's recent publication has pushed this concept further, challenging deep learning to not just recapitulate or extend what a human pathologist might do manually, but to somehow extract information from a pathology slide image that a human might not see at all.
The team isn't the only group to make this leap, with other recent studies showing that deep learning can use slide images to predict gene mutations in lung cancers and prostate cancers. Similar approaches have also been used to build models to predict genomic information from brain tumor MRIs.
In his email Chuang said that the mapping of histopathological image data to genetic expression data "has immense potential, as in theory one could obtain characterizations as specific as those from RNA-seq but at the much lower cost of H&E imaging."
He called the Owkin team's results in this vein encouraging but added that "further research will be necessary to determine the range of questions for which H&E-based predictions will be sufficient."
Training and validation
In their study, the Owkin investigators described their development of HE2RNA using a collection of matched H&E-stained pathology images and RNA-seq data from 28 different cancer types and 8,725 patients in The Cancer Genome Atlas database. The team fed their learning system these pairs, batched into training and validation sets in a five-fold cross-validation, and tested whether a resulting algorithm could take a new, unknown slide image and infer what the matching gene expression data would be.
"The computer has known that image A goes with RNA-seq B. But then if you give [it] image C [without any] RNA-seq data, it predicts what will be the RNA-seq data," Pronier said.
According to the authors, HE2RNA's predictions showed a "statistically significant correlation" with measured expression for an average of 3,627 genes per cancer type, including 2,797 protein-coding genes.
The number of significantly well-predicted genes varied considerably between cancer types, which the group said looked to be mostly due to the relative size of the datasets used. For example, only seven genes were accurately predicted for 44 samples of diffuse large B-cell lymphoma, compared to 15,391 that were correctly predicted for the 1,046 samples of lung cancer in the dataset.
Investigators also compared the genes that were the most accurately predicted in each cancer type. Although none were well-predictable in all 28 types, there were several that were above the significance threshold in smaller groups of cancers. For example, C1QB expression was "strikingly well-predicted" in 17 of 28 different cancer datasets, while NKG7, ARHGAP9, C1QA, and CD53 were accurately predicted in 15 of 28 datasets.
Overall, 156 genes were found to be well-predicted in at least 12 out of 28 different cancer types. Somewhat expectedly, this list was enriched for genes involved in immunity and T-cell regulation. If a computer is able to detect visual signals that relate to gene expression, it would make sense for those signals to reflect genes involved in immune infiltration, since infiltration should produce clearer visual or morphological changes in the tissue.
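One common way to arrive at such counts, sketched below with synthetic data, is to correlate predicted and measured expression for each gene across held-out samples and keep the genes whose correlation survives a multiple-testing correction; the exact statistical procedure used in the paper may differ.

```python
import numpy as np
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
n_samples, n_genes = 100, 500                     # toy sizes
Y_true = rng.normal(size=(n_samples, n_genes))    # measured RNA-seq values
Y_pred = Y_true + rng.normal(scale=2.0, size=Y_true.shape)  # noisy predictions

# Per-gene correlation between measured and predicted expression, with a
# false-discovery-rate correction to decide which genes count as well-predicted.
pvals = np.array([pearsonr(Y_true[:, g], Y_pred[:, g])[1] for g in range(n_genes)])
reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"{reject.sum()} of {n_genes} genes pass the significance threshold")
```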
The data also suggested that HE2RNA could detect pathways deregulated in specific types of cancer. In liver cancer, for example, well-predicted genes were associated with mitosis and cell-cycle control, "known hallmarks of cancer." And in breast cancer samples the group saw strong prediction of expression levels for various genes involved in cell-cycle regulation, as well as the gene CHEK2, which is known to be mutated in these tumors and involved in their progression.
According to Pronier, the fact that the model is picking up on things like immune infiltration is encouraging, and helps to build a base of validity for this type of machine learning approach.
"The ones that are really easy to predict are the ones that are seen visually. For example, if you take an H&E slide, looking just by the color of the cells and the shape of the cells, even if you're not a pathologist … you're going to be easily able to detect immune cells because compared to other cells they're shorter and smaller. The nucleus is really tiny," Pronier said.
"After a few slides, even you are like the machine learning algorithm: you learn to recognize immune cells, she added. "So it's reassuring, that [the machine] can do something that easy."
When the group considers what else HE2RNA appears able to do, namely predicting the expression of genes that have no known or readily predictable effect on tissue appearance or morphology, it helps to have this baseline of data indicating that the system is actually picking up something real, even if human eyes can't discern it.
But there are other things the algorithm seems to be able to do that don't make immediate sense in a visual context.
"There are these other portions where … even the pathologists are going to tell us 'I don't see what the machine is seeing.' And that's the part where we are doing a lot of research because it's hard for us to tell if it's a mistake from the from the machine or if it's something that's not known yet," Pronier said. "Just because the pathologist doesn't see it doesn't mean it doesn't exist."
Clinical applications
In a final set of experiments for the study, Pronier and her colleagues also tested whether specific gene signatures that are known to be dysregulated in a majority of cancer types could be accurately predicted by HE2RNA. The group developed lists of genes involved in cancer-associated pathways, including increased angiogenesis, increased hypoxia, deregulation of the DNA repair system, increased cell-cycle activity, immune response mediated by B cells, and adaptive immune response mediated by T cells.
HE2RNA was able to significantly predict the activity of each of these pathways and, at least in some cancer types, could do so more accurately than it could predict a random list of genes.
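A pathway-level activity score can be computed in several ways; the sketch below uses the simplest option, averaging expression over a signature's genes, with made-up gene lists and synthetic data, and compares how well that score is recovered versus a size-matched random gene list.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_genes = 100, 2000
Y_true = rng.normal(size=(n_samples, n_genes))                # measured expression
Y_pred = Y_true + rng.normal(scale=1.0, size=Y_true.shape)    # predicted expression

pathway_genes = rng.choice(n_genes, size=50, replace=False)   # e.g., a hypoxia gene set
random_genes = rng.choice(n_genes, size=50, replace=False)    # size-matched random control

def signature_score(expr, gene_idx):
    """Per-sample pathway activity: mean expression over the signature's genes."""
    return expr[:, gene_idx].mean(axis=1)

for name, genes in [("pathway signature", pathway_genes), ("random gene list", random_genes)]:
    r = np.corrcoef(signature_score(Y_true, genes), signature_score(Y_pred, genes))[0, 1]
    print(f"{name}: correlation between measured and predicted activity = {r:.2f}")
```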
To illustrate potential clinical applications, the team also used microsatellite instability (MSI) status prediction as a diagnostic use case. Using 81 MSI-high (MSI-H) patient samples from the TCGA collections, the team found that a "surprisingly high number of genes" were significantly well-predicted by HE2RNA.
A gene set enrichment analysis revealed an enrichment in T-cell activation and immune activation. Performing a similar analysis in non-MSI-high patients, the group saw mostly pathways involved in RNA metabolism and translation regulation.
Pronier and her colleagues wrote that previous research has shown that machine learning can predict MSI status directly from histology slides in some cancer types, and they decided to investigate whether integrating HE2RNA's gene expression predictions could improve this even further.
After a series of statistical analyses, the group concluded that a classifier based on the transcriptomic representation of whole-slide images (WSIs) outperformed direct, image-based classification.
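The shape of that comparison, with synthetic data and simple logistic regression classifiers standing in for the models actually used in the study, might look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n_slides = 300
image_features = rng.normal(size=(n_slides, 256))          # direct slide-level image features
predicted_expression = rng.normal(size=(n_slides, 1000))   # HE2RNA-style transcriptomic representation
msi_status = rng.integers(0, 2, size=n_slides)             # 1 = MSI-high (synthetic labels)

# Train the same kind of classifier on each representation and compare
# cross-validated performance at predicting MSI status.
for name, X in [("image features", image_features),
                ("transcriptomic representation", predicted_expression)]:
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, msi_status,
                          cv=5, scoring="roc_auc")
    print(f"{name}: mean cross-validated AUC = {auc.mean():.2f}")
```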
Spatial genomics
Another particularly exciting finding, Pronier said, was that the model was also able to spatially locate the expression of specific genes within a whole-slide image based on its predictions. This could potentially allow researchers to create a virtual spatial transcriptomics map without actually having to perform RNA-seq.
The process operates as a kind of reverse of the RNA-seq prediction: rather than inferring expression from an image, the model takes known transcriptomic data and infers its spatial distribution across the corresponding tissue sample.
"Even by learning the bulk expression, [HE2RNA] was able to predict localization of the expression of some genes," Pronier said. "Of course, it doesn't work for all 20,000 coding genes. But I think for me, this was the part that blew my mind the most. I've been waiting my career to get access to single-cell sequencing because it was too expensive. And the machine learning model can actually do this on an H&E slide."
In particular, the authors wrote, HE2RNA was able to spatialize genes specifically expressed by T cells or B cells, "even though discriminating between those two types from their morphology alone is notoriously difficult."
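Conceptually, once a model produces per-tile expression estimates whose aggregate matches the slide-level value, those estimates can be painted back onto the tile grid as a heatmap. The sketch below uses made-up tile scores purely to show the mechanics, not the study's actual tiling or aggregation scheme.

```python
import numpy as np

rng = np.random.default_rng(4)
grid_h, grid_w = 20, 30                            # tiles laid out across the slide
tile_scores = rng.random(size=(grid_h, grid_w))    # per-tile predicted expression of one gene

# The mean over tiles corresponds to the slide-level (bulk) prediction, while the
# per-tile values localize where the gene appears to be expressed on the slide.
bulk_estimate = tile_scores.mean()
hottest_tile = np.unravel_index(tile_scores.argmax(), tile_scores.shape)
print("slide-level prediction:", round(float(bulk_estimate), 3))
print("tile with highest predicted expression (row, col):", hottest_tile)
```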
Next steps
Moving forward, Pronier said that the Owkin team is working to better understand where and how HE2RNA, or other machine learning mechanisms being explored by the company, may have the most clinical value.
Using tools like HE2RNA to measure or predict immune infiltration is probably the lowest hanging fruit, she said, and the first area where the clinical community might feel comfortable that the approach is robust and can be used in diagnosis.
With this as a foundation, Owkin and others can try to push forward into more complicated clinical tasks in which gene expression prediction takes a more central role.
Based on the data the researchers saw in their experiments, MSI could be one such use case, and would make sense as a target for test development because existing clinical assays offer an established ground truth against which to validate.
Beyond that, the group is also hoping to explore completely novel outcome- or therapy-predictive associations that might be uncovered by the HE2RNA approach. "Of course we may be limited [in that we are detecting] genes that may have a … morphological inference onto the cells," she said. "But if you think about it … these actually might be the thing that matters — the things that are actually important for disease."