NEW YORK – Stanford University researchers are turning to a single-cell statistical modeling method to track down driver gene fusions that can propel cancers forward, uncovering fusions in individual tumor cells that appear more diverse and widespread than appreciated in the past.
During a presentation in a session on bioinformatics and artificial intelligence for cancer research and development at the second virtual session of the American Association for Cancer Research annual meeting on Wednesday, Stanford researcher Roozbeh Dehghannasiri outlined a computational strategy for identifying gene fusions from single-cell RNA sequencing (scRNA-seq) data using an algorithm called "Single cell precise splice estimation," or SICILIAN.
The unbiased statistical approach is designed to work on top of any conventional spliced-alignment method to find a range of RNA variants in individual cells profiled by scRNA-seq, Dehghannasiri explained, from circular RNAs to splicing isoforms and gene fusions.
Fusions in particular may serve as treatment targets or as biomarkers for tumor features and progression, he noted, pointing to advances that have been made in treating chronic myelogenous leukemia since the early 1990s by targeting a BCR-ABL1 gene fusion in the blood cancer with a tyrosine kinase inhibitor. That has spurred interest in more fully understanding the suite of gene fusions and their effects across cancer, particularly given the vast amounts of RNA sequence data that are now available.
"We have the unique opportunity of using hundreds of publicly available, massive sequencing databases to have accelerated discovery of gene fusions and RNA variants in general, using precise computational methods" Dehghannasiri said. "However, despite this great promise, there are many aspects of gene fusions that are still unknown and the main reason for that is that the current methods still suffer from high false positives and false negatives."
In an effort to get a more precise picture of the fusions present in individual tumor cells profiled by single-cell RNA-seq, he and his team developed SICILIAN to overcome fusion detection problems related to information dropouts, low-coverage RNA-seq data, sequencing noise, alignment biases, and other issues that arise in scRNA-seq profiles.
The SICILIAN framework focuses on candidate splice junctions found in alignment files such as BAM files, Dehghannasiri explained, using a generalized linear model that takes a range of alignment features into account to assign aggregated statistical scores to each proposed junction. Those scores are intended to help weed out false-positive or false-negative fusion calls that turn up using conventional fusion detection tools, such as read count-based methods, by removing candidate fusions below a certain statistical threshold.
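The general idea of scoring junctions with a generalized linear model can be illustrated with a minimal Python sketch. It is not SICILIAN's actual model: the feature names (`mapq`, `overlap`, `mismatches`), the coefficients, the use of a mean to aggregate per-read probabilities, and the 0.5 threshold are all assumptions made up for illustration.

```python
import math

# Hypothetical coefficients for a logistic GLM over per-read alignment features.
# These values are illustrative only and are not SICILIAN's fitted parameters.
COEFFS = {"intercept": -4.0, "mapq": 0.05, "overlap": 0.1, "mismatches": -0.8}


def read_probability(read):
    """Probability that one junction-spanning read is a genuine alignment."""
    z = (
        COEFFS["intercept"]
        + COEFFS["mapq"] * read["mapq"]
        + COEFFS["overlap"] * read["overlap"]
        + COEFFS["mismatches"] * read["mismatches"]
    )
    return 1.0 / (1.0 + math.exp(-z))  # logistic link


def junction_score(reads):
    """Aggregate per-read probabilities into one score (mean, as a stand-in)."""
    return sum(read_probability(r) for r in reads) / len(reads)


def filter_junctions(junctions, threshold=0.5):
    """Keep only junctions whose aggregated score reaches the threshold."""
    kept = {}
    for name, reads in junctions.items():
        score = junction_score(reads)
        if score >= threshold:
            kept[name] = score
    return kept


# Toy candidates: both junctions have two supporting reads.
candidates = {
    "well_supported": [
        {"mapq": 60, "overlap": 30, "mismatches": 0},
        {"mapq": 60, "overlap": 25, "mismatches": 1},
    ],
    "noisy": [
        {"mapq": 3, "overlap": 8, "mismatches": 3},
        {"mapq": 5, "overlap": 6, "mismatches": 2},
    ],
}
kept = filter_junctions(candidates)
print(sorted(kept))
```

In this toy input both candidates have the same number of supporting reads, so a read-count filter could not separate them, while the score-based filter keeps only the junction whose reads align cleanly.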
"Only considering fusions with high enough statistical scores can dramatically increase the precision of detection over typical detection strategies, such as filtering on the number of aligned reads or using ontology-level heuristic filters," Dehghannasiri and his co-authors wrote in an abstract accompanying the AACR presentation.
To further account for false positives related to sampling on large numbers of single cells, meanwhile, the team integrated a multiple hypothesis testing correction into the SICILIAN pipeline, calculating a median statistical score for a given junction across many individual cells considered from a given sample.
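A rough Python sketch of the median step might look like the following. The input layout, the choice to take the median only over cells in which a junction was detected, and the 0.5 cutoff are assumptions for illustration, not details from the presentation.

```python
from statistics import median


def median_junction_scores(cell_scores):
    """Map each junction to its median score across the cells where it was seen.

    cell_scores: {cell_id: {junction_id: score}} (a hypothetical layout).
    """
    per_junction = {}
    for scores in cell_scores.values():
        for junction, score in scores.items():
            per_junction.setdefault(junction, []).append(score)
    return {junction: median(values) for junction, values in per_junction.items()}


def call_junctions(cell_scores, threshold=0.5):
    """Report junctions whose median score across cells reaches the threshold."""
    medians = median_junction_scores(cell_scores)
    return {j for j, m in medians.items() if m >= threshold}


# Toy scores: junction "B" looks convincing in one cell but not in the others.
cell_scores = {
    "cell_1": {"A": 0.90, "B": 0.80},
    "cell_2": {"A": 0.95, "B": 0.10},
    "cell_3": {"A": 0.85, "B": 0.20},
}
medians = median_junction_scores(cell_scores)
called = call_junctions(cell_scores)
print(medians, called)
```

The point of the toy data is that a junction passing a per-cell cutoff in a single cell is not enough: with thousands of cells, some spurious junction will score well somewhere by chance, and a sample-level summary dampens that effect.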
After benchmarking the SICILIAN tool using published scRNA-seq and bulk RNA sequencing data for five lung adenocarcinoma cell lines, the researchers applied the algorithm to scRNA-seq profiles for around 75,000 individual lung or blood cells assessed by 10x Chromium and Smart-seq2 for a lung cancer atlas project that Stanford and the Chan Zuckerberg Biohub researchers described in a bioRxiv preprint last year.
In that atlas, Dehghannasiri and his team used SICILIAN to track down fusions missed using read count approaches, including some splice junctions reported from prior analyses of Genotype-Tissue Expression project data and the related CHESS gene-transcript catalog.
"Before SICILIAN, only 10 percent of junctions were shared with those databases," Dehghannasiri noted. "After SICILIAN, 50 percent of junctions are shared with the databases, which highlights that SICILIAN, in an unbiased way, can enrich for known splice junctions."
Similarly, the researchers noted that the algorithm appears to compare favorably to existing methods such as STAR-Fusion for finding chimeric fusions, which may lack telltale alignment features found in other splice junctions. To overcome such concerns, they again used statistical modeling to score potential chimeric fusions, Dehghannasiri explained.
After training that chimeric fusion detection model, for example, the investigators reported that SICILIAN could pick up fusions with enhanced specificity and fewer false-positive calls, but also fewer true-positive calls, compared to the STAR-Fusion software in simulated datasets containing hundreds of true fusions.
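The trade-off described here, fewer false positives at the cost of fewer true positives, is commonly summarized as precision and recall against a simulated truth set. The Python sketch below uses placeholder fusion names and two generic callers labeled `permissive` and `conservative`; none of the values come from the study or from either tool.

```python
def precision_recall(called, truth):
    """Return (precision, recall) for a set of fusion calls against a truth set."""
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Placeholder fusions planted in a hypothetical simulated dataset.
truth = {"GENE_A--GENE_B", "GENE_C--GENE_D", "GENE_E--GENE_F", "GENE_G--GENE_H"}

# A permissive caller finds more true fusions but also reports spurious ones.
permissive = {
    "GENE_A--GENE_B", "GENE_C--GENE_D", "GENE_E--GENE_F",
    "GENE_X--GENE_Y", "GENE_Z--GENE_W",
}

# A conservative caller reports only calls it is confident in.
conservative = {"GENE_A--GENE_B", "GENE_C--GENE_D"}

p_perm, r_perm = precision_recall(permissive, truth)
p_cons, r_cons = precision_recall(conservative, truth)
print(f"permissive:   precision={p_perm:.2f} recall={r_perm:.2f}")
print(f"conservative: precision={p_cons:.2f} recall={r_cons:.2f}")
```

In this toy comparison the conservative caller makes no false calls but misses half of the planted fusions, which is the same shape of trade-off the investigators reported.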
Going forward, Dehghannasiri noted, the team is collaborating with other investigators at Stanford to apply SICILIAN to large single-cell datasets in an effort to come up with more complete maps of the splicing and fusion patterns in tumor and normal cells.