NEW YORK (GenomeWeb) – Researchers led by a group at Spain's National Center for Genomic Analysis-Center for Genomic Regulation (CNAG-CRG) have developed a new analytical tool that enables large-scale processing of single-cell sequencing data.
While a wide range of existing single-cell RNA sequencing (scRNA-seq) approaches has allowed researchers to simultaneously process thousands of cells, current analytical tools cannot keep up with the large datasets created by these experiments and frequently lack sensitivity to find marker genes.
In order to tackle background noise and sparsity of scRNA-seq data, the new tool, called BigSCale, applies a numerical model that identifies differences between specific cells and groups of cells.
"This is important [in order] to define subpopulations and groups, because you end up with a higher number of markers, which help identify or annotate [certain] clusters to cell types or function, and predict the function of unknown cell types," CNAG-CRG team leader and senior author Holger Heyn explained.
In a study published last month in Genome Research, Heyn and his team evaluated the performance of BigSCale using both a biological model of aberrant gene expression in a patient's neuronal progenitor cells and simulated data sets. The technology's framework includes modules for differential expression analysis, cell clustering, and RNA biomarker identification.
To generate the numerical model, the team first grouped together cells that feature similar transcriptomes. Modeling differences in expression levels, the method assigns a P-value to each gene, representing the likelihood of a change in expression. Genes that repeatedly differ in expression between cells receive higher scores.
The method then performs cellular clustering by computing all pairwise cell distances to create a distance matrix and to assign cells into groups depending on their phenotypes. It computes a distance matrix over a set of genes presenting a high degree of variation across the dataset. Skewed and isolated genes, including gender-related or cell cycle-related genes, can be identified and removed to improve the cluster information.
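As a rough illustration of the clustering step described above, the following Python sketch picks the most variable genes, computes all pairwise cell distances, and groups nearby cells. It is not BigSCale's code: the Euclidean metric, the variance-based gene selection, the threshold-based grouping, the function names, and the toy expression values are all assumptions made for this example.

```python
import numpy as np

def top_variable_genes(expr, n_genes):
    """Return indices of the n_genes rows (genes) with the highest variance across cells."""
    variances = expr.var(axis=1)
    return np.argsort(variances)[::-1][:n_genes]

def pairwise_distances(expr):
    """Euclidean distance between every pair of columns (cells).

    Builds an (n_cells, n_cells, n_genes) intermediate array, so it is only
    suitable for toy inputs like the one below.
    """
    cells = expr.T  # shape: (n_cells, n_genes)
    diff = cells[:, None, :] - cells[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def group_cells(dist, threshold):
    """Assumed grouping rule: cells linked by a chain of distances below threshold share a label."""
    n = dist.shape[0]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and dist[i, j] < threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Toy matrix: 4 genes (rows) x 6 cells (columns); the values are made up.
expr = np.array([
    [10.0, 11.0,  9.0,  0.0,  1.0,  0.0],  # high in cells 0-2
    [ 0.0,  1.0,  0.0, 12.0, 10.0, 11.0],  # high in cells 3-5
    [ 5.0,  5.0,  5.0,  5.0,  5.0,  5.0],  # flat, uninformative
    [ 2.0,  2.0,  3.0,  2.0,  3.0,  2.0],  # nearly flat
])

genes = top_variable_genes(expr, n_genes=2)   # keeps the two informative genes
dist = pairwise_distances(expr[genes, :])     # 6 x 6 symmetric distance matrix
labels = group_cells(dist, threshold=5.0)     # one group label per cell
print(labels)
```

Restricting the distance matrix to high-variance genes keeps flat genes from diluting the signal that separates the two groups in this toy case.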
After cell cluster identification, BigSCale performs an iterative differential expression analysis between populations of cells to detect biomarkers, defined as genes unevenly expressed across populations.
According to the study's authors, BigSCale examines multiple alternative phenotypes of a cell by organizing markers into a hierarchical structure, where "increasing layers of phenotypic complexity ... are represented by markers at increasing hierarchical levels."
In the study, the team decided to convolute information from cells with corresponding transcriptomes into index cell (iCell) profiles in order to analyze datasets of up to one million cells. According to the study authors, the iCells can "preserve the transcriptome information from individual cells and can be deconvoluted for targeted analysis of populations of interest."
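The idea of pooling similar cells into iCell profiles while keeping a route back to the individual cells can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the article does not say how BigSCale pools cells, so summing counts per group, the `convolute` and `deconvolute` helper names, and the toy count matrix are hypothetical choices for illustration.

```python
import numpy as np

def convolute(counts, group_of_cell):
    """Pool cells into iCell-like profiles by summing the counts of each group.

    counts: genes x cells matrix; group_of_cell: one group id per cell.
    Returns the pooled genes x groups matrix and a dict that maps each
    group id to the indices of its member cells.
    """
    group_ids = sorted(set(group_of_cell))
    members = {g: [i for i, x in enumerate(group_of_cell) if x == g]
               for g in group_ids}
    pooled = np.column_stack([counts[:, members[g]].sum(axis=1)
                              for g in group_ids])
    return pooled, members

def deconvolute(counts, members, group_id):
    """Recover the original single-cell columns behind one pooled profile."""
    return counts[:, members[group_id]]

# Toy count matrix: 3 genes (rows) x 5 cells (columns); the values are made up.
counts = np.array([
    [3, 4, 0, 0, 1],
    [0, 1, 5, 6, 0],
    [2, 2, 2, 2, 2],
])
group_of_cell = [0, 0, 1, 1, 0]  # assumed grouping of similar cells

pooled, members = convolute(counts, group_of_cell)
print(pooled)                           # 3 genes x 2 pooled profiles
print(deconvolute(counts, members, 1))  # the two cells behind profile 1
```

Because the membership map is kept alongside the pooled matrix, downstream analysis can run on far fewer columns and still return to the original cells for a population of interest.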
The team evaluated the method by comparing its performance to standard scRNA-seq analysis techniques on 1,920 neuronal progenitor cells (NPCs) from two patients with Williams-Beuren syndrome and two with Dup7 syndrome. They found that BigSCale displayed the highest sensitivity at all tested specificity levels, suggesting that it is more sensitive than other methods for single-cell differential expression analysis when applied to biological data.
Evaluating BigSCale's speed in differential expression analysis, the team found that the method was the fastest tool in the biological model, producing results in three minutes. To compare the scalability of BigSCale with that of MAST, another single-cell analytical tool, at larger sample sizes, the team performed differential expression analysis on a simulated matrix of 40,000 genes in 32,000 cells and saw that BigSCale analyzed the samples faster in all conditions. In addition, BigSCale could analyze datasets larger than 8,000 cells, while MAST was not able to do so due to its RAM requirements.
To test the tool's ability to analyze large datasets, the team applied BigSCale to examine 1.3 million cells derived from the developing mouse forebrain. By convoluting the cells into iCells, the researchers identified rare populations, including reelin-positive Cajal-Retzius neurons, for which they found previously unrecognized heterogeneity that was linked to specific differentiation stages, spatial organization, and cellular function.
"Differentially expressed marker genes between subpopulations help the researchers to link cells to prior knowledge about the tissue anatomy or to describe the functions of newly discovered cell types," Heyn explained.
While Heyn's team at CNAG-CRG has shaped BigSCale for large-scale gene expression data analysis, research groups around the world have also developed or used approaches with similar goals for single-cell genomics. Researchers at Helmholtz Zentrum München in Germany, for example, developed a new tool called Scanpy that also uses a graph-like coordinate system to characterize cells by identifying their closest neighbors.
Heyn argued that BigSCale technology differs from standard methods because of its numerical model and streamlined process. The model predicts the difference between two cells or groups of cells, allowing for higher efficiency and improved sensitivity.
While most of the current tools address single tasks within the analytical process — including batch effect correction, clustering, and trajectories — BigSCale offers a "complete solution" that integrates primary data processing, differential expression analysis, cell clustering, gene marker selection, and data convolution, Heyn said. The user can then apply the framework for a complete data analysis and quicker interpretation.
As part of the Human Cell Atlas (HCA) Project, Heyn's team is coordinating a benchmark program to systematically compare single-cell genomics techniques. His team has designed a reference sample (human peripheral blood mononuclear cells, mouse colon cells, and different cell lines) and will ship it to partnering research universities around the world. Each lab will perform specific RNA sequencing analysis techniques on the sample cells, simulating tissue-derived data produced in the HCA project, and generate sequencing data in order to perform a cross-platform comparison of the methods.
"If you have a reference genome, you can map out new experimental datasets, and compare them to the reference cluster, noticing similarities and differences," Heyn explained.
According to Heyn, the researchers are developing a second version of the single-cell analysis tool, BigSCale2, that will include additional analytic modules to facilitate data interpretation for the user. Not only will BigSCale2 cluster and define cell types, but it will also have options to extract correlating gene sets and perform a trajectory analysis. Heyn's team aims to process and convolute other large datasets, and to offer the iCells to the public to increase the data's public usage.
Heyn said that his team does not plan to commercialize the technology, noting that the software and code behind the BigSCale technology are currently open source.
Within the clinical space, Heyn noted, the BigSCale method may be applicable for certain procedures in the future, as many large-scale datasets related to quantification are not single-cell RNA-sequencing related. The authors of the study also noted that they "foresee a potential application of the convolution strategy in other large data types, such as single-cell mass cytometry data."
Heyn noted that in the future his team will also use the BigSCale tool to cross-annotate large datasets and observe similarities between different cell clusters. In addition, his team has multiple national and international collaborations ongoing and aims to extend them to larger consortia on biomedical and clinical research.
"Whenever you can define differences between the cells or samples, across a large dataset, the technique could [act as] a suitable pipeline to sensitively quantify and define differences between healthy and infected samples," Heyn said.