NEW YORK (GenomeWeb) – A Wellcome Trust Sanger Institute-led team has developed a computational tool for clustering single-cell RNA-sequencing data that the researchers said could help define different cell types.
Called single-cell consensus clustering, or SC3, the unsupervised clustering tool combines multiple clustering solutions using a consensus approach. As the researchers reported in Nature Methods today, they found their open-source tool to be highly accurate and stable, and applied it to the transcriptomes of two myeloproliferative neoplasm patients to identify subclones.
"We created the new SC3 tool to analyze complex single-cell RNA-sequence data, and showed that it is more robust and accurate than existing methods at grouping cells," first author Vladimir Yu Kiselev from the Sanger Institute said in a statement. "The SC3 tool contains added features that help interpret the biological function of the cells in that group, such as lists of marker genes for each group."
SC3, which is an R package, generates a set of clusterings in parallel. Those clusterings are then combined into a consensus matrix that records how often a given pair of cells falls within the same cluster. The consensus matrix then undergoes complete-linkage hierarchical clustering into a user-specified number of groups.
The researchers noted that SC3's integration with Bioconductor and scater should make it easy to incorporate into existing workflows.
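The consensus step described above can be illustrated with a minimal Python sketch. It is not the SC3 implementation, which is an R package. The function names, the toy input, and the use of one minus the consensus value as the distance between cells are all assumptions made for illustration.

```python
from itertools import combinations

def consensus_matrix(clusterings):
    """Fraction of clusterings in which each pair of cells shares a cluster."""
    n = len(clusterings[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    runs = len(clusterings)
    return [[v / runs for v in row] for row in counts]

def complete_linkage(consensus, k):
    """Agglomerate cells into k groups, merging the closest pair each round.

    Assumption: distance between two cells is 1 - consensus value.
    """
    n = len(consensus)
    clusters = [[i] for i in range(n)]

    def dist(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(1.0 - consensus[i][j] for i in a for j in b)

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]

    labels = [0] * n
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels

# Hypothetical input: four clustering runs over six cells. Label names
# differ between runs; only co-membership matters.
toy_runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
matrix = consensus_matrix(toy_runs)
labels = complete_linkage(matrix, k=2)  # -> [0, 0, 0, 1, 1, 1]
```

In this toy case the runs disagree about cells 2 and 3, but the consensus matrix still places cell 2 with the first group and cell 3 with the second, which is the kind of stabilizing effect a consensus approach is meant to provide.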
The researchers compared their approach to five other methods and found that SC3 generally performed better. In their benchmarking, they considered accuracy as well as stability.
But although SC3 was highly stable, they noted that this stability came at a computational cost: clustering 2,000 cells took about 20 minutes. By reducing the number of runs considered, they could boost the number of cells clustered in 20 minutes to 5,000, though they also saw a dip in accuracy.
For larger datasets, the researchers used a hybrid approach that combines unsupervised and supervised methods. SC3 selected a set of 5,000 cells at random for clustering, and those results were then used to train a support vector machine to label the remaining cells. They reported that their analysis of a Drop-seq dataset of 44,808 cells with this approach was in broad agreement with the original study.
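The hybrid strategy can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SC3 code: the function names are hypothetical, the support vector machine is a simple linear one without an intercept (so the toy data are centered on zero), and `toy_cluster_fn` is a placeholder standing in for the unsupervised clustering step.

```python
import random

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=20):
    """Binary linear SVM (labels +1/-1), no intercept, fitted by
    subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            w = [wj * (1.0 - lr * lam) for wj in w]  # regularization shrinkage
            if margin < 1.0:                          # margin violated
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
    return w

def fit_one_vs_rest(X, labels):
    """Train one binary SVM per cluster label."""
    return {c: train_linear_svm(X, [1 if l == c else -1 for l in labels])
            for c in sorted(set(labels))}

def predict(models, x):
    """Assign the label whose SVM gives the highest score."""
    return max(models, key=lambda c: sum(wj * xj for wj, xj in zip(models[c], x)))

def hybrid_cluster(cells, cluster_fn, n_sample, seed=0):
    """Cluster a random subset, then label the remaining cells with an SVM."""
    rng = random.Random(seed)
    train_idx = sorted(rng.sample(list(range(len(cells))), n_sample))
    train_cells = [cells[i] for i in train_idx]
    train_labels = cluster_fn(train_cells)      # unsupervised step
    models = fit_one_vs_rest(train_cells, train_labels)
    labels = [None] * len(cells)
    for i, lab in zip(train_idx, train_labels):
        labels[i] = lab
    for i in range(len(cells)):
        if labels[i] is None:                   # supervised step
            labels[i] = predict(models, cells[i])
    return labels

# Hypothetical data: 20 "cells" with two features, in two separated groups.
cells = [(-3 + 0.1 * i, -3 - 0.1 * i) for i in range(10)] + \
        [(3 + 0.1 * i, 3 - 0.1 * i) for i in range(10)]

def toy_cluster_fn(subset):
    # Placeholder for the clustering step: split on the first feature.
    return [0 if c[0] < 0 else 1 for c in subset]

labels = hybrid_cluster(cells, toy_cluster_fn, n_sample=12)
```

Only the sampled cells pass through the expensive clustering step; every other cell costs a single score computation per cluster, which is what makes a scheme of this kind attractive for datasets with tens of thousands of cells.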
The researchers also reported that SC3 could help interpret clustering results by identifying differentially expressed genes, marker genes, or outlier cells. In particular, they used this capability to analyze single-cell RNA-seq data from two patients with myeloproliferative neoplasms, a pre-malignant condition in which terminally differentiated myeloid cells are overproduced. For one patient, the tool uncovered three clusters, while it found a single cluster for the other.
By growing cells from these patients, the researchers found that patient one indeed harbored three subclones — one with mutations in both TET2 and JAK2V617F, one with just TET2 mutations, and one of wild-type cells — while patient two had a single clone with TET2 and JAK2V617F mutations.
"The SC3 tool was able to use patterns of gene expression to distinguish, within an individual cancer, subclones that carried different mutations," Cambridge University's Anthony Green added. "This approach will help us define the cellular heterogeneity within each cancer, an important step towards improving cancer treatment."