UCSD Lab's Tools Stand out in Single-Cell ATAC-seq Data Analysis Benchmarking Study

NEW YORK – Researchers in Switzerland have published a study benchmarking computational tools, released over the last five years, for analyzing single-cell ATAC-seq (assay for transposase-accessible chromatin by sequencing) data, providing users with advice for selecting the best method for their particular application.

The team, comprising researchers from ETH Zurich and the University of Zurich, ran several datasets through eight "feature engineering pipelines" derived from five different methods in order to discover and discriminate cell types.

"Our analysis provides guidelines for choosing analysis methods for different datasets," the authors wrote in a paper published last month in Genome Biology, noting that SnapATAC and SnapATAC2 (bioinformatics packages developed in Bing Ren's lab at the University of California, San Diego) generally outperformed other methods, especially for datasets with "complex cell-type structures."

However, "we wouldn't say that [SnapATAC2] is the universally best choice," the authors said in a statement provided to GenomeWeb. "It is not the most memory-efficient method despite using on-disk storage, a similar strategy as in ArchR," one of the other packages evaluated in the study. "Another thing that is not the focus of our benchmark but can be relevant to users is that, as a toolkit package, SnapATAC2 is not as comprehensive as ArchR and Signac, because it doesn’t include downstream functionalities like motif analysis or co-accessibility analysis."

"I think the paper is solid and does a good job assessing the tools in several contexts and leads to the conclusion that is already pretty obvious, that you generally want to use different tools for different datasets, and it is worth assessing several," said Andrew Adey, a single-cell sequencing expert at Oregon Health & Science University who has used SnapATAC2, among other tools.

The study provides a fresher look at single-cell ATAC-seq data analysis options than a 2019 benchmarking study, also published in Genome Biology, especially with the inclusion of SnapATAC2, which was published in January of this year. ArchR, from William Greenleaf's lab at Stanford University, which pioneered single-cell ATAC-seq, and Signac, from single-cell sequencing bioinformatics maven Rahul Satija of New York University and the New York Genome Center, were both introduced in 2021.

Moreover, the study offers a focused look at single-cell ATAC-seq data analysis. "While the single-cell transcriptomics field has matured, and to some degree converged, methodologically, for single-cell chromatin assays, there remains a lot of major unknowns," the Swiss team said. "In particular, there are critical ways in which scATAC-seq data differs from single-cell RNA-seq and prevent a direct application of methods developed for the latter."

Unlike single-cell transcriptomics, which is able to count expressed genes, "features are not defined a priori for ATAC-seq, and typical analyses rely either on tiling over the whole genome or calling peaks (i.e., candidate regulatory elements) from the data itself, both of which come with their own issues and limitations," the study authors noted.
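To make the tiling idea concrete, the following is a minimal sketch, not code from any of the benchmarked packages. It assumes fragments are given as hypothetical (chromosome, start, end, barcode) tuples, uses an arbitrary 500-base-pair bin size, and counts both ends of each fragment as insertion sites.

```python
from collections import Counter


def tile_counts(fragments, bin_size=500):
    """Count fragment ends per (barcode, chromosome, bin index).

    Assumption: each fragment is a (chrom, start, end, barcode) tuple with a
    half-open [start, end) interval, and both ends count as insertion sites.
    """
    counts = Counter()
    for chrom, start, end, barcode in fragments:
        for position in (start, end - 1):
            counts[(barcode, chrom, position // bin_size)] += 1
    return counts


# Hypothetical toy fragments for two cell barcodes.
fragments = [
    ("chr1", 100, 300, "AAAC"),
    ("chr1", 450, 620, "AAAC"),
    ("chr1", 900, 1100, "TTTG"),
]
counts = tile_counts(fragments)
for key in sorted(counts):
    print(key, counts[key])
```

Every bin becomes a feature whether or not it contains a regulatory element, which is why tiling needs no prior peak calling but yields a very large, sparse matrix.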

Even within one package, such as Signac, there can be multiple options with little information available to users on which ones to choose. In addition to feature aggregation and SnapATAC, the study evaluated both peak-calling and tiling options in ArchR, "all cell peaks" and "by cluster peaks" options in Signac, and SnapATAC2's "cosine" and "jaccard" options.
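For readers unfamiliar with the two similarity measures named in SnapATAC2's options, here is a minimal sketch of their textbook definitions on binary accessibility profiles. It is not SnapATAC2's implementation; representing each cell as a set of open-region indices is an assumption made for illustration.

```python
from math import sqrt


def jaccard(cell_a, cell_b):
    """Shared open regions divided by regions open in either cell."""
    union = cell_a | cell_b
    return len(cell_a & cell_b) / len(union) if union else 0.0


def cosine(cell_a, cell_b):
    """Cosine similarity of two binary vectors, written with set sizes."""
    if not cell_a or not cell_b:
        return 0.0
    return len(cell_a & cell_b) / sqrt(len(cell_a) * len(cell_b))


# Hypothetical cells, each described by the indices of its open regions.
cell_a = {1, 2, 3, 4}
cell_b = {3, 4, 5, 6}
print(jaccard(cell_a, cell_b))  # 2 shared / 6 in the union
print(cosine(cell_a, cell_b))   # 2 shared / sqrt(4 * 4)
```

The two measures normalize the shared-region count differently, so they can rank cell pairs differently when cells differ in how many regions are open.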

The study ran six public datasets through each pipeline. The data came from the original ArchR publication, a single-cell chromatin accessibility atlas published in 2021 by Ren's lab, a study from Greenleaf's lab on human hematopoietic cell differentiation, a dataset of peripheral blood mononuclear cells provided by 10x Genomics, and a 2019 study of single-cell gene expression and chromatin accessibility published in Nature Biotechnology.

"We were surprised to observe that the way features are defined (e.g., peaks versus genome tiles, overall peak calling versus per-cluster) is not as critical as we expected," the authors told GenomeWeb. However, the number of features used "[made] a major difference" and even explained some of the differences between the methods. "For example, ArchR performance improves when using more features than by default, and SnapATAC did not perform as good with lower-than-default numbers of features," they said.

In addition to the various packages, which all select a subset of "features," the study used an "aggregation method [which] clusters correlated features and then sums them up into meta-features," Siyuan Luo, a doctoral candidate at the University of Zurich and the first author of the paper, told GenomeWeb. "This has the advantage of using all the information (albeit in a less-specific form), and of being easier to properly normalize. Then standard methods can be used downstream. … We included it out of curiosity and were rather surprised by its good performance. But at the moment, it's chiefly a proof of concept of the aggregation strategy."
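The aggregation strategy Luo describes can be illustrated with a small sketch. This is not the authors' implementation: the greedy grouping by Pearson correlation, the 0.5 threshold, and the function names are all assumptions chosen to keep the example short.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def aggregate_features(matrix, threshold=0.5):
    """Group correlated features, then sum each group into one meta-feature.

    matrix: list of cells, each a list of per-feature counts.
    Assumption: a feature joins the first group whose seed feature it
    correlates with at or above `threshold`; otherwise it starts a new group.
    """
    n_features = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_features)]
    groups = []
    for j in range(n_features):
        for group in groups:
            if pearson(columns[group[0]], columns[j]) >= threshold:
                group.append(j)
                break
        else:
            groups.append([j])
    meta = [[sum(row[j] for j in group) for group in groups] for row in matrix]
    return groups, meta


# Hypothetical 4 cells x 4 features count matrix.
cells = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
groups, meta = aggregate_features(cells, threshold=0.5)
print(groups)  # [[0, 1], [2, 3]]
print(meta)    # [[2, 0], [2, 0], [0, 2], [0, 1]]
```

No feature is discarded here: each one contributes to exactly one meta-feature, which reflects the "using all the information" property mentioned in the quote.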

While the study presented both simple and complex datasets, as defined by the cell types they contained, Adey said he would like to see how the methods perform on noisy datasets. "Lots of tissues we work with generate noisy data, and it is all we can get, regardless of the technology used," he said. "Some of these tools (the ones with iterative clustering) will generate beautiful looking clusters, but half of them are a mix of cells from very different cell types. We have found running the iterative ones with only one iteration is best, but then they perform more comparably to other methods."

Identifying rare populations or highly related subpopulations within complex tissues is still challenging, the authors noted, "due to the data sparsity and low signal-to-noise ratio — none of these methods are always performing perfectly in our benchmark."

Identifying the regulatory elements that define the identity of rare subpopulations is also a remaining challenge. "While scATAC-seq identifies regions of open chromatin, linking these regions to their functional roles, such as controlling the expression of specific genes, remains complex," they said.