CHICAGO – Researchers from Harvard University's Department of Stem Cell and Regenerative Biology and computer hardware-maker Nvidia have developed deep-learning technology that removes noise from sequencing data, so that low-cell-count, low-coverage, or low-quality ATAC-seq experiments can still yield useful results.
Called AtacWorks, the software produces results from even single-cell studies that are similar in accuracy to earlier methods that require 10 times the number of cells, according to newly published research in Nature Communications.
ATAC-seq, short for assay for transposase-accessible chromatin using sequencing, measures open chromatin using a Tn5 transposase enzyme that inserts sequencing adapters into accessible regions of the genome.
ATAC-seq is widely used for epigenomics, and has evolved to include single-cell epigenomic analysis of rare cell types. However, the technology's effectiveness at detecting changes in accessible chromatin has always depended on signal-to-noise ratio and sequencing depth, and single-cell research is particularly sensitive to the quality of tissue.
The researchers from Harvard and Santa Clara, California-based Nvidia turned to deep learning to overcome these shortfalls, since the same kind of technology has helped remove noise from speech and fill in gaps in digital images.
AtacWorks is built on a residual neural network (ResNet) framework, which is widely used in image classification and localization. In creating the software, Nvidia adapted that model for genomics.
"We're taking the architecture that was originally developed for imaging, but instead we are feeding it DNA sequencing data," Avantika Lal, senior scientist on the Nvidia genomics team and lead author of the paper, said during an online press briefing. Harvard professor Jason Buenrostro, who developed the ATAC-seq method while a graduate student at Stanford University in 2013, is also listed as an author.
Nvidia claimed on its blog that whole-genome inference by this computing model takes less than 30 minutes with its tensor-core graphics processing units (GPUs), compared to 15 hours on a more traditional high-performance computing system with 32 CPU cores.
"Unlike previous deep learning methods for epigenomics, AtacWorks denoises ATAC-seq signal at base-pair resolution and simultaneously predicts the genomic location of accessible regulatory elements," according to the Nature Communications article.
AtacWorks "denoises" low-coverage and low-quality ATAC-seq signals, effectively upscaling them to a higher resolution and higher quality. The software has been trained to predict a coverage track of chromatin accessibility at base-pair resolution, as well as peak calls.
With the technology, Nvidia's collaborators at Harvard were able to identify two rare subpopulations of hematopoietic stem cells, one primed toward the lymphoid lineage and one toward the erythroid lineage, from only a small number of cells.
"This reveals new mechanisms of blood cell development, which we wouldn't have been able to discover without deep learning," Lai said.
The Nature Communications paper demonstrated how AtacWorks boosts the resolution of the chromatin accessibility signal for subsampled low-coverage bulk ATAC-seq data, and also removes noise from cell types that were not even part of the deep-learning training set because it learns "generalizable" characteristics of chromatin accessibility.
The latter feature allowed the researchers to analyze aggregated single-cell ATAC-seq data from a small number of cells at a time.
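The aggregation step itself is straightforward: per-base signal from a handful of cells is summed into a single, noisy track that then serves as the model's input. The short simulation below sketches that step with made-up arrays standing in for real fragment files.

```python
# Sketch of aggregating single-cell ATAC-seq signal before denoising.
# Hypothetical simulated data; a real pipeline would read a fragment/BAM file.
import numpy as np

rng = np.random.default_rng(0)
n_cells, window = 50, 6000

# Sparse per-cell insertion counts across a 6,000 bp window (simulated).
per_cell_coverage = rng.binomial(1, 0.002, size=(n_cells, window))

# Aggregate signal across cells: this low-coverage track is the kind of
# noisy input a denoising model would upscale toward a high-coverage reference.
aggregate_track = per_cell_coverage.sum(axis=0)
print(aggregate_track.shape, aggregate_track.sum())
```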
The Nvidia-Harvard team was also able to adapt AtacWorks to make cross-modality predictions of transcription factor footprints and ChIP-seq peaks from low-quality ATAC-seq inputs.
With AtacWorks, the investigators trained deep-learning models on bulk ATAC-seq data from four types of human cells: B cells, natural killer (NK) cells, and CD4+ and CD8+ T cells, sampling each to a depth of 50 million reads to produce a clean, high-coverage dataset for each type. They identified peaks with MACS2, a peak caller commonly used on ATAC-seq data, then subsampled each dataset to depths as low as 200,000 reads and trained the model to reconstruct the clean coverage tracks and peak calls from the lower-coverage signals.
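A rough sketch of that training setup is shown below, under the assumption of a mean-squared-error term for reconstructing the clean coverage track and a binary cross-entropy term for the peak labels; the published loss and model may differ, and the names here are placeholders.

```python
# Illustrative training step: input is a subsampled (noisy) track, targets are
# the clean high-coverage track and its MACS2-derived peak labels. Assumed setup.
import torch
import torch.nn as nn


def training_step(model, noisy_track, clean_track, peak_labels, optimizer):
    """One gradient step: reconstruct the clean signal and the peak calls."""
    optimizer.zero_grad()
    pred_coverage, pred_peak_logits = model(noisy_track)
    loss = (
        nn.functional.mse_loss(pred_coverage, clean_track)             # denoising
        + nn.functional.binary_cross_entropy_with_logits(
            pred_peak_logits, peak_labels                               # peak calling
        )
    )
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Toy stand-in for the denoising network: one conv layer with two outputs.
    class ToyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv = nn.Conv1d(1, 2, kernel_size=25, padding=12)

        def forward(self, x):
            out = self.conv(x.unsqueeze(1))      # (batch, 2, positions)
            return out[:, 0], out[:, 1]          # coverage, peak logits

    model = ToyModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    noisy = torch.rand(4, 6000)
    clean = torch.rand(4, 6000)
    labels = (torch.rand(4, 6000) > 0.95).float()  # sparse 0/1 peak labels
    print(training_step(model, noisy, clean, labels, opt))
```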
"ATAC-Seq allows us to identify variants that increase our risk of disease by changing the accessibility of our DNA, and it can also tell us which specific types of cells in our body are affected by these variants and how these changes in DNA accessibility can lead to disease," Lal said.
"Current methods for ATAC-Seq analysis require the signals from typically thousands or at least hundreds of cells to be aggregated together, and the fewer cells that you have, the noisier the signal that you get," Lal said during the press briefing.
The smaller the amount of DNA sequenced, the noisier the signal in ATAC-Seq and the less accurate the results. "It becomes much harder to identify these accessible regions, and this limits the resolution at which we can study biology," Lal said.
In the paper, the researchers described how AtacWorks performed on noise-heavy data from human erythroblasts.
"Existing state-of-the-art methods were not able to identify accessible DNA from this," Lal said of the noisy signal fed into MACS2. AtacWorks, however, accurately predicted DNA accessibility at every position of the genome tested and also identified sites of accessible DNA that had previously been missed at such a low depth.
"This is an order-of-magnitude increase in the resolution at which we can study the biology of DNA," Lal said.
The researchers also applied their AtacWorks denoising technology to high-throughput single-cell ATAC-seq data. This process, they said, improved signal accuracy and peak calling for data aggregated from NK cells.
"Though we observed improved signal quality and peak calls for any number of cells, the results on 1 and 5 cell samples may be too noisy for downstream biological analysis, possibly due to single-cell heterogeneity not captured by the aggregate data used for training," the researchers wrote.
They also tested AtacWorks on transcription factor footprinting, which typically requires at least 100 million reads, and on predicting ChIP-seq peaks from low-input ATAC-seq. According to the paper, the software performed both tasks with high accuracy despite using smaller data samples than previous technologies required.
"These cross-modality predictions demonstrate the potential for AtacWorks to generate multiple layers of information in single cells from one of the most commonly-used epigenomic assays, at no additional cost," the authors wrote. "It is generally experimentally challenging to make multiple measurements from the same cells, so this approach may be especially useful in cases where running multiple ChIP-seq experiments is infeasible due to time, reagents, sample availability, or biological variability."
They also said that the technology "may be broadly useful for other deep learning applications in genomics, such as DNase, MNase, ChIP-seq, and the recently-developed method CUT&RUN, which has comparable high-throughput single-cell adaptations."
AtacWorks is freely available on GitHub, as well as through the Nvidia GPU Cloud (NGC) platform for scientific computing. Lal said that the software is free, but recommended that it be run on Nvidia GPUs.
Because it has been available on GitHub for a year, AtacWorks already has a user community outside of Nvidia's collaboration with Harvard, Lal said. However, the Nature Communications paper represents the first published research based on the software.
"Based on these advancements, we anticipate that AtacWorks will broadly enhance the utility of epigenomic assays, providing a powerful platform to investigate the regulatory circuits that underlie cellular heterogeneity," the paper concluded.
She said that the paper's reviewers asked if AtacWorks was effective for any type of DNA sequence. "We can train a model using whatever data we have available and then apply it to entirely new biological samples," she said during the press briefing.
She said it is targeted toward anyone in computational biology.