NEW YORK – A new DNA methylation atlas that may be the most comprehensive to date in terms of characterizing methylation by tissue type aims to improve our ability to decipher the tissue of origin of cell-free DNA (cfDNA), helping researchers and physicians detect disease and monitor treatment response.
In a proof-of-concept study published Monday in PNAS, researchers from the University of California, Los Angeles and molecular diagnostics firm Early Diagnostics used the atlas to develop a first-of-its-kind supervised deep learning tissue deconvolution algorithm to trace the origins of circulating cfDNA.
The method, called cfSort, is being licensed to UCLA spinout Early Diagnostics, while the atlas is publicly available.
The atlas consists of 521 samples covering 29 major types of noncancerous human tissue from the GTEx Consortium, all sequenced via high-resolution reduced representation bisulfite sequencing (RRBS).
Jasmine Zhou, cofounder and co-CEO of Early Diagnostics and the study's senior author, explained that many other tissue methylation datasets average methylation levels across all DNA fragments, whose heterogeneity can obscure tissue-specific signals arising from minor cell populations. In contrast, Zhou and her colleagues built their atlas by analyzing methylation signals at the level of individual DNA fragments.
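To illustrate why averaging can hide a signal that fragment-level analysis keeps, here is a minimal toy sketch. The reads, the two hypothetical samples, and the "fully methylated fragment" statistic are illustrative assumptions, not the markers or metrics used in the study.

```python
# Toy illustration (hypothetical data): each read is a tuple of CpG states,
# 1 = methylated, 0 = unmethylated, over the same 4-CpG region.

def average_methylation(reads):
    """Region-level average: methylated CpGs divided by all CpGs observed."""
    total_cpgs = sum(len(read) for read in reads)
    return sum(sum(read) for read in reads) / total_cpgs

def fully_methylated_fraction(reads):
    """Fragment-level view: share of reads in which every CpG is methylated."""
    return sum(1 for read in reads if all(read)) / len(reads)

# Sample A: a minor cell population (5% of reads) contributes fully methylated fragments.
sample_a = [(0, 0, 0, 0)] * 95 + [(1, 1, 1, 1)] * 5
# Sample B: no such population; the same amount of methylation is scattered across reads.
sample_b = [(1, 0, 0, 0)] * 20 + [(0, 0, 0, 0)] * 80

for name, reads in (("A", sample_a), ("B", sample_b)):
    print(name, average_methylation(reads), fully_methylated_fraction(reads))
```

Both toy samples have the same average methylation, so an averaged dataset could not tell them apart, while the per-fragment statistic separates them.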
Billy Lau, a Stanford researcher specializing in DNA methylation sequencing, said that the atlas is "a big achievement in terms of the sheer number of samples and tissue types. It will be a good resource amongst the other methylation atlases that are out there."
He also commented that he appreciated the UCLA team making both the raw sequencing data and processed outputs available.
The UCLA team's strategy runs parallel to that used earlier this year by a group from the Hebrew University of Jerusalem to construct a DNA methylation atlas of various cell types. That atlas was built via whole-genome bisulfite sequencing of 39 cell types from 205 healthy individuals.
"The two atlases are very complementary," said Shuo Li, the UCLA study's lead author.
Zhou, Li, and their colleagues used in silico tissue samples derived from their tissue-type methylation atlas to develop cfSort, a supervised computational method that determines the tissue of origin of cfDNA samples.
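The general idea of training on in silico mixtures, where the tissue fractions are known and can serve as labels, can be sketched as follows. This is a simplified stand-in under stated assumptions: the reference profiles are random numbers rather than atlas data, and a plain linear least-squares model replaces the deep neural network that cfSort actually uses.

```python
import numpy as np

rng = np.random.default_rng(1)
n_markers, n_tissues = 100, 3

# Hypothetical reference: methylation level of each marker in each tissue.
reference = rng.uniform(0.0, 1.0, size=(n_markers, n_tissues))

def simulate(n_samples):
    """Create in silico mixtures with known tissue fractions (the labels)."""
    fractions = rng.dirichlet(np.ones(n_tissues), size=n_samples)
    noise = rng.normal(0.0, 0.02, size=(n_samples, n_markers))
    profiles = fractions @ reference.T + noise
    return profiles, fractions

def add_bias(features):
    """Append a constant column so the linear model can learn an intercept."""
    return np.hstack([features, np.ones((features.shape[0], 1))])

x_train, y_train = simulate(500)
x_test, y_test = simulate(50)

# Supervised step: fit a map from methylation profiles to labeled fractions.
weights, *_ = np.linalg.lstsq(add_bias(x_train), y_train, rcond=None)

# Evaluate on held-out mixtures the model has not seen.
y_pred = add_bias(x_test) @ weights
mean_abs_error = np.abs(y_pred - y_test).mean()
print(f"mean absolute error on held-out mixtures: {mean_abs_error:.4f}")
```

Because the mixtures are simulated, as many labeled training examples as needed can be generated, which is the property a supervised approach depends on.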
Supervised machine learning relies on training data in which each input is labeled with its known output, in contrast to unsupervised learning, which works on unlabeled data.
"With supervised learning," Zhou said, "you can pull in as much training data as possible and the more training data you have, the more accurate is your prediction."
In a test of cfSort's analytical performance using in silico DNA samples, the method achieved higher accuracy than existing tissue deconvolution methods such as non-negative least squares (NNLS) and CelFiE. For instance, cfSort achieved a Pearson correlation of 0.997 between estimated and known tissue fractions, compared to 0.933 and 0.992 for NNLS and CelFiE, respectively.
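For readers unfamiliar with the NNLS baseline and the Pearson metric, the sketch below deconvolves simulated mixtures with `scipy.optimize.nnls` and scores the estimates against the known fractions. The reference matrix, noise level, and sample counts are made-up assumptions, so the resulting correlation says nothing about the figures reported in the study.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_markers, n_tissues, n_samples = 200, 4, 50

# Hypothetical reference: methylation level of each marker in each tissue.
reference = rng.uniform(0.0, 1.0, size=(n_markers, n_tissues))

# Known tissue fractions for each simulated sample (each row sums to 1).
true_fractions = rng.dirichlet(np.ones(n_tissues), size=n_samples)

# Observed mixture = weighted sum of tissue profiles plus measurement noise.
noise = rng.normal(0.0, 0.01, size=(n_samples, n_markers))
mixtures = true_fractions @ reference.T + noise

def deconvolve(profile, reference_matrix):
    """Estimate non-negative tissue weights, then rescale them to sum to 1."""
    coefficients, _residual = nnls(reference_matrix, profile)
    total = coefficients.sum()
    return coefficients / total if total > 0 else coefficients

estimated = np.array([deconvolve(profile, reference) for profile in mixtures])

# Pearson correlation between known and estimated fractions, pooled over samples.
pearson_r = np.corrcoef(true_fractions.ravel(), estimated.ravel())[0, 1]
print(f"Pearson r between known and estimated fractions: {pearson_r:.4f}")
```

NNLS assumes the mixture is a linear combination of fixed reference profiles, which is why it needs no training step, unlike a supervised model.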
Stanford's Lau commented that these marginal improvements over existing methods open questions as to cfSort's utility.
The use of a deep learning framework for deconvolution in this case, he said, "may be completely overkill."
"Other simpler machine learning methods that don't use neural networks could be just as competitive with some simple tweaks," he commented. "It's important to note that other methods were not bad at all. CelFiE was really competitive despite not using deep learning."
Lau added that he is interested to see in future studies how cfSort's performance breaks down at lower coverage, noting that the current study tested from 20X coverage to 120X, "which is only feasible with RRBS or other targeted methods."
One such targeted method is cfMethylSeq, a cost-effective, genome-wide cfDNA methylation profiling assay that Zhou and her group invented and published last year, and which is also licensed to Early Diagnostics.
"CfMethylSeq is the experimental assay, and cfSort is the computational method to deconvolute cfMethylSeq data," Zhou said.
As a test of implementing this pipeline, cfSort was applied to cfMethylSeq data from plasma cfDNA samples from 100 healthy individuals, 21 cirrhosis patients, and 201 cancer patients (98 lung, 27 liver, 47 colorectal, and 29 stomach cancer patients), where the algorithm identified high tissue fractions in all disease samples.
The investigators also tested the use of cfSort in monitoring treatment response in cfDNA data from four non-small cell lung cancer patients receiving anti-PD-1 immunotherapy. In this test, cfSort showed that cfDNA levels from the liver and kidneys changed consistently with biochemical test results for those organs. By flagging potential treatment-related damage in noncancerous tissue, the result demonstrated cfSort's potential for monitoring treatment side effects. Although more testing would be required to prove it, the result also raises the possibility of monitoring for side effects in organs that lack standard biochemical markers.
In the end, Zhou said, "we want to use cell-free DNA to create a general health monitoring tool."
A provisional patent has been filed for cfSort, and Zhou and her colleagues are currently planning more studies applying the method to different disease types.
"There has been a lot of work in deconvoluting methylation profiles and generating methylation atlases," said Lau. "It's important that works like this exist for the scientific community."