NEW YORK – A computational biology group at Imperial College London and the University of Cambridge recently published a novel method for detecting and discriminating between cancers that they believe could be an engine for improved multi-cancer screening tests in the clinic.
Dubbed EMethylNET, the method combines machine learning and deep learning, and the team trained it on cancer tissue datasets from 13 cancer types. The resulting classifier, when given an unknown sample, could distinguish a cancer sample from a normal sample and identify the cancer type with nearly 98 percent accuracy.
Typically, the analytic and computational methods at the heart of genomic tests are closely held, patented, and sparsely described by the companies that claim them. But the UK researchers have made their data, AI models, and code freely available to the scientific community, said Shamith Samarajiwa, the paper's senior author and the head of the computational biology and genomic data science group at Imperial College London.
"It's been known for a while that epigenetic changes … such as DNA methylation pattern changes … occur early in the cancer-forming process," Samarajiwa said in an email.
In their study, published June 20 in Biology Methods & Protocols, he and his colleagues described how they developed EMethylNET. They first used an XGBoost model, a type of machine learning, to identify cancer type-specific methylation changes in tumor tissue methylation data from The Cancer Genome Atlas (TCGA). Those methylation features were then fed into a deep neural network to develop a predictive classifier.
The result was a discriminator for the presence or absence of cancer and for the type of cancer, with striking accuracy of nearly 98 percent in the initial training set.
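The two-stage design described above (first select informative methylation sites, then train a classifier on only those sites) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' code: the data are synthetic, a simple class-mean spread score stands in for XGBoost feature importance, a nearest-centroid rule stands in for the deep neural network, and every function and variable name is hypothetical.

```python
import random
from statistics import mean


def make_toy_samples(n_per_class, n_sites, shifted_sites, seed):
    """Synthetic methylation beta values in [0, 1]; purely illustrative."""
    rng = random.Random(seed)
    samples, labels = [], []
    for label, shifts in shifted_sites.items():
        for _ in range(n_per_class):
            row = []
            for site in range(n_sites):
                centre = shifts.get(site, 0.5)  # unshifted sites sit at 0.5
                row.append(min(1.0, max(0.0, rng.gauss(centre, 0.05))))
            samples.append(row)
            labels.append(label)
    return samples, labels


def class_means(samples, labels):
    """Per-class mean value at every site (column)."""
    grouped = {}
    for row, label in zip(samples, labels):
        grouped.setdefault(label, []).append(row)
    return {lab: [mean(col) for col in zip(*rows)] for lab, rows in grouped.items()}


def select_sites(samples, labels, k):
    """Stage 1 stand-in: keep the k sites whose class means differ most."""
    means = class_means(samples, labels)
    n_sites = len(samples[0])
    spread = [
        max(m[s] for m in means.values()) - min(m[s] for m in means.values())
        for s in range(n_sites)
    ]
    top = sorted(range(n_sites), key=lambda s: spread[s], reverse=True)[:k]
    return sorted(top)


def fit_centroids(samples, labels, sites):
    """Stage 2 stand-in: one centroid per class over the selected sites."""
    reduced = [[row[s] for s in sites] for row in samples]
    return class_means(reduced, labels)


def predict(row, sites, centroids):
    """Assign the class whose centroid is closest (squared distance)."""
    reduced = [row[s] for s in sites]
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(reduced, centroids[lab])),
    )


# Hypothetical classes: each cancer type shifts methylation at a few sites.
SHIFTS = {
    "normal": {},
    "cancer_A": {3: 0.9, 7: 0.1},
    "cancer_B": {7: 0.9, 12: 0.9},
}

train_x, train_y = make_toy_samples(30, 20, SHIFTS, seed=0)
test_x, test_y = make_toy_samples(30, 20, SHIFTS, seed=1)

selected = select_sites(train_x, train_y, k=3)
centroids = fit_centroids(train_x, train_y, selected)
predictions = [predict(row, selected, centroids) for row in test_x]
accuracy = mean(1.0 if p == t else 0.0 for p, t in zip(predictions, test_y))

print("selected sites:", selected)
print("held-out accuracy:", accuracy)
```

Because the second stage only ever sees the selected sites, it is possible to inspect exactly which sites contribute to a call. The real method operates on far more sites and on real tumor data, so the toy numbers here say nothing about its performance.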
When the team then applied EMethylNET to independent, non-TCGA data representing 940 cancers and controls across nine cancer types, its accuracy was over 80 percent for all but one cohort, and more than 90 percent for half of the test sets.
The two main exceptions to this high performance were the head and neck cancer and colon adenocarcinoma cohorts.
In the latter, EMethylNET classified as cancer all of the adenomas, which had been labeled as normal in the original study. However, the authors wrote that in clinical practice adenomas are commonly removed when found during colonoscopy, suggesting that sensitivity to these precursor lesions could actually be a boon to the classifier.
For the independent head and neck cancer dataset, the authors wrote that the low performance could stem from the fact that the samples represented only one tissue of origin, the oropharynx, which made up less than 2 percent of the head and neck cancers in the TCGA training data. Half of the cancers in this cohort were also HPV-driven, and these viral tumors are known to display different methylation patterns than their nonviral counterparts.
Samarajiwa said that an important distinguishing factor from other algorithms is that EMethylNET allows for further investigation of the location of the methylation changes that the algorithm considers in its classifications.
"The methylation patterns identified by EMethylNET enable us to understand where these changes are and why the algorithm thinks these are important. This makes our method both interpretable and explainable unlike most AI [approaches], which use uninterpretable black box methods," he said.
For example, the team was able to determine that the majority of the cancer-specific methylation changes that ended up in their classifier target the same genes and pathways that are typically disrupted by mutations.
According to the authors, their overrepresentation analysis revealed that the genes assessed by EMethylNET were enriched in processes linked to cancer hallmarks. Mining the scientific literature, the team found that cancer-associated methylation changes in 892 of its targets were supported by 7,831 publications. The algorithm's methylated gene set included 229 known tumor suppressors and oncogenes, 546 transcriptional regulators, and many noncoding RNA genes, which the authors wrote are increasingly recognized as playing a key role in carcinogenesis.
Samarajiwa said that although EMethylNET was trained on just 13 cancer types, it could be extended to detect hundreds of cancer types depending on the availability of adequate training data.
An open question for the method's clinical translation to multi-cancer screening is how its accuracy might hold up when applied to circulating tumor DNA, an area the researchers plan to investigate. In practice, early cancer detection requires screening an asymptomatic population via blood samples.
For that reason, Samarajiwa said that while the predictive accuracy he and his team saw in their study was higher than what other methods, including Grail's clinically available Galleri test, have demonstrated in blood, it should not be directly compared to blood-based results.
The investigators also highlighted diagnosis of metastatic cancers of unknown origin as a potential future application, although they wrote that the current method hasn't yet been optimized for that use.
"Coupled with new sequencing methods such as nanopore sequencing that also directly provide methylation information, this may be a way to diagnose cancer accurately (and at a much lower cost) in the future," Samarajiwa said. "With more training datasets and clinical testing, we believe that we can improve our methods to be clinically useful."