CHICAGO – Computational and systems biologists at the Agency for Science, Technology, and Research (A*STAR)'s Genome Institute of Singapore have devised an artificial intelligence-powered method of somatic variant calling that they claim removes human variability while also improving the scope of mutation detection.
VarNet developers described the software as an "end-to-end deep learning approach for identification of somatic variants from aligned tumor and matched normal DNA reads."
VarNet applies deep-learning models trained on whole-genome tumor sequences to predict somatic single-nucleotide variants (SNVs) and indels. As input, the software creates "image representations" of aligned tumor-normal reads, featuring annotations related to base quality, mapping quality, and strand bias.
A*STAR researchers built the image representations from 4.6 million "high-confidence" somatic variants from whole genomes of 356 tumors, including 2.5 million data points for the SNV model and 2.1 million for the indel model.
The bioinformaticians chose image representations because, according to lead author Kiran Krishnamachari, deep learning has made considerable progress in recent years in "computer vision," the ability of computers to make sense of images. For VarNet, the A*STAR researchers generated images of genomic sites that represent features of each sequence, such as base and mapping quality.
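To make the idea of an image representation concrete, the following is a minimal Python sketch of how reads at one candidate site might be turned into a numeric grid. It is not VarNet's actual encoding, which the article does not detail. The `Read` structure, the seven-column layout, the quality cap of 60, and the `max_reads` padding are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

BASES = "ACGT"


@dataclass
class Read:
    """One aligned read at the candidate site (hypothetical structure)."""
    base: str        # base called at the candidate position
    base_qual: int   # Phred base quality
    map_qual: int    # Phred-scaled mapping quality
    reverse: bool    # True if the read aligns to the reverse strand


def encode_pileup(reads: List[Read], max_reads: int = 8) -> List[List[float]]:
    """Encode reads at one site as a max_reads x 7 grid.

    Columns 0-3 are a one-hot encoding of A, C, G, T. Column 4 is base
    quality and column 5 is mapping quality, both capped at 60 and scaled
    to [0, 1]. Column 6 is strand (0 = forward, 1 = reverse).
    Rows beyond the number of reads stay zero as padding.
    """
    n_features = len(BASES) + 3
    image = [[0.0] * n_features for _ in range(max_reads)]
    for row, read in zip(image, reads[:max_reads]):
        if read.base in BASES:
            row[BASES.index(read.base)] = 1.0
        row[4] = min(read.base_qual, 60) / 60.0
        row[5] = min(read.map_qual, 60) / 60.0
        row[6] = 1.0 if read.reverse else 0.0
    return image


def tumor_normal_image(tumor: List[Read], normal: List[Read],
                       max_reads: int = 8) -> List[List[float]]:
    """Stack the tumor rows on top of the matched-normal rows."""
    return encode_pileup(tumor, max_reads) + encode_pileup(normal, max_reads)
```

A grid like this could be passed to a convolutional network in the same way as a small grayscale image, which is the appeal of borrowing from computer vision. A realistic encoding would also cover the bases flanking the site, not just a single position.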
Data for training the algorithm came from the Cancer Genome Atlas as well as from whole-tumor sequencing data produced by hospitals and research institutes in Singapore. A*STAR said that VarNet is suitable for both clinical and research use.
The A*STAR bioinformaticians trained the deep-learning models on these image representations so that VarNet can predict the probability of mutations. The authors said their method could "augment and potentially supplant human-engineered features and heuristic filters in somatic variant calling."
Using "weakly supervised" deep learning, VarNet generates "high confidence pseudo-labels" for these whole tumor genomes — sequenced at depths of 50X to 150X — covering lung, sarcoma, colorectal, lymphoma, thyroid, liver, and gastric cancers, according to the paper.
Anders Skanderup, leader of the Genome Institute of Singapore's computational cancer genomics laboratory, explained that "weak supervision" means that the training data is not fully curated by humans, so algorithms fill in the gaps.
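One common way to produce labels without full human curation is to let existing tools vote. The Python sketch below illustrates that general idea only. The article does not say how VarNet's pseudo-labels were generated, so the consensus rule, the caller names, and the `min_agree` threshold are assumptions for illustration.

```python
from typing import Dict, Optional, Set, Tuple

# A candidate variant: (chromosome, position, reference allele, alternate allele)
Site = Tuple[str, int, str, str]


def pseudo_label(site: Site, calls_by_caller: Dict[str, Set[Site]],
                 min_agree: int) -> Optional[int]:
    """Assign a pseudo-label from the agreement of several callers.

    Returns 1 if at least min_agree callers report the site, 0 if none do,
    and None when the callers disagree, so ambiguous sites can be left out
    of training rather than labeled with low confidence.
    """
    votes = sum(site in calls for calls in calls_by_caller.values())
    if votes >= min_agree:
        return 1
    if votes == 0:
        return 0
    return None


def build_training_labels(candidates, calls_by_caller, min_agree):
    """Keep only candidates that received a confident pseudo-label."""
    labels = {}
    for site in candidates:
        label = pseudo_label(site, calls_by_caller, min_agree)
        if label is not None:
            labels[site] = label
    return labels
```

Dropping the ambiguous sites trades training-set size for label quality, which is one reason such an approach depends on having a large pool of candidates to begin with.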
"Deep-learning models typically require vast labeled training datasets to perform robustly. This poses a challenge to training deep-learning models for detecting cancer mutations, as it would require significant human effort to create such a dataset," he told GenomeWeb via email.
By employing a large dataset to train its algorithms, VarNet is able to avoid the need for a massive team of human data curators. This "enables the successful use of weak supervision for cancer mutation detection," Skanderup said.
Skanderup said that VarNet is different from earlier callers that detect cancer mutations because it "learns directly from the raw DNA sequencing data and avoids the need for expensive manual labeling of mutations using weak supervision."
Skanderup called VarNet "the first successful application of end-to-end deep learning for cancer mutation detection."
Earlier tools for somatic variant calling tend to analyze genomes based on statistical models of variant allele frequencies, then filter the results to remove false positives. The addition of machine learning helps researchers handle the burgeoning volume of data.
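The kind of human-designed rule that such tools layer on top of allele frequencies can be sketched in a few lines of Python. This is an illustrative heuristic, not the logic of any named caller, and the three thresholds are arbitrary assumptions.

```python
def vaf(alt_reads: int, total_reads: int) -> float:
    """Variant allele frequency: the fraction of reads supporting the variant."""
    return alt_reads / total_reads if total_reads else 0.0


def passes_heuristic_filter(tumor_alt: int, tumor_depth: int,
                            normal_alt: int, normal_depth: int,
                            min_tumor_vaf: float = 0.05,
                            max_normal_vaf: float = 0.02,
                            min_alt_reads: int = 3) -> bool:
    """Keep a candidate only if it looks somatic under fixed thresholds.

    The variant needs enough supporting reads in the tumor, a tumor VAF
    above a floor, and little or no support in the matched normal sample.
    """
    return (
        tumor_alt >= min_alt_reads
        and vaf(tumor_alt, tumor_depth) >= min_tumor_vaf
        and vaf(normal_alt, normal_depth) <= max_normal_vaf
    )
```

For example, `passes_heuristic_filter(12, 80, 0, 40)` keeps a variant seen in 12 of 80 tumor reads and none of 40 normal reads, while the same variant with 5 supporting normal reads would be rejected as a likely germline variant or artifact. Fixed cutoffs like these are what a learned model aims to replace.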
Krishnamachari, a Ph.D. candidate at National University of Singapore and an A*STAR-affiliated AI researcher, said previous callers based on human-designed statistical and probabilistic models for variant identification often make assumptions about sequence context in the interest of simplicity. Sometimes, those assumptions are wrong.
"Deep-learning models are data-driven models that do not require us to model these relationships individually," Krishnamachari said via email. "With large enough training datasets available, we can therefore provide more information to the model so it can extract the meaningful features on its own."
Common variant callers that use machine learning for cancer diagnostics today include Illumina's Strelka2, Roche's NeuSomatic, and the A*STAR-developed Somatic Mutation calling method using a Random Forest (SMuRF). The paper also discussed DeepVariant, a germline variant caller from Google.
"Intriguingly, deep-learning models operating on raw DNA read alignments may learn rich representations of reads comprising both their complex interdependencies as well as the sequence context around mutated sites," the authors wrote. "However, this concept has not been explored for somatic variant calling, where variants have to be evaluated in the context of deeper tumor sequencing data, intratumor heterogeneity, and matched normal reads."
Previous variant callers relied on "human-engineered features" for predicting mutations, according to the paper. For VarNet, A*STAR bioinformaticians trained deep-learning models on enriched representations of real mutations from raw sequence alignments. "Conceptually, this process is mimicking how human experts often manually visualize and curate somatic mutations," they wrote.
The Singapore researchers benchmarked VarNet on whole genomes containing both real and simulated mutations, including synthetic tumors created for the DREAM Somatic Mutation Calling Challenge nearly a decade ago.
They found that the new method outperformed most but not all existing variant callers. Notably, VarNet performed well in genome regions that are not highly alignable.
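Comparisons of this kind are usually scored with precision and recall against a truth set of known mutations. The Python sketch below shows the standard definitions. The function name and the representation of a variant as a tuple are assumptions for illustration, not details from the paper.

```python
from typing import Set, Tuple

# A variant: (chromosome, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def precision_recall(called: Set[Variant],
                     truth: Set[Variant]) -> Tuple[float, float]:
    """Score a caller's output against a truth set.

    Precision is the share of called variants that are real.
    Recall is the share of real variants that were called.
    """
    true_positives = len(called & truth)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

A caller that reports four variants, three of which appear in a truth set of five, would score a precision of 0.75 and a recall of 0.6. A method has to do well on both measures to be described as more accurate overall.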
"These results suggest that the deep-learning approach is able to successfully learn and generalize when presented with sufficiently large datasets using weak supervision," the researchers wrote.
"Overall, VarNet made calls at higher precision and recall compared to other callers for both SNVs and indels," the authors wrote. They also said that the software was more accurate than other callers for both low and high variant allele frequency levels.
However, Skanderup said that somatic variant calling "is not yet considered a 'solved' problem," as it is still plagued by sequencing errors and mutant allele fractions. "Modern machine-learning methods such as deep learning provide an opportunity to close the gap on this problem," he said.
Notably, VarNet was not able to solve a persistent problem with indel calling, though the developers are optimistic that additional training of the deep-learning model can make progress in this area. "We found in our experiments that indel calling benefited from additional training samples more than SNV calling, since indels occur at lower rates than point mutations and are also more difficult to pseudo-label accurately," they noted.
They further noted that self-training has proven successful in other deep-learning bioinformatics systems, specifically naming protein structure prediction software AlphaFold 2 from Google-affiliated DeepMind Technologies.
Though VarNet has been trained on seven different cancer types, Skanderup said that A*STAR has not yet studied whether it can detect somatic variants for other diseases.
Meanwhile, Krishnamachari said that he and colleagues are continuing to improve the size and quality of the VarNet training dataset.