CHICAGO – A group from the University of California, San Diego (UCSD) has improved the detection of mosaic variants, which has historically been confined to oncology because there was not enough nonclonal data in other areas of medicine to produce accurate calls.
Earlier this month, they described a new variant caller, DeepMosaic, a control-independent, image-based convolutional neural network classifier for single-nucleotide mosaic variants in noncancer diseases, in a paper published in Nature Biotechnology.
The UCSD team, assisted by the Brain Somatic Mosaicism Network, a National Institute of Mental Health-funded research consortium, wrote that noncancer mosaic variant detection has long been "computationally challenging due to the sparse representation of nonclonally expanded [mosaic variants]."
DeepMosaic is meant to complement existing variant callers that are either optimized for cancer or focused on germline variants. "This is the next generation of somatic mutation caller," said lead author Xiaoxu Yang, a postdoctoral neurosciences scholar in the laboratory of Joseph Gleeson, director of neuroscience at the UCSD-affiliated Rady Children's Institute of Genomic Medicine at Rady Children's Hospital-San Diego.
DeepMosaic also incorporates population allele frequency data to improve detection of somatic mutations.
The DeepMosaic name is a nod to Google's DeepVariant, though that deep-learning-based variant caller is built for germline variants. The Broad Institute's MuTect, Illumina's Strelka, and Washington University's SomaticSniper are meant to detect tumor mutations.
"The goal of DeepMosaic … is to provide a new platform and introduce a new concept for mosaic variant detection to the field, with users training their own models," Yang said. "We also showed that deep-learning models trained specifically for cancer are not necessarily performing well in noncancer samples."
Yang said that DeepMosaic "theoretically" can also call germline mutations, but that it was optimized for somatic variant calling. "We don't want to just reinvent the wheel," he said.
The authors trained DeepMosaic on 180,000 mosaic variants — gathered from research and simulated data — and benchmarked the software against more than 600,000 mostly simulated variants, plus some real-world exomes and genomes. They said that their software achieved more than twice the validation rate of "previous best-practice methods" for detecting noncancer variants from exome data.
The developers have posted training scripts and variant calls on GitHub so that users can download them and train their own DeepMosaic models.
The UCSD team said that mosaic variants tend to account for about 5 to 10 percent of the "missing genetic heritability" in upwards of 100 known human diseases, but the allelic fractions in nonclonal disorders are "frequently an order of magnitude lower" than in cancer or precancer mosaicism. That makes them hard to detect with earlier variant callers built on "classic" statistical models, which are "often optimized for higher [allelic fraction mosaic variants] seen in cancer," according to the paper.
Newer software such as MosaicHunter — which Yang helped develop — and MosaicForecast might include machine learning but similarly do not include sequence and alignment data. "While these are useful proxies, they represent a limited window into the sheer wealth of information contained in raw sequencing data," the UCSD team wrote.
These methods also require human inspection of alignments in genome browsers, making them difficult to implement on a large scale, the paper said. The authors addressed this limitation by packaging a visualization module with a convolutional neural network for detecting mosaic variants.
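To give a sense of what handing alignments to an image classifier can look like, the sketch below turns a handful of aligned reads into a small image-like array, with one row per read, one column per genomic position, and one channel per base. This is not DeepMosaic's encoding, which is not described here; the function name, the one-hot base channels, and the toy reads are all assumptions made for illustration.

```python
BASES = "ACGT"  # channel order is an arbitrary choice for this sketch


def encode_pileup(reads, window_start, window_len, max_reads):
    """Encode aligned reads as a [max_reads][window_len][4] array of 0/1 values.

    reads: list of (start_position, sequence) tuples (hypothetical input format).
    Positions outside the window and bases other than A/C/G/T are left as zeros.
    """
    image = [[[0] * 4 for _ in range(window_len)] for _ in range(max_reads)]
    for row, (start, sequence) in enumerate(reads[:max_reads]):
        for offset, base in enumerate(sequence):
            col = start + offset - window_start
            if 0 <= col < window_len and base in BASES:
                image[row][col][BASES.index(base)] = 1
    return image


if __name__ == "__main__":
    toy_reads = [(100, "ACGT"), (102, "TT")]  # made-up reads, not real data
    img = encode_pileup(toy_reads, window_start=100, window_len=5, max_reads=3)
    for row in img:
        print(row)
```

An array of this shape is the kind of input a convolutional network can consume directly, which is what removes the need for a person to eyeball each candidate in a genome browser.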
A second paper from many of the same authors appeared in Nature Genetics last week, describing benchmarking data for DeepMosaic in malformations of cortical development (MCD), a common trigger of epileptic seizures. In this work, the researchers analyzed multiomic data to profile MCD samples. They used DeepMosaic in combination with MuTect, and only for analyzing exome data.
Yang said that he and his UCSD colleagues plan on training DeepMosaic on tumors and comparing it to NeuSomatic, a neural network-based oncology variant caller developed by Roche and Microsoft. They also want to train some models for single-cell variant calling.
Xiaotu Ma, a computational biologist at St. Jude Children's Research Hospital in Memphis, Tennessee, who has an interest in germline mosaicism, took issue with some of the assumptions and conclusions in the Nature Biotechnology paper.
Ma said that genome sequencing is not sustainable today above about 200X, while exome sequencing might achieve 500X to 1,000X, but those numbers are stretching affordability. He believes that it would take as much as 2,500X depth of exome sequencing to attain DeepMosaic's stated goal of detecting variants at 0.5 percent to 1 percent frequency.
"If you don't have enough depth, you don't have enough power," Ma said. "It's not going to be achieving what they want." Yang said that his team already showed that DeepMosaic can detect mosaic variants with frequencies as low as 1 percent using 300X whole-genome sequencing, but that might not be feasible routinely with current sequencing technology.
However, the cost of sequencing keeps falling, and several new sequencing platforms launched last year promise higher depth of genome sequencing at lower cost.
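The depth figures cited by both researchers can be put in perspective with a back-of-the-envelope binomial calculation. The sketch below does not come from the paper or from DeepMosaic. It assumes each read carries the variant allele independently with probability equal to the allelic fraction, ignores sequencing error and uneven coverage, and uses a hypothetical requirement of at least five supporting reads for a variant to be callable at all.

```python
from math import comb


def expected_alt_reads(depth: int, allelic_fraction: float) -> float:
    """Average number of reads expected to carry the variant allele."""
    return depth * allelic_fraction


def prob_at_least(depth: int, allelic_fraction: float, min_alt_reads: int) -> float:
    """P(X >= min_alt_reads) where X ~ Binomial(depth, allelic_fraction)."""
    # Sum the probabilities of seeing fewer than min_alt_reads supporting reads,
    # then take the complement.
    p_fewer = sum(
        comb(depth, k)
        * allelic_fraction ** k
        * (1.0 - allelic_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    MIN_ALT_READS = 5  # hypothetical threshold, chosen only for illustration
    for depth in (30, 50, 300, 500, 1000, 2500):
        for af in (0.005, 0.01, 0.05, 0.10):
            print(
                f"{depth:>5}X  AF={af:<5}  "
                f"expected alt reads={expected_alt_reads(depth, af):6.1f}  "
                f"P(>= {MIN_ALT_READS} alt reads)="
                f"{prob_at_least(depth, af, MIN_ALT_READS):.3f}"
            )
```

Under these simplified assumptions, a 1 percent variant is expected to appear in about three reads at 300X and about 25 reads at 2,500X, which is the gap the two researchers are arguing over. Real callers weigh far more than a raw read count, so the threshold here should be read as a stand-in rather than a rule.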
DeepMosaic's machine learning was trained on genome and exome sequencing data at read depths of 30X to 500X, with exomes representing the larger numbers. The researchers benchmarked their training dataset against a "gold-standard validation dataset" from the Brain Somatic Mosaicism Network.
"Higher read depth is definitely better, but at genome scale, the cost of sequencing will be too high," Yang said. At typical depths of 30X to 50X, "we might be able to reliably detect mosaic variations [with frequencies] as low as 5 to 10 percent," he added.
"We showed that DeepMosaic might be able to detect some real variants at lower sequencing depth," Yang said. "The limitation of the cost of sequencing is not DeepMosaic itself."
Ma said he dug into the raw data that Yang and colleagues presented in the supplement to the Nature Biotechnology paper. He said that it is part of his daily practice to analyze data from anyone claiming to have a better tool than what St. Jude currently uses.
From that analysis, he criticized claims that some markers had allelic fractions of 50 percent. "It's heterozygous. It has nothing to do with mosaicism," he said.
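Ma's objection rests on the fact that a heterozygous germline variant is expected in roughly half of the reads at a site, while a mosaic variant, present in only some cells, usually shows a lower fraction. A minimal sketch of that distinction follows, assuming a simple exact two-sided binomial test against an allelic fraction of 0.5. It is an illustrative check only, not the method used by DeepMosaic or by Ma, and the function names and the 0.05 cutoff are hypothetical.

```python
from math import exp, lgamma, log


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of k successes in n trials, computed in log space."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log(1.0 - p))


def het_p_value(alt_reads: int, depth: int, het_fraction: float = 0.5) -> float:
    """Exact two-sided p-value for the hypothesis that the true fraction is 0.5.

    Sums the probability of every outcome that is no more likely than the
    observed one; the small tolerance absorbs floating-point noise.
    """
    pmf = [binom_pmf(k, depth, het_fraction) for k in range(depth + 1)]
    observed = pmf[alt_reads]
    tail = sum(x for x in pmf if x <= observed * (1.0 + 1e-7))
    return min(1.0, tail)


def consistent_with_heterozygous(alt_reads: int, depth: int, alpha: float = 0.05) -> bool:
    """True if the read counts do not reject a 50 percent allelic fraction."""
    return het_p_value(alt_reads, depth) >= alpha


if __name__ == "__main__":
    # Made-up read counts at 100X depth, for illustration only.
    for alt in (5, 20, 48, 50):
        print(alt, 100, consistent_with_heterozygous(alt, 100))
```

A site that passes this check is not thereby shown to be germline, particularly at low depth where the test has little power, but a site that looks like a clean 50 percent is the kind of call Ma argues should not be counted as mosaic.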
Ma did say that DeepMosaic solves a problem that UCSD has. "I think it's going to be very useful for their existing data," he said. "Whether or not it's broadly useful for other communities, I really don't [know]."
Yang noted that variant calling remains an imperfect science, so he recommended that scientists continue to compare results from multiple callers.
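Comparing callers can be as simple as counting how many of them report each variant. The sketch below assumes each caller's output has already been reduced to a set of (chromosome, position, reference allele, alternate allele) tuples. The call sets shown are invented, the caller labels are only dictionary keys, and the two-caller threshold is an arbitrary choice rather than a recommendation from Yang or the paper.

```python
from collections import Counter


def caller_support(calls_by_caller):
    """Count how many callers reported each variant."""
    support = Counter()
    for calls in calls_by_caller.values():
        support.update(set(calls))  # set() guards against duplicate entries
    return support


def consensus(calls_by_caller, min_callers=2):
    """Variants reported by at least min_callers callers."""
    support = caller_support(calls_by_caller)
    return {variant for variant, n in support.items() if n >= min_callers}


if __name__ == "__main__":
    # Invented example calls; not real results from any tool.
    example_calls = {
        "caller_a": {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")},
        "caller_b": {("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A")},
        "caller_c": {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")},
    }
    print(sorted(consensus(example_calls, min_callers=2)))
    print(sorted(consensus(example_calls, min_callers=3)))
```

Variants seen by only one caller are not necessarily wrong; they are simply the ones that most need orthogonal validation.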