NEW YORK – Gene-specific machine learning can be more useful than a disease-specific approach when trying to predict the pathogenicity of rare missense variants for BRCA1 and BRCA2, according to research from teams of bioinformaticians in South Korea and Qatar working independently of each other.
One finding is already prompting a molecular diagnostics company that participated in the research to update its breast cancer test, with the aim of improving accuracy by reducing the number of variants of uncertain significance (VUS).
Investigators led by Kyu-Baek Hwang, a computer scientist at Soongsil University in Seoul, described their method in a recent paper published in Scientific Reports. Researchers from Seoul-based molecular diagnostics and bioinformatics firm NGeneBio collaborated with Hwang and are coauthors on the study.
NGeneBio spokesperson Yunhee Lee said that the company has been working with Hwang on various machine-learning projects for several years, with their comparison of gene-specific and disease-specific machine learning dating to 2020.
A separate paper from Borbala Mifsud and colleagues at Hamad Bin Khalifa University in Doha, Qatar, published last month in Physiological Genomics, looked solely at BRCA1 with a machine-learning model built with open-source Extreme Gradient Boosting (XGBoost) tools. Mifsud also led an investigation of an XGBoost-based prediction tool for BRCA2 missense variants, described in a 2022 Frontiers in Genetics article.
For their new BRCA1 study, the Qatari researchers looked at more than 31,000 previously "unreviewed" variants in the BRCA Exchange database to predict pathogenicity and assess 36 variants of uncertain significance in a Qatar-only cohort. Their XGBoost method identified 2,115 potentially pathogenic variants and predicted with more than 93 percent accuracy the functional consequence of missense variants.
The Korean team chose 1,068 rare missense variants in 28 genes linked to hereditary cancers, including BRCA1/2, and divided supervised machine learning into genome-wide, disease-specific, and gene-specific approaches to predict the pathogenicity of rare missense variants in BRCA1 and BRCA2.
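As a rough illustration of how the three training scopes relate, the following minimal Python sketch narrows one labeled variant pool from genome-wide to disease-specific to gene-specific. The `Variant` record, the `split_training_scopes` helper, the gene list, and the example variants are all hypothetical placeholders and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    variant_id: str   # placeholder identifier, not a real variant name
    pathogenic: bool  # True = (likely) pathogenic, False = (likely) benign


# Illustrative subset only (assumption); the study's 28-gene list is not reproduced here.
HEREDITARY_CANCER_GENES = frozenset({"BRCA1", "BRCA2", "TP53", "MLH1"})


def split_training_scopes(variants, target_gene, disease_genes=HEREDITARY_CANCER_GENES):
    """Return the three nested training sets for one target gene."""
    genome_wide = list(variants)
    # Disease-specific: keep only variants in genes linked to the disease.
    disease_specific = [v for v in genome_wide if v.gene in disease_genes]
    # Gene-specific: keep only variants in the gene being predicted.
    gene_specific = [v for v in disease_specific if v.gene == target_gene]
    return {
        "genome_wide": genome_wide,
        "disease_specific": disease_specific,
        "gene_specific": gene_specific,
    }


# Made-up example pool, for illustration only.
example_pool = [
    Variant("BRCA1", "v1", True),
    Variant("BRCA1", "v2", False),
    Variant("BRCA2", "v3", True),
    Variant("TP53", "v4", False),
    Variant("MYH7", "v5", True),  # a gene outside the hypothetical cancer list
]

scopes = split_training_scopes(example_pool, "BRCA1")
for name, subset in scopes.items():
    print(name, len(subset))
```

Each scope is a subset of the one before it, which is the trade-off between specificity and training-set size in miniature.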
Data for training and validating the algorithm came from NGeneBio's BRCAaccuTest gene panel results annotated with information from public variant datasets, according to Lee.
Genome-wide supervised machine learning may include a larger dataset than other methods but "does not account for disease-specific patterns in variant pathogenicity," the authors explained. Disease-specific supervised machine learning addresses this shortfall by limiting analysis to variants known to be associated with specific illnesses. The Soongsil University team cited earlier papers showing the efficacy of this approach for cardiomyopathy, arrhythmia, epilepsy, RASopathies, and hereditary cancers.
"Compared to the disease-specific approach, gene-specific supervised machine learning is even more specific as it builds pathogenicity predictors using variants from only a particular disease gene," they wrote. "This method has the potential to perform best due to its highest specificity," though it has the smallest number of available training variants of any of the approaches studied.
In this study, the disease-specific dataset used to train machine learning was seven times larger than the gene-specific dataset. "However, we observed that gene-specific training variants were sufficient to produce the optimal pathogenicity predictor if a suitable machine-learning classifier was employed," according to the paper's authors.
"Despite the [disease-specific] data being more than seven times larger than the [gene-specific] data, selecting the appropriate machine-learning algorithm … leads to achieving optimal accuracy with just the [gene-specific] variations," Lee added.
Earlier work on gene-specific variant pathogenicity predictors for genes associated with diseases failed to compare gene-specific and disease-specific methods, according to the authors. "The comparison between gene-specific and disease-specific approaches is meaningful because there is a trade-off between specificity and training sample size," they wrote.
The Soongsil University-NGeneBio investigators filtered and then classified the variants according to whether they were known or likely to be pathogenic or known or likely to be benign. They predicted pathogenicity of rare BRCA1/2 missense variants based on five criteria: minor allele frequency, site conservation score, predicted functional-impact score, position, and a catch-all "others" category.
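The filtering and labeling step described here could look something like the following Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the field names, the `label_variant` and `build_dataset` helpers, the use of a single numeric placeholder for the "others" category, and the example records are all hypothetical.

```python
# Map clinical classifications to binary training labels (assumed encoding).
LABELS = {
    "pathogenic": 1,
    "likely pathogenic": 1,
    "benign": 0,
    "likely benign": 0,
}

# One hypothetical numeric column per feature category named in the article.
FEATURE_GROUPS = (
    "minor_allele_frequency",
    "conservation_score",
    "functional_impact_score",
    "position",
    "others",  # catch-all category, reduced to a single placeholder value here
)


def label_variant(classification):
    """Return 1, 0, or None (filtered out) for a classification string."""
    return LABELS.get(classification.strip().lower())


def build_dataset(records):
    """Drop unlabeled variants and return (feature rows, labels)."""
    features, labels = [], []
    for rec in records:
        label = label_variant(rec["classification"])
        if label is None:
            continue  # e.g. variants of uncertain significance
        features.append([float(rec[name]) for name in FEATURE_GROUPS])
        labels.append(label)
    return features, labels


# Made-up records, for illustration only.
example_records = [
    {"classification": "Likely pathogenic", "minor_allele_frequency": 0.0,
     "conservation_score": 5.1, "functional_impact_score": 0.9,
     "position": 1700, "others": 0},
    {"classification": "Uncertain significance", "minor_allele_frequency": 0.0001,
     "conservation_score": 2.0, "functional_impact_score": 0.4,
     "position": 900, "others": 0},
    {"classification": "Benign", "minor_allele_frequency": 0.004,
     "conservation_score": -1.2, "functional_impact_score": 0.1,
     "position": 350, "others": 1},
]

X, y = build_dataset(example_records)
print(len(X), y)
```

The resulting feature rows and labels could then be passed to any classifier, such as a random forest or a gradient-boosted model.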
For BRCA1, the Korean researchers wrote that they did not find any "remarkable difference in prediction performance between gene-specific and disease-specific machine learning." While performance varied by machine-learning method, only the random-forest method produced a statistically significant difference, though, as in the Qatari study, XGBoost also performed well.
Disease-specific machine learning was more dependent on minor allele frequency in the training set than gene-specific learning.
"Choosing the appropriate set of training variants is crucial for developing an accurate pathogenicity predictor using machine learning," the authors wrote. "Our findings suggest that gene-specific machine learning can achieve optimal pathogenicity prediction with an appropriate algorithm, without the need to include disease-specific variants in the training set."
The Qatari researchers stuck with a single method. "We decided to use XGBoost based on a previous paper that showed that it outperformed other methods when predicting BRCA1/2 pathogenic variants," Mifsud said via email.
Mifsud said that the Korean paper "confirms that XGBoost is one of the best methods," adding that the slightly stronger showing by random-forest machine learning may have been due to the small sample size.
Lee said that the "main novelty of the [Scientific Reports] paper lies in constructing predictive models by dividing [gene-specific and disease-specific methods] for rare variant pathogenicity prediction." Previous studies largely used genome-wide data as the comparator.
Lee said that no labs have adopted this method yet, as the firm is making some additional tweaks to the software before releasing it for commercial use as well as updating the BRCAaccuTest panel. The research "complements the framework for distinguishing mutations that could potentially hold clinical importance," she said. "I am of the opinion that this approach has the potential to enhance [diagnostic] precision and contribute to improvements in accuracy" of the panel.
Lee added that the firm does not have a timetable for incorporating these new findings into the test but said that NGeneBio probably will not need to seek regulatory review for any updates.
Alex Colavin, clinical science and interpretation lead at Invitae, said that it "makes a lot of biological sense" to look at gene-specific predictive models.
"Even if multiple genes are associated with the same disease, there is no fundamental biological reason why the individual proteins encoded by the genes have to work in the same way," Colavin wrote in an email. "Even though BRCA1 and BRCA2 are thought of as related genes, they look very different and work very differently in human cells."
He said that Invitae has been using gene-specific modeling for several years.
"We agree that in many circumstances it can be more informative to use gene-specific models," Colavin said. However, the company often has to rely on "weaker" methods because there is insufficient data to train or validate gene-specific models for so many genes and proteins.
Though he said he doesn't consider the Scientific Reports paper to present a novel methodology, he added, "I would consider this a validation that with enough training data it is possible to train models that learn about gene-specific molecular mechanisms in a way that can be superior to multigene models."