Baylor Genetics Machine Learning Tool Shows Promise for Clinical Rare Variant Interpretation


NEW YORK – A new machine learning algorithm from Baylor Genetics offers the possibility of accelerating molecular diagnoses for rare Mendelian disorders by helping clinicians interpret candidate variants.

The tool, called Artificial Intelligence Model organism Aggregated Resources for Rare Variant ExpLoration (AI-MARRVEL, or AIM), was described in a paper published last week in the New England Journal of Medicine AI. It surpassed existing algorithms when tested on three independent datasets of diagnosed patients, doubling the rate of accurate diagnoses.

The Baylor researchers trained AIM on MARRVEL, a public database they had previously developed that integrates information from six human genetic databases and seven model organism databases and contains over 3.5 million variants from thousands of diagnosed medical cases and model organisms.

The team then attempted to train AIM to mimic human clinical decision-making through "knowledge-based feature engineering," in which they incorporated features beyond sequencing and phenotype data into the algorithm's processes. These extra features included disease database, minor allele frequency, variant impact, evolutionary conservation, inheritance pattern, phenotype matching, gene constraint, variant pathogenicity prediction scores, splicing prediction, and sequencing quality.

Using these data, AIM runs six different modules, each of which produces an independent analysis of a variant's pathogenicity.

Zhandong Liu, associate professor of pediatrics and neurology at Baylor College of Medicine and the corresponding author of the study, said that machine learning approaches to genetic diagnosis have had low success overall, and a key idea in designing AIM was to attempt to understand why that has been the case.

"We engineered this knowledge-driven approach, trying to combine existing knowledge and data … to teach the algorithm how a human makes decisions," Liu said.

After "a couple years of trial and error," he said, the Baylor Genetics team has finally brought AIM to a point where it outperforms most other publicly available, benchmarked tools.

In their study, Liu and his colleagues tested AIM's performance against that of four other algorithms: Exomiser and its counterpart Genomiser, LIRICAL, PhenIX, and Xrare.

They did this using exome sequencing data and Human Phenotype Ontology terms, a standardized vocabulary of phenotypic abnormalities encountered in human disease, gathered from 1,102 patients in the Clinical Diagnostic Lab (DiagLab), 75 from the Undiagnosed Diseases Network (UDN), and 200 from the Deciphering Developmental Disorders project (DDD). Each cohort was evaluated separately, and the DiagLab cohort was divided into a training set of 1,044 patients and a testing set of 58 patients.

All samples in these real-world datasets had established diagnoses, and AIM consistently ranked diagnosed genes as the number one causative candidates in twice as many cases as the other algorithms.
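The "number one causative candidate" comparison above is a top-1 ranking metric. As a rough illustration only, and not the study's evaluation code, the sketch below computes the fraction of cases in which the diagnosed gene lands within the top k ranked candidates. The function name, case IDs, and gene labels are all hypothetical placeholders.

```python
from typing import Dict, List


def top_k_accuracy(
    rankings: Dict[str, List[str]], truth: Dict[str, str], k: int = 1
) -> float:
    """Fraction of cases whose diagnosed gene is among the top k candidates.

    rankings maps a case ID to its candidate genes, best first.
    truth maps a case ID to the gene named in the established diagnosis.
    """
    if not truth:
        raise ValueError("at least one diagnosed case is required")
    hits = sum(
        1 for case, gene in truth.items() if gene in rankings.get(case, [])[:k]
    )
    return hits / len(truth)


# Hypothetical toy data: four cases with made-up gene labels.
rankings = {
    "case1": ["GENE_A", "GENE_B", "GENE_C"],
    "case2": ["GENE_D", "GENE_E", "GENE_F"],
    "case3": ["GENE_G", "GENE_H", "GENE_I"],
    "case4": ["GENE_J", "GENE_K"],
}
truth = {
    "case1": "GENE_A",  # ranked first
    "case2": "GENE_E",  # ranked second
    "case3": "GENE_I",  # ranked third
    "case4": "GENE_J",  # ranked first
}

top1 = top_k_accuracy(rankings, truth, k=1)
top3 = top_k_accuracy(rankings, truth, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

Under this kind of metric, a tool that "doubles" the rate would place the diagnosed gene first in twice as large a share of cases as its competitors on the same cohort.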

"This paper highlights the importance of high-quality data input (labeling) and expert feature engineering for the highest accuracy," Michael Korn, chief medical officer of Invitae, said in an email. He was not involved in the study.

Earlier this year, Invitae introduced its own Clinical Variant Modeling tool, a machine learning-based approach to gene variant interpretation, that Korn said has increased variant resolution rates by combining information on proteins, mRNA processing, population frequency, evolutionary data, and clinical data in AI models.

Alexander Lachmann, an assistant professor of computational biology at the Mount Sinai Center for Bioinformatics who was not involved in the study, praised AIM's potential utility and the "high technical expertise" shown in the paper. He noted, however, that the manuscript's methods raised some critical questions that the study left unanswered.

"On the positive side," he said, "the interpretability of features when using a decision tree gives some confidence in the predictions. And by a deeper analysis of the feature space, they show that their engineered features do all seem to contribute."

However, Lachmann highlighted the way in which the DiagLab cohort was split for training and testing as a potential drawback.

"It poses the question [of] whether the 58 samples were selected to maximize the model accuracy or [its] dominance to the other methods," he said.

Liu said that the 58 individuals in the testing set were chosen for their diverse range of mutation types and disease mechanisms, which provides a "comprehensive overview" of the types of scenarios one might expect in clinical settings.

"The selection intentionally focuses on challenging cases that are prone to being overlooked," he said. "This approach is designed to rigorously test and demonstrate the limitations and capabilities of AI-MARRVEL under realistic and demanding clinical conditions."

Lachmann also cautioned that AIM appears to rely heavily on literature-based gene knowledge, which can introduce biases based on human interpretation of variants.

The Baylor Genetics team, for instance, noted in its study that relying on ClinVar information alone is insufficient for accurate diagnosis. They explored this with a variant of AIM, in which they excluded ClinVar information, noting only a "slight performance decrease" compared to the full AIM.

While Lachmann complimented the investigators for having explored this issue with ClinVar, he noted that the effects of other curated knowledge sources used by AIM should be further explored.

In terms of future improvements, Liu said that the application of large language models to variant interpretation is a rapidly developing area, with plenty of room to push the boundaries in terms of accuracy. According to the paper, LLMs "may be considered for future integration into the AIM platform."

Liu and his colleagues also explored whether AIM could be applied to finding new disease genes.

"In the clinical workflow, we're pretty much only making high-confidence diagnoses of known disease genes," said Pengfei Liu, associate professor of molecular and human genetics at Baylor College of Medicine and the study's co-first author.

"We were curious to see [if] after we have this program working, whether AI can learn from just using the known disease information to predict novel disease gene candidates," he said.

For this, the team created a version of AIM called AIM-NDG, in which they eliminated all features directly or indirectly connected to established disease databases such as OMIM, ClinVar, and HGMD, resulting in a "noticeable decrease in accuracy," according to the paper.

However, AIM-NDG correctly predicted two recently reported disease genes from data of two individuals in the UDN: MYCBP2 and TMEM161B, both of which have been implicated in neurodevelopmental disorders.
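Both the ClinVar-free variant of AIM and AIM-NDG amount to feature ablation: drop every input tied to a given knowledge source, then re-evaluate. The following is a minimal sketch of that idea, not the paper's implementation. The feature names, the source mapping, and the `ablate` helper are assumptions made up for illustration.

```python
from typing import Dict, Set

# Hypothetical mapping from feature name to the knowledge source it derives from.
FEATURE_SOURCE: Dict[str, str] = {
    "clinvar_pathogenic": "ClinVar",
    "omim_disease_gene": "OMIM",
    "hgmd_reported": "HGMD",
    "minor_allele_frequency": "population",
    "conservation_score": "conservation",
    "splicing_prediction": "in_silico",
}


def ablate(features: Dict[str, float], excluded: Set[str]) -> Dict[str, float]:
    """Return a copy of the feature vector without features from excluded sources.

    Features missing from FEATURE_SOURCE are kept, since their source is unknown.
    """
    return {
        name: value
        for name, value in features.items()
        if FEATURE_SOURCE.get(name) not in excluded
    }


# Hypothetical feature vector for a single variant.
variant = {
    "clinvar_pathogenic": 1.0,
    "omim_disease_gene": 1.0,
    "hgmd_reported": 0.0,
    "minor_allele_frequency": 0.0001,
    "conservation_score": 0.93,
    "splicing_prediction": 0.12,
}

# Analogous to the ClinVar-free variant: remove one source.
no_clinvar = ablate(variant, {"ClinVar"})

# Analogous to AIM-NDG: remove all disease-database features.
ndg_like = ablate(variant, {"OMIM", "ClinVar", "HGMD"})

print(sorted(no_clinvar))
print(sorted(ndg_like))
```

A real ablation would also need to catch features that depend on those databases indirectly, which the paper says AIM-NDG does and which a simple name-to-source lookup like this one would not capture on its own.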

AIM is publicly available, and the Baylor Genetics team is excited for other researchers to begin using it, as their tests and discoveries will help further refine the program.

One researcher who is currently using AIM is Michael Wangler, associate professor of molecular and human genetics at Baylor College of Medicine, from which Baylor Genetics was spun out in 2015.

Wangler said in an email that he uses AIM in research related to the Community Texome Project, a program funded by the National Human Genome Research Institute (NHGRI) that aims to make genomic medicine more accessible and useful for underserved minority communities in Texas.

"We use the software primarily in cases where there is no diagnostic finding from exome [data] and we are trying to drive gene discovery by finding new gene candidates," he said.

In the meantime, Liu's team plans to further develop the tool. "We hope more and more users will join the community and give us feedback to improve AIM," he said.

According to the paper describing AIM, the tool has several limitations, including an inability to analyze structural and copy number variants. It was also trained mainly on cases with coding variants from exome sequencing studies, which limits its ability to prioritize noncoding variants.