NEW YORK – Researchers at Harvard Medical School and the University of Oxford have developed an AI tool that predicts the meaning of genetic variants based on their evolutionary conservation across species.
The tool, an unsupervised, deep generative model called the evolutionary model of variant effect, or EVE, could help researchers decide which variants of unknown significance to focus on by predicting which ones are more likely to be pathogenic. The researchers described the model recently in Nature.
"This data captures the result of millions of years of evolutionary experiments and selection," Mafalda Dias, one of the study's authors, said in an email. "In that sense, patterns of amino acids that are preserved, or that co-evolve, across many species are likely to be viable, and variants thereof likely to be pathogenic. By making use of deep learning, together with this new wealth of sequences, we can uncover complex patterns we would not be able to see otherwise."
The researchers, part of the international Atlas of Variant Effects, used roughly 250 million protein sequences from over 140,000 organisms, including several extinct ones, to get a sense of how constrained various sequences are across evolution. This enabled them to estimate an "evolutionary index," or the relative likelihood of seeing each amino acid variant, with respect to a wild-type sequence.
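To make the idea of an evolutionary index concrete, here is a minimal sketch in Python. It is not EVE itself: EVE is a deep generative model, whereas this toy treats every position in a small, made-up alignment independently and simply compares how often the wild-type and variant residues appear. The function names, the pseudocount, the example sequences, and the sign convention (higher means rarer than wild type) are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def site_frequencies(alignment, pseudocount=1.0):
    """Per-column amino acid frequencies with additive smoothing.

    Assumes equal-length sequences containing only the 20 standard residues.
    """
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    freqs = []
    for pos in range(len(alignment[0])):
        counts = Counter(seq[pos] for seq in alignment)
        freqs.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return freqs


def evolutionary_index(freqs, wild_type, pos, variant_aa):
    """Toy index: log p(wild-type residue) - log p(variant residue) at one site.

    Larger values mean the variant residue is rarer than the wild-type residue
    in the alignment, i.e. the position looks more constrained.
    """
    wt_aa = wild_type[pos]
    return math.log(freqs[pos][wt_aa]) - math.log(freqs[pos][variant_aa])


# Hypothetical four-residue alignment; real inputs span many thousands of species.
alignment = ["MKVL", "MKVL", "MKIL", "MRVL", "MKVL"]
wild_type = "MKVL"
freqs = site_frequencies(alignment)

# Position 0 never varies, so a substitution there scores higher than
# the V-to-I change at position 2, which already occurs in one sequence.
conserved_site = evolutionary_index(freqs, wild_type, 0, "W")
variable_site = evolutionary_index(freqs, wild_type, 2, "I")
print(conserved_site > variable_site)
```

A site-independent model like this cannot see residues that co-evolve, which is the kind of pattern Dias says deep learning helps uncover.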
The evolutionary index assigned to protein variants closely tracked those variants' clinical labels, which classified them as either benign or pathogenic.
Rather than making a binary "benign or pathogenic" call, EVE assigns each variant a probability, an "EVE score" running from zero to one, that represents its likelihood of being harmful to health.
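One common way to turn a continuous index into a probability between zero and one is to model benign and pathogenic variants as two overlapping bell curves and ask which curve a given value more likely came from. The Python sketch below does that with a two-component Gaussian mixture. The article does not spell out how EVE computes its score, so the mixture itself, the means, the standard deviations, and the equal weighting are all hypothetical values chosen for illustration, not EVE's fitted parameters.

```python
import math


def log_gaussian(x, mean, std):
    """Log density of a normal distribution at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def pathogenic_probability(index, benign=(2.0, 2.0), pathogenic=(8.0, 2.0),
                           pathogenic_weight=0.5):
    """Posterior probability that `index` belongs to the pathogenic component.

    `benign` and `pathogenic` are (mean, std) pairs; all defaults are
    hypothetical and exist only to make the example runnable.
    """
    log_b = math.log(1.0 - pathogenic_weight) + log_gaussian(index, *benign)
    log_p = math.log(pathogenic_weight) + log_gaussian(index, *pathogenic)
    diff = log_b - log_p
    if diff > 700.0:  # guard against overflow in math.exp
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


for index in (1.0, 5.0, 10.0):
    print(index, round(pathogenic_probability(index), 3))
```

With these made-up parameters, a value midway between the two means gets a score of 0.5, the point of greatest uncertainty.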
Correctly interpreting the meaning of genetic variation remains a challenge in systems biology and carries clear clinical consequences. Calling a benign variation pathogenic could trigger a misdiagnosis, while reading a pathogenic variation as benign could delay diagnosis and treatment.
EVE's probability scoring stems from the different ways that variants are classified, such as benign, pathogenic, or uncertain, and introduces a trade-off between predictive accuracy and variant coverage. Essentially, EVE's accuracy correlated with the number of variants of uncertain classification in a dataset. Excluding more of the most uncertain variants, for example, improved the algorithm's predictive accuracy on the benign and pathogenic variants that remained.
"In practice," the authors wrote, "we envision researchers deciding on specific trade-offs on a gene-by-gene and use case basis."
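The trade-off between accuracy and coverage can be illustrated with a few lines of Python. The sketch below ranks variants by how uncertain their score is, keeps only the most confident fraction, and measures accuracy on what is left. The scores and labels are invented, and using binary entropy as the uncertainty measure and 0.5 as the decision threshold are assumptions made for this example rather than details taken from the study.

```python
import math


def binary_entropy(p):
    """Uncertainty of a probability: highest at 0.5, zero at 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def accuracy_at_coverage(scores, labels, coverage, threshold=0.5):
    """Accuracy on the `coverage` fraction of variants with the lowest uncertainty.

    `labels` use 1 for pathogenic and 0 for benign.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: binary_entropy(pair[0]))
    keep = max(1, round(coverage * len(ranked)))
    kept = ranked[:keep]
    correct = sum((score >= threshold) == bool(label) for score, label in kept)
    return correct / len(kept)


# Invented scores and clinical labels for eight variants.
scores = [0.02, 0.10, 0.45, 0.55, 0.90, 0.97, 0.40, 0.60]
labels = [0, 0, 1, 0, 1, 1, 0, 1]

for coverage in (1.0, 0.75, 0.5):
    print(coverage, accuracy_at_coverage(scores, labels, coverage))
```

In this toy data the two misclassified variants are also the two closest to 0.5, so setting them aside raises accuracy at the cost of leaving them unclassified. How many to set aside is the gene-by-gene choice the authors describe.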
Dias and her colleagues applied EVE to a set of 3,219 known disease-associated human genes containing 36 million single amino acid variants. EVE outperformed several other supervised and unsupervised algorithms at predicting the clinical significance of labeled variants across all genes, including 60 clinically actionable genes.
EVE's predictions also accurately recapitulated experimental results related to five well-studied genes associated with cancer, cancer syndromes, and heart rhythm disorders.
Overall, the researchers used EVE to generate possible interpretations of approximately 27 million variants, including over 800,000 of the variants seen to date in humans. The algorithm, made freely available via GitHub and a dedicated EVE website, has already begun to attract some attention.
Genetic information company Invitae came across the EVE study preprint on bioRxiv.
"The high accuracy of its results made it clear that we could further benefit individuals tested at Invitae by incorporating this algorithm within Invitae’s Functional Modeling Platform (FMP)," John Nicoludis, a computational biologist at Invitae, explained via email.
Before adding EVE model predictions to the FMP, Invitae is clinically validating them, seeking to ensure that they are more than 95 percent accurate.
"Following our validation of the EVE model predictions, we plan to incorporate them into the FMP so they can be utilized during clinical variant interpretation," Nicoludis wrote. "In the coming months, we expect EVE models to contribute to more definitive classifications and benefit thousands of Invitae patients with improved test results."
Dias and her colleagues have several other projects underway both to put EVE's predictions to work and to further refine the algorithm.
Together with colleagues at the Atlas of Variant Effects, Dias is working to establish how to prioritize genes for further investigation, taking into consideration criteria such as clinical relevance, urgency, and complementarity to experimental methods. This, they hope, will help researchers narrow experimental efforts to those genes most likely to deliver actionable results, particularly in more challenging experimental disease contexts.
"A great example are genes involved in many neurodegenerative diseases, which are very hard to study experimentally," Dias stated.
The team continues to update their server, aiming to generate predictions covering the entire human proteome.
"We believe we have only just scratched the surface of what can be learned from evolution about human health, and there are many aspects of these types of models that we don’t fully understand," Dias said. "For example, while our model is uncovering complex patterns of interactions between amino acids, we currently only make predictions for the effect of single amino acid variants. We are looking forward to extending our approach to predicting the effect of different combinations of variants, and dependencies with respect to the global genetic background of a person."