CHICAGO – Researchers at Baylor College of Medicine in Houston have developed a novel computing method that uses graph analysis to identify associations between genes and complex, polygenic diseases.
The algorithm, called GeneEMBED, which stands for Gene-embedding-based Evaluation of Disease-gene Relevance, was able to identify 143 genes that "interacted significantly" with known Alzheimer's disease markers by examining Alzheimer's Disease Sequencing Project (ADSP) cohorts. The researchers then validated these genes in vivo with fruit flies.
Notably, the Baylor investigators identified PLEC and UTRN as "novel and unsuspected candidates" for Alzheimer's. They described their work in a recent paper in Cell Genomics.
"Importantly, many GeneEMBED candidates are druggable with already approved compounds. Overall, these results point to new targets for therapeutic development in AD and broadly support a novel and general paradigm to interrogate other complex genetic diseases," the authors wrote.
They called GeneEMBED an advance over an existing genome-wide association study analysis tool called Multi-Marker Analysis of GenoMic Annotation (MAGMA) because the new algorithm was able to find gene overlaps between cohorts of markedly different sizes.
The Baylor team tested the algorithm on the Alzheimer's Disease Sequencing Project's Discovery and Extension cohorts, totaling 6,138 individuals. Specifically, their research looked at functional perturbations induced by coding variants in cohorts of 5,169 exomes and 969 genomes.
Earlier GWAS cited in the paper found about 40 genetic loci for late-onset Alzheimer's, but those loci account for only one-third of heritability. "While there are many explanations for this 'missing heritability' problem, which is seen across complex diseases, an attractive hypothesis suggests that genetic interactions may be a culprit," the researchers wrote.
Previous approaches relied on expression data or are specific to somatic mutations, and thus are not well suited for germline GWAS, according to the authors. GeneEMBED looks at differential perturbation patterns of gene interactions by annotating molecular networks with information about protein coding variants.
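The idea of annotating a network with variant information and then looking for differences between cases and controls can be illustrated with a toy sketch. This is not the GeneEMBED implementation: the gene names, the per-gene impact scores, and the helper functions `annotate_network` and `differential_perturbation` are all hypothetical, and averaging the scores of the two interacting genes is an assumption made purely for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def annotate_network(edges: List[Edge], burden: Dict[str, float]) -> Dict[Edge, float]:
    """Weight each interaction by the mean variant-impact score of its two genes.

    Assumption: a simple average stands in for whatever annotation the
    published method actually uses. Genes with no score default to 0.0.
    """
    return {
        (a, b): (burden.get(a, 0.0) + burden.get(b, 0.0)) / 2
        for a, b in edges
    }


def differential_perturbation(
    edges: List[Edge],
    case_burden: Dict[str, float],
    control_burden: Dict[str, float],
) -> Dict[Edge, float]:
    """Return, per interaction, the case weight minus the control weight."""
    case_weights = annotate_network(edges, case_burden)
    control_weights = annotate_network(edges, control_burden)
    return {edge: case_weights[edge] - control_weights[edge] for edge in case_weights}


# Hypothetical toy inputs: placeholder genes and scores, not study data.
edges = [("GENE_A", "GENE_B"), ("GENE_B", "GENE_C")]
case_burden = {"GENE_A": 0.8, "GENE_B": 0.4, "GENE_C": 0.1}
control_burden = {"GENE_A": 0.2, "GENE_B": 0.4, "GENE_C": 0.1}

diff = differential_perturbation(edges, case_burden, control_burden)
for edge, delta in diff.items():
    print(edge, round(delta, 3))
```

In this toy example only the interaction involving GENE_A shifts between the two groups, because GENE_A is the only gene whose score differs between cases and controls.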
While Alzheimer's disease was the proof of concept for GeneEMBED, the researchers called their computing method a "general approach that should be broadly applicable to identify genes relevant to risk mechanisms and therapy of other complex diseases."
The Baylor team chose Alzheimer's as the use case because it is representative of such diseases and it is growing in prevalence as the population ages, according to corresponding author Olivier Lichtarge, chair and director of the Computational and Integrative Biomedical Research Center at Baylor.
All of the authors are at Baylor, though they span several departments. Notably, Juan Botas and Ismael Al-Ramahi are neurodegeneration investigators at the school's Neurological Research Institute who regularly study fly homologs of human genes.
The computing method is based on graph analysis, which can be described as a set of branches originating from the genomic sequence and joining back together at a different point that indicates a biomarker.
First author Yashwanth Lagisetty, a Ph.D. student at Baylor, called graph analysis a field of geometric deep learning.
"It's a way to coalesce all of these qualitative and quantitative traits that genes might have in a network like their locations or their mutational burden," Lagisetty explained. "Can we meld that into one mathematical object using this deep learning?"
GeneEMBED seeks to identify genes that are perturbed in Alzheimer's cases by combining network biology with unsupervised deep learning. In the paper, the Baylor researchers demonstrated how the algorithm can generate embeddings for three different types of databases of protein-protein interactions and two types of variant impact predictors.
Lichtarge explained that "networks" describe the inputs to the analytics, in this case the genome and exome sequencing. "From that, we derive their variants with respect to the standard reference genome, and then we … compute the likely impact of those variants," he said.
Alison Goate, director of the Ronald M. Loeb Center for Alzheimer's Disease at Icahn School of Medicine at Mount Sinai in New York, said in an email that while there are several other computing methods that prioritize genes within genetic loci to identify candidate disease-causing genes, the GeneEMBED work represents an "important step forward" because it identifies and validates disease-associated molecular networks. "As more and more omics data become available from relevant tissue and cell types, the precision of these methods will improve," said Goate, who was not involved with the study.
The Baylor researchers decided to focus on gene embedding because they wanted to compare not just variants but the functional impact of variants. Lagisetty, who designed the experiment, said he was looking for an efficient, accurate method of accounting for gene-gene interactions in the context of functional variants.
"The motivation for doing any sort of network analysis … is that fundamentally there's a large genetic component to a lot of these complex diseases, and a lot of that genetic component we don't quite understand very well," Lagisetty said. "To account for gene-gene interactions, we really need to look at the network interactions that genes have."
Lichtarge said that nobody else has developed such a method for variant identification simply because graph representation is still a "developing" area of computer science. "To figure out how to apply it in the context of biology demands that you have a good representation for the impact of coding variants," which he said is what his laboratory specializes in.
"I think that given time, if we hadn't done that, somebody else would have," Lichtarge said.
A key difference is that the Baylor researchers chose to home in on germline gene embedding. "Being able to compare graphs like that between cases and controls requires that you have really good embeddings," Lagisetty said.
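As a rough illustration of what comparing embeddings between cases and controls could look like, the sketch below ranks genes by how far their embedding vector moves between the two groups. It assumes the embeddings already exist and live in the same coordinate space. The vectors, the gene names, and the function `embedding_shift` are hypothetical, and Euclidean distance is a stand-in, not necessarily the measure used in the paper. The unsupervised deep learning step that would produce the embeddings is not shown.

```python
import math
from typing import Dict, List, Tuple


def embedding_shift(
    case_emb: Dict[str, List[float]],
    control_emb: Dict[str, List[float]],
) -> List[Tuple[str, float]]:
    """Rank genes present in both groups by the distance between their embeddings.

    Assumption: both sets of vectors share one coordinate space, so a
    plain Euclidean distance between them is meaningful.
    """
    shared = case_emb.keys() & control_emb.keys()
    shifts = [(gene, math.dist(case_emb[gene], control_emb[gene])) for gene in shared]
    # Largest shift first; ties are broken alphabetically for a stable order.
    return sorted(shifts, key=lambda item: (-item[1], item[0]))


# Hypothetical two-dimensional embeddings: placeholders, not study data.
case_emb = {"GENE_A": [1.0, 0.0], "GENE_B": [0.0, 1.0]}
control_emb = {"GENE_A": [0.0, 0.0], "GENE_B": [0.0, 1.0]}

ranked = embedding_shift(case_emb, control_emb)
for gene, shift in ranked:
    print(gene, shift)
```

A gene whose vector barely moves between the groups ends up at the bottom of the ranking, which is why the quality of the embeddings matters: noisy vectors would produce large shifts that reflect no real biological difference.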
According to the paper, the integration of network data makes GeneEMBED unable to produce useful predictions when there are no interactions or genes related to gene pathology. Another limitation is that they only tested the software with coding mutations. "Extending GeneEMBED to incorporate noncoding data may be a fruitful future direction," they wrote.
Lagisetty said that the research team is now attempting to extend the algorithm to noncoding mutations, first by figuring out how to quantify the impact of variants in noncoding regions of the genome.
"The basic framework is almost ready to absorb noncoding information," Lichtarge said. "But the hurdle right now is the annotation of the noncoding information."
The researchers also have not yet extended the technology to other complex diseases. Lichtarge said that his team is open to collaborations with other labs beyond Alzheimer's disease.
Lichtarge said that future applications of GeneEMBED need not be limited to disease. "It can be also traits of interest," such as which genes regulate specific physical or biochemical aspects of an individual's makeup.
Lichtarge, who is trained in internal medicine and endocrinology, said that his lab wants to make advances on polygenic and omnigenic diseases. "We're really hoping that this can help inform precision medicine in the real world," he said.