Machine Learning Approach Helps Identify New Rare Variants Linked to Heart Disease


NEW YORK – Researchers from the Icahn School of Medicine at Mount Sinai used a new computational method to identify rare and ultra-rare coding variants related to coronary artery disease (CAD) risk in 17 genes, shedding light on the underlying biological processes of the disease.

To discover the new CAD-related gene variants, the investigators applied a previously published CAD-predictive machine learning model called in silico scores for CAD (ISCAD), which captures CAD risk from known risk factors, pooled cohort equations, and polygenic risk scores.

Their study was published Tuesday in Nature Genetics.

ISCAD was developed as a means to measure CAD severity and provide a prognosis on a more quantitative and continuous scale, as opposed to the more binary framework of classifying patients as cases or controls, said Ron Do, professor of personalized medicine at Mount Sinai and the study's senior author.

"We use machine learning and clinical data from electronic health records trained on the case-control labels to predict coronary artery disease risk, [then] take the probabilities from those models and consider that as the machine learning-based scoring for CAD," Do said. "Then we conducted genetics on that score."

The resulting spectrum of disease risk, Do said, provides greater granularity.
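The two-step workflow Do describes (train a classifier on case-control labels, treat its predicted probabilities as a continuous score, then test that score against rare variant carrier status) can be sketched roughly as follows. This is a minimal illustration on simulated data, not the ISCAD implementation: the logistic regression stands in for whatever model the authors used, the features, effect sizes, and carrier frequency are invented, and the score is computed in-sample for brevity.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Gradient-descent logistic regression; a stand-in for the real model."""
    Xb = np.column_stack([np.ones(len(X)), X])  # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    """Predicted case probability, used here as the continuous risk score."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def burden_association(score, carrier):
    """OLS slope of the score on carrier status, with its standard error."""
    x = carrier.astype(float)
    xc = x - x.mean()
    yc = score - score.mean()
    beta = (xc @ yc) / (xc @ xc)
    resid = yc - beta * xc
    se = np.sqrt((resid @ resid) / (len(score) - 2) / (xc @ xc))
    return beta, se

# Simulated cohort (all values hypothetical).
rng = np.random.default_rng(0)
n = 5000
carrier = rng.random(n) < 0.02          # rare-variant carriers in one gene
features = rng.normal(size=(n, 3))      # standardized EHR-style features
features[carrier, 0] += 1.5             # carriers have a shifted risk factor
logit = -2.0 + 0.8 * features[:, 0] + 0.5 * features[:, 1]
labels = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

# Step 1: train on case-control labels, keep the probabilities as the score.
weights = fit_logistic(features, labels)
score = predict_proba(weights, features)

# Step 2: "conduct genetics on that score" with a simple carrier-vs-score test.
beta, se = burden_association(score, carrier)
print(f"effect of carrier status on score: {beta:.3f} (SE {se:.3f})")
```

Because the score is continuous, carriers who were never labeled as cases still contribute a shifted score to the association test, which is one way to read the granularity Do describes.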

Ben Omega Petrazzini, associate bioinformatician at Mount Sinai and the study's lead author, said that quantifying CAD features on a gradient is important because in CAD, individuals can go undiagnosed despite having signs of disease physiology, which are not captured in the case-control paradigm.

"We think that our score captures these different gradients of disease," he said.

The ISCAD machine learning method incorporates laboratory measurements, vital signs, medications, symptoms, and genetic data from the UK Biobank, the All of Us Research Program, and the BioMe Biobank, a combination that Do said helps to mitigate the misclassification biases that stem from relying too heavily on diagnostic codes.

Do and his colleagues built their ISCAD model using electronic health record (EHR) data from 502,505 individuals in the UK Biobank, 113,575 individuals in the All of Us Research Program, and 43,744 individuals in the BioMe Biobank. They next tested the association of the ISCAD score with rare and ultra-rare coding variants on exome sequence data from 464,416 individuals in the UK Biobank, 106,926 individuals in the All of Us Research Program, and 33,573 individuals from two distinct sample populations in the BioMe Biobank.

These association tests led to the discovery of 17 genes containing variants associated with CAD risk. Some of these variants regulate known CAD risk factors such as lipid metabolism, hypertension, inflammation, and type 2 diabetes, while others are involved in biological pathways not previously known to affect CAD risk. These include oxysterol transport, mitotic spindle assembly, microtubule transport, signal transduction, and anti-apoptosis.

Petrazzini said that the group tackled rare and ultra-rare coding variants in particular because these tend to affect protein function, which can directly inform the role of a protein in disease biology.

"Additionally," he commented, "there [is] a pressing need to identify novel rare variants for CAD, in particular. Over the last decade, rare coding variant association studies for CAD have had limited success."

Seamus Harrison, VP and head of medical at UK-based precision health firm Genomics Plc, who was not involved in the study, praised it as "a useful methodological advance for genetic discovery studies" and a way to improve statistical power for rare variant analysis and discovery.

Even with that improved statistical power, Harrison commented in an email, a key question in CAD research remains how to identify genetic drivers of disease beyond those that are already understood, such as those that act through lipids or blood pressure.

"I think the ISCAD algorithm will capture a lot of this, and many of the results seem to reflect this," he said. "I'd love to see the algorithm further refined to really capture processes that predispose blood vessels to atherosclerosis."

Genomics Plc has been working to refine polygenic risk scores (PRS) for cardiovascular disease and to bring these into the clinical setting. Earlier this year, the company published a study evaluating the acceptability and potential utility of adding cardiovascular PRS to clinical care routines and whether its use changes treatment decisions.

"We're certainly interested in refining EHR-derived [data] to improve both predictive performance of genetic risk scores and refinement of genetic discovery studies," Harrison said, "and we'll consider this in our ongoing efforts, though in some sense, my view is that there are now additional omic/imaging modalities that will allow us to really quantify specific endotypes of common diseases, which could be even more powerful."

Harrison also commented that he hopes to see ISCAD applied to a broader range of diseases.

Do said he does plan to apply ISCAD to other complex diseases related to CAD. The more immediate next steps, however, consist of applying ISCAD to much larger biobanks and assessing the functional roles of all of the variants found in the current study in CAD biology.

"There needs to be further investigation and interpretation of the features driving this model," Do said.

Right now, Petrazzini said, "we have this method that performs very well but we can't be 100 percent sure that these genes … actually represent CAD biology and not some confirmation bias in the EHR structure."

Another open question is how the ISCAD model performs among individuals of different ancestries. The biobank data used in this study included people from many different ancestry groups, and the model was trained on data pooled from all of them.

Do said that his group has already begun to evaluate how the ISCAD model performs in different populations separately and that the work is ongoing. He added that the researchers are also assessing how the model performs differently as it relates to social determinants of health.

Do and Petrazzini's work adds to a fast-moving body of research focused on finding genetic associations predictive of CAD risk and prognosis. Earlier this year at the American College of Cardiology's annual scientific meeting, several academic researchers presented emerging data on CAD-related PRS.

On the commercial side, UK-based PlaqueTec recently raised $8 million in private financing to continue funding the ongoing BioPattern clinical study of its Liquid Biopsy System, used to collect blood samples from CAD patients for analysis using a multiomic approach and plaque imaging.

Although ISCAD may have future commercial potential, Do said that he is not currently moving to patent the technique.

"We have spoken to some people [about] that, but we don't have any specific plans about patenting ISCAD," Do said.

"This is a new way to phenotype CAD risk," Do said, "but it should be complementary to ongoing efforts looking at genetic association on CAD as a binary disease. By no means do we think this should be replacing those efforts."