Columbia Researchers Devise Mendelian Randomization Approach Tailored to Proteomic Datasets

NEW YORK – A team led by researchers at the University of Hong Kong and the Columbia University Mailman School of Public Health has devised a new method for Mendelian randomization experiments tailored to proteomic datasets.

In a paper published this month in Cell Genomics, the researchers demonstrated the use of the approach combined with protein structural predictions by Google DeepMind's AlphaFold 3 to identify and investigate proteins linked to Alzheimer's disease.

Mendelian randomization (MR) allows researchers to establish causal relationships between exposures and outcomes of interest by leveraging the fact that alleles sort randomly when genes are transmitted from parents to their children. It is often used in situations where a traditional randomized controlled trial is impossible or impractical.

In recent years, the production of large-scale proteomic datasets like the UK Biobank's Pharma Proteomics Project (UKB-PPP) has allowed researchers to use MR to look for proteins linked to various diseases and health outcomes.

MR experiments use what are called instrumental variables (IVs), features linked to a particular exposure, to study the causal relationship between that exposure and various outcomes. In MR experiments looking for links between particular proteins and disease, protein quantitative trait loci (pQTLs), which are genetic markers linked to the expression of a particular protein, are often used as IVs. The protein expression levels linked to those pQTLs are the exposures, and the disease states of interest are the outcomes. MR analyses let researchers determine whether there is a causal relationship between the proteins (the exposures) and the disease states (the outcomes) and estimate the magnitude of that causal effect.
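To make the exposure-outcome arithmetic concrete, the sketch below computes a per-instrument Wald ratio (the pQTL's effect on the outcome divided by its effect on the protein) and then pools the ratios with a fixed-effect inverse-variance weighted average. These are standard textbook MR estimators, not the method described in the Cell Genomics paper, and the function names and summary statistics are hypothetical values chosen for illustration.

```python
import math

def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Per-instrument causal estimate: outcome effect divided by exposure effect."""
    estimate = beta_outcome / beta_exposure
    # First-order standard error; ignores uncertainty in beta_exposure.
    std_error = se_outcome / abs(beta_exposure)
    return estimate, std_error

def ivw_estimate(ratios):
    """Fixed-effect inverse-variance weighted average of Wald ratios."""
    weights = [1.0 / se ** 2 for _, se in ratios]
    total = sum(weights)
    estimate = sum(w * est for w, (est, _) in zip(weights, ratios)) / total
    return estimate, math.sqrt(1.0 / total)

# Hypothetical summary statistics for three pQTLs (illustrative numbers only).
pqtls = [
    {"beta_exposure": 0.50, "beta_outcome": 0.10, "se_outcome": 0.02},
    {"beta_exposure": 0.40, "beta_outcome": 0.08, "se_outcome": 0.02},
    {"beta_exposure": -0.25, "beta_outcome": -0.05, "se_outcome": 0.01},
]

ratios = [wald_ratio(**p) for p in pqtls]
combined, combined_se = ivw_estimate(ratios)
print(f"IVW estimate: {combined:.3f} (SE {combined_se:.3f})")
```

In this toy input every pQTL implies the same effect of the protein on the outcome, which is the situation one would hope for when all instruments are valid.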

One challenge, however, is the difficulty of identifying pQTLs that satisfy the conditions required of a suitable IV, said Zhonghua Liu, assistant professor of biostatistics at the Columbia University Data Science Institute and senior author on the study.

To be used as an IV in an MR experiment, a pQTL should, one, be associated with the protein being tested as an exposure; two, not be linked to any confounders that impact the exposure-outcome relationship; and three, have an effect on the outcome only via the protein being tested as an exposure and not through any other pathways.

As Liu and his coauthors note, only the first of these requirements "can be tested empirically by selecting pQTLs significantly associated with the protein." Given this limitation, the authors add, researchers have come up with various MR approaches designed "to handle invalid IVs."
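As a rough illustration of that first, empirically testable requirement, the sketch below keeps only candidate pQTLs whose association with the protein passes a significance cutoff. The 5e-8 threshold is the conventional genome-wide level and is an assumption here, not a value taken from the paper; the candidate names and effect sizes are hypothetical.

```python
import math

def two_sided_p(beta, se):
    """Two-sided p-value for a normal z-test of beta against zero."""
    z = abs(beta / se)
    return math.erfc(z / math.sqrt(2))

# Assumed cutoff: the conventional genome-wide significance level.
GENOME_WIDE = 5e-8

def relevant_pqtls(candidates, threshold=GENOME_WIDE):
    """Return ids of candidates significantly associated with the protein."""
    return [
        c["id"] for c in candidates
        if two_sided_p(c["beta"], c["se"]) < threshold
    ]

# Hypothetical pQTL-protein association statistics.
candidates = [
    {"id": "pqtl_a", "beta": 0.30, "se": 0.03},   # z = 10
    {"id": "pqtl_b", "beta": 0.05, "se": 0.03},   # z is about 1.67
    {"id": "pqtl_c", "beta": -0.24, "se": 0.04},  # z = 6
]

selected = relevant_pqtls(candidates)
print(selected)
```

A filter like this says nothing about the second and third requirements, which is why methods for handling invalid IVs are needed.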

Liu said, however, that none of these approaches were developed with proteomic datasets specifically in mind. He said that this limits their utility for working with such datasets.

In part, this stems from the relatively small number of candidate pQTLs available, Liu said. "You don't typically have many candidate pQTLs that you can use [as IVs], maybe five to 10, from which maybe you select four or three or two."

Existing methods for assessing IV validity work best with larger numbers of candidates, he said, noting that for MR analyses linking genetic variation to complex phenotypes like body mass index or lipid levels, this is not an issue.

"If you look at something like body mass index, there are a large number of genetic variants that can be used as [IVs]," Liu said. The number of pQTLs, on the other hand, is much smaller, he said.

This likely reflects the fact that underlying genetic variation is consolidated into a smaller amount of protein variation at the proteome level, Liu said. He added that it might also reflect the relatively small size and limited depth of proteomic datasets compared to genomic datasets.

To address the challenge of limited pQTL candidates, Liu and his colleagues adopted what they called the Anna Karenina principle, based on the novel's famous opening line that "all happy families are alike; each unhappy family is unhappy in its own way." Applied to the question of pQTLs and IVs, the notion dictates that valid IVs will all provide similar estimates of a protein's causal effect on the outcome, while invalid IVs will each provide a different estimate. Using this approach, which the authors named MR-SPI, researchers can identify valid IVs from small numbers of candidate pQTLs, Liu said.
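The intuition can be sketched as a simple voting scheme: each candidate pQTL yields its own causal estimate, two candidates "agree" when their estimates are statistically indistinguishable, and the largest group of mutually supporting candidates is kept. This is a simplified illustration of the general idea only, not the authors' MR-SPI implementation; the agreement rule, the 1.96 cutoff, and the per-pQTL estimates are all assumptions made for the example.

```python
import math

def agree(a, b, z_cut=1.96):
    """True if two (estimate, standard error) pairs are statistically similar."""
    (est_a, se_a), (est_b, se_b) = a, b
    return abs(est_a - est_b) <= z_cut * math.sqrt(se_a ** 2 + se_b ** 2)

def plurality_select(ratios, z_cut=1.96):
    """Keep the candidates that agree with the best-supported candidate."""
    names = list(ratios)
    supporters = {
        n: [m for m in names if agree(ratios[n], ratios[m], z_cut)]
        for n in names
    }
    best = max(names, key=lambda n: len(supporters[n]))
    return sorted(supporters[best])

# Hypothetical per-pQTL causal estimates as (estimate, standard error).
ratios = {
    "pqtl_1": (0.20, 0.03),
    "pqtl_2": (0.22, 0.03),
    "pqtl_3": (0.19, 0.04),
    "pqtl_4": (0.60, 0.03),   # outlier, disagrees with everyone else
    "pqtl_5": (-0.30, 0.04),  # outlier in a different direction
}

valid = plurality_select(ratios)
print(valid)
```

In this toy input the three candidates clustered near 0.2 support one another, while the two outliers each stand alone, mirroring the "unhappy in its own way" half of the principle.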

Liu said that the MR-SPI method also differs from existing approaches in that it selects IVs for specific protein-outcome pairings, with Alzheimer's disease as the outcome in the case of the Cell Genomics paper. Traditional methods typically use the same IVs for looking at causal relationships between proteins and a range of outcomes. Liu said he and his colleagues believe choosing IVs in an exposure-outcome pair-specific manner will provide more accurate results.

Commenting on the method, Maik Pietzner, a bioinformatician at the MRC Epidemiology Unit at the University of Cambridge School of Clinical Medicine, said that "the idea of selecting valid IVs based on a data-driven framework" as presented in the Cell Genomics paper "is appealing and desirable."

However, he suggested that the exposure-outcome pair-specific selection of IVs could be problematic, as it could create situations where only trans-pQTLs are selected as valid IVs. pQTLs are typically characterized as either cis, meaning that the pQTL is located close to the gene that encodes the protein, or trans, meaning that it is located farther away from that gene. While both are potentially meaningful, Pietzner, who was not involved in the Cell Genomics study, said that he and his colleagues generally avoid using trans-pQTLs as IVs because they are often nonspecific.

Applying the MR-SPI method to data from the UKB-PPP, Liu and his colleagues identified seven proteins — CD33, CD55, EPHA1, PILRA, PILRB, RET, and TREM2 — linked to Alzheimer's disease. Six of the proteins have been associated with Alzheimer's risk in previous studies.

The researchers also incorporated AlphaFold 3 into their pipeline to evaluate the potential effects of missense variations in the pQTLs selected as IVs, providing insights into protein structural changes that could be linked to the outcome being studied.

The researchers used AlphaFold 3 (AF3) to predict structural changes in the Alzheimer's-linked proteins they identified, but Liu said it remains unclear how those changes might impact the proteins' biological function.

Pietzner noted that AF3-based efforts have to date had limited success in determining when changes in amino acid sequence lead to the production of dysfunctional proteins.

"We’ve been hoping that [AF3] can distinguish benign from dysfunctional missense variants," he said, adding that it would also be interesting if AF3 could identify variants leading to changes in a protein's stability or its detectability via the affinity agents commonly used in large-scale population proteomic studies.

"A generic challenge with cis-pQTLs that encode missense variants is still to distinguish whether the affinity reagent is no longer able to bind or whether indeed the missense variant reduces the half-life or secretion of the protein into plasma," he said.

Liu said he and his colleagues are investigating the biological implications of some of the protein structural changes predicted by AF3.

"We are working on that, but we don't have any results to show yet," he said. "It's a very complicated question. We are working with [outside collaborators] to try to fill that gap between protein structural changes and Alzheimer's disease etiology."