Three recently published research studies highlight the importance of linking genetic and protein expression data in the context of disease phenotypes in order to identify actionable protein targets for drug development.
While many drugs are developed based on genomic findings, these drugs commonly fail in the clinical phase because they do not deliver the results expected of a successful treatment. This is because most drugs target proteins, not genes.
Although gene transcripts encode the information to make specific proteins, the amount of gene transcript does not always correlate with the amount of protein. On the other hand, focusing on protein levels alone can’t indicate whether any biological changes are a cause or consequence of the disease being studied.
Researchers are finding that a combined approach that integrates genomics with proteomics can overcome these challenges.
The Hunt for Disease-causing Proteins
In a recent study published in Nature Metabolism, an international team of researchers led by Anders Mälarstig, Director of Human Genetics and Computational Biomedicine at Pfizer and researcher at the Karolinska Institute in Solna, Sweden, combined genomics and proteomics to map genetic variants that control the levels of 90 human cardiovascular proteins. They analyzed genome-wide sequencing data collected from 15 different studies spanning a total of 30,931 participants and compared these data to quantitative readouts of protein expression levels. By interrogating the two datasets, the researchers identified protein quantitative trait loci (pQTLs), genetic variants that control protein abundance.
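To make the idea of a pQTL concrete, the sketch below regresses protein level on allele dosage for a single variant. It is a minimal illustration, not the pipeline used in the study: the function name `pqtl_effect`, the dosage coding (0, 1, or 2 copies of an allele), and the toy numbers are all assumptions, and a real analysis would also adjust for covariates and correct for testing many variants.

```python
from statistics import mean

def pqtl_effect(dosages, protein_levels):
    """Least-squares slope of protein level on allele dosage (hypothetical helper)."""
    if len(dosages) != len(protein_levels):
        raise ValueError("need one protein measurement per genotype")
    g_bar = mean(dosages)
    p_bar = mean(protein_levels)
    # Sum of squared deviations of the genotype around its mean
    sxx = sum((g - g_bar) ** 2 for g in dosages)
    if sxx == 0:
        raise ValueError("variant does not vary in this sample")
    # Co-deviation of genotype and protein level
    sxy = sum((g - g_bar) * (p - p_bar) for g, p in zip(dosages, protein_levels))
    return sxy / sxx

# Toy data invented for illustration: six participants, one variant, one protein
dosages = [0, 0, 1, 1, 2, 2]
levels = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
beta = pqtl_effect(dosages, levels)
print(f"estimated change in protein level per allele copy: {beta:.2f}")
```

A slope that is reliably different from zero suggests the variant influences the abundance of that protein, which is what qualifies it as a candidate pQTL.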
High-throughput quantification of protein levels is tricky. Scientists often estimate protein levels using Western blots or mass spectrometry, neither of which is conducive to measuring multiple proteins and large sample cohorts at the same time. Olink proteomic assays can overcome these shortcomings, however.
To simultaneously quantify the levels of numerous cardiovascular proteins, Mälarstig’s team used the Olink proximity extension assay (PEA) panel. PEA panels contain protein-specific antibody pairs for up to 92 different proteins. Each antibody is linked to a DNA reporter molecule. Upon incubation with the sample, the paired antibodies bind to their target proteins and the DNA reporter molecules hybridize to form a new PCR target sequence for real-time PCR or NGS readout.
Mälarstig and colleagues mapped a total of 451 pQTLs linked to the regulation of 85 cardiovascular proteins. The large sample size enabled them to identify large numbers of both cis-pQTLs (genetic variants located very close to the gene encoding the protein of interest) and trans-pQTLs (genetic variants located far away from the gene encoding the protein of interest). In total, the team identified cis-pQTLs for 75 proteins and trans-pQTLs for 73 proteins.
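The cis/trans distinction comes down to a distance check between a variant and the gene encoding the protein. The sketch below shows one way to express it. The 1-megabase window is an assumption chosen for illustration (the article does not state the cutoff the authors used), and the function name and coordinates are hypothetical.

```python
# Assumed cis window: 1,000,000 base pairs on either side of the gene.
# The study's actual cutoff is not given in this article.
CIS_WINDOW_BP = 1_000_000

def classify_pqtl(variant_chrom, variant_pos, gene_chrom, gene_start, gene_end,
                  window=CIS_WINDOW_BP):
    """Label a pQTL as 'cis' or 'trans' relative to the protein-coding gene."""
    # A variant on a different chromosome is always trans
    if variant_chrom != gene_chrom:
        return "trans"
    # Same chromosome: cis only if the variant falls within the window around the gene
    if gene_start - window <= variant_pos <= gene_end + window:
        return "cis"
    return "trans"

# Hypothetical coordinates, for illustration only
print(classify_pqtl("1", 1_500_000, "1", 2_000_000, 2_050_000))   # near the gene
print(classify_pqtl("7", 1_500_000, "1", 2_000_000, 2_050_000))   # other chromosome
```

Widening or narrowing the window changes how many variants count as cis, so the threshold is a design choice that should be reported alongside the results.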
But pQTLs alone cannot predict the potential of a protein to play a role in disease unless both the genomic and proteomic components can be robustly linked to disease phenotypes.
To achieve this, the researchers employed Mendelian randomization, a statistical framework that combines genetic, protein, and phenotypic data to predict whether a protein plays a causal role in disease. It is essentially the equivalent of running multiple clinical trials without human participants or confounding factors.
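One of the simplest Mendelian randomization estimators is the Wald ratio, which divides a variant's effect on disease by its effect on the protein. The sketch below is an illustration under stated assumptions, not the authors' method: the article does not say which estimator they used, the function name and input values are hypothetical, and the standard error uses a first-order approximation that ignores uncertainty in the variant-protein effect.

```python
import math

def wald_ratio(beta_protein, beta_disease, se_disease):
    """Wald ratio estimate of the protein's effect on disease (hypothetical helper).

    beta_protein: effect of the variant on protein level (the pQTL effect)
    beta_disease: effect of the same variant on disease risk
    se_disease:   standard error of beta_disease
    """
    if beta_protein == 0:
        raise ValueError("variant must affect the protein to serve as an instrument")
    estimate = beta_disease / beta_protein
    # First-order approximation: treats beta_protein as known without error
    se = se_disease / abs(beta_protein)
    z = estimate / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return estimate, se, p_value

# Invented summary statistics, for illustration only
estimate, se, p_value = wald_ratio(beta_protein=0.5, beta_disease=0.1, se_disease=0.02)
print(f"causal estimate: {estimate:.2f} (SE {se:.2f}, p = {p_value:.1e})")
```

The estimate is only interpretable as causal if the variant affects disease solely through the protein, which is why strong, well-characterized pQTLs are preferred as instruments.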
Mälarstig and colleagues applied Mendelian randomization to the identified pQTLs and used biobanked genetic data on 38 common diseases to assess pQTL causality. They identified 25 proteins, including 11 previously unidentified proteins, as causal in a variety of human diseases, including rheumatoid arthritis, osteoporosis, and diabetes.
Linking Protein to Phenotype in a Range of Diseases
The study is one of several recent efforts by researchers using pQTLs and Mendelian randomization in a systems biology approach to identify causal proteins underlying an assortment of human diseases.
In a study published in July 2020 in PLoS Genetics, a team of researchers led by Chris Haley, professor of Genetics and Molecular Medicine at the University of Edinburgh, identified 154 genetic variants controlling the levels of 249 circulating proteins in the blood. They collected plasma samples from two separate European cohorts that were previously genotyped, and detected protein levels using an approach similar to the one used by Mälarstig and colleagues.
In addition to cardiovascular protein panels, Haley and his collaborators assessed the levels of key inflammation proteins using the Olink Target 96 Inflammation panel. In contrast to Mälarstig, Haley’s team focused on identifying locally acting cis-pQTLs. They reasoned that such variants provide information on directly druggable protein targets. They identified 64 cis-pQTLs shared between the two cohorts of participants.
Using Mendelian randomization, they assessed the potential of these pQTLs to be directly involved with 846 diseases and traits. In total, they detected 38 pQTLs involved in 509 different human disease traits including schizophrenia and cardiovascular disease.
Integrating Proteomics and Epigenomics
The hunt for variants that control disease-causing proteins is not limited to genomics studies. Researchers are increasingly aware that epigenetic factors play an important role in changing the expression of genetic variants. Downstream, this affects protein abundance and disease. The push to identify causal variants is part of a systems biology approach that investigates interactions between the genome, proteome, and environment as well as their influence on phenotype, in this case, disease pathogenesis.
A study published in July 2020 in Genome Medicine explored this idea further. A team of researchers led by Riccardo Marioni, professor of Genetics and Molecular Medicine at the University of Edinburgh, performed both genome-wide and epigenome-wide association studies on the levels of 70 circulating inflammatory proteins from 1,017 healthy older adults. They studied samples from the Lothian Birth Cohort, a longitudinal aging study of individuals born in 1936 in Scotland.
The team identified 13 genetic variants associated with 13 proteins and three epigenetically modified sites across three proteins. Interestingly, genetic variants and epigenetic methylation patterns explained almost equal parts of the variation observed in protein levels: 45 percent and 46 percent, respectively. Combined genetic and epigenetic variants accounted for 66 percent of the variance in protein expression, hinting at the cumulative importance of assessing both parameters in disease. By plugging these findings into a Mendelian randomization framework, the authors implicated two proteins in inflammatory bowel disease and one protein in Crohn’s disease.
System-level studies such as these require mountains of data to determine significant associations. Researchers across the globe now share their respective data to create a cumulative resource for assessing disease pathogenesis. One such effort is the SCALLOP consortium, an ongoing international research effort that aims to identify novel molecular connections and protein biomarkers that cause disease.
The growing consortium comprises 29 principal investigators from 24 different research institutions, who are compiling one of the largest resources to date for patient data and control samples. Investigators such as those in the SCALLOP consortium map pQTLs for hundreds of genetic variants and proteins to uncover new and more effective drug targets.