NEW YORK – Researchers with the UK Biobank Pharma Proteomics Project (UKB-PPP) have profiled the plasma proteomes of more than 54,000 individuals, integrating that protein-level data with genomic information from the same subjects.
The effort, detailed in a BioRxiv preprint published this month, generated one of the most expansive datasets to date linking protein expression back to its genetic influences and provides a wealth of information for researchers in drug development and the life sciences more generally.
"To actually have a large-scale resource that lays out, in this huge collection of individuals … what the genetic architecture looks like … it's fantastic," said Jonathan Long, an assistant professor of pathology at Stanford University.
Long, who is not a part of the UKB-PPP, uses plasma proteomics and other approaches to study energy metabolism, focusing on metabolic hormones in blood. He said the data generated by the project would provide "very fertile ground" for his lab's investigations.
One of the primary aims of the UKB-PPP effort was to identify protein quantitative trait loci (pQTLs) — links between genetic variants and plasma protein levels. Such work has become feasible in recent years with the development of affinity-based platforms from Olink and SomaLogic capable of measuring thousands of proteins in plasma samples from tens of thousands of individuals, bringing the depth and throughput of proteomic studies to a level where they can be meaningfully integrated with large-scale genomic datasets.
PQTLs are typically characterized as either cis, meaning that the pQTL is located near the gene that encodes that protein, or trans, meaning it is located farther away. A cis pQTL in many cases reflects the influence on a protein of the gene that codes for it, while trans pQTLs may reflect other phenomena, such as changes to other proteins that interact with the target protein or are in a signaling pathway with it. The hope is that pQTLs can help researchers map the connections between genetic variation and protein expression changes and, ultimately, disease.
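As a rough illustration of the cis/trans distinction described above, the following Python sketch labels a variant by its distance from the gene encoding the measured protein. The 1 Mb window, the `Gene` type, and the `classify_pqtl` function are assumptions made for this example; the article says only that a cis pQTL lies "near" the gene, and this is not presented as the UKB-PPP team's actual method.

```python
from dataclasses import dataclass

# Assumed cutoff for "near": the article does not give a distance,
# so 1 Mb is used here purely for illustration.
CIS_WINDOW_BP = 1_000_000


@dataclass(frozen=True)
class Gene:
    """Hypothetical record for the gene that encodes the measured protein."""
    chrom: str
    start: int  # gene start coordinate, in base pairs
    end: int    # gene end coordinate, in base pairs


def classify_pqtl(variant_chrom: str, variant_pos: int, gene: Gene,
                  window: int = CIS_WINDOW_BP) -> str:
    """Return 'cis' if the variant lies within `window` bp of the gene, else 'trans'."""
    # A variant on a different chromosome can never be near the gene.
    if variant_chrom != gene.chrom:
        return "trans"
    # Same chromosome: check whether the variant falls inside the padded gene interval.
    if gene.start - window <= variant_pos <= gene.end + window:
        return "cis"
    return "trans"


if __name__ == "__main__":
    gene = Gene(chrom="7", start=100_300_000, end=100_400_000)  # made-up coordinates
    print(classify_pqtl("7", 100_000_000, gene))  # within 1 Mb of the gene
    print(classify_pqtl("12", 5_000_000, gene))   # different chromosome
```

A distance rule like this only sorts variants by location. It says nothing about mechanism, which is why trans signals need follow-up work to explain.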
The UKB-PPP team profiled the plasma proteomes of 54,306 individuals using Olink's Explore 1536 platform to measure 1,463 unique proteins per individual. Using this data for pQTL mapping, they identified 10,248 pQTLs, 1,163 of them cis and 9,085 of them trans.
The researchers noted that the number of cis pQTLs plateaued at roughly the number of proteins measured once sample sizes reached around 5,000 participants, whereas the number of trans pQTLs continued to grow and showed no sign of plateauing even as the study hit 54,000 subjects. Additionally, they said, as the sample size grew larger, they observed more "genomic regions harboring associations with multiple proteins … indicating greater detectability of pleiotropic loci at increased study sizes."
This suggests the UKB-PPP dataset could provide deeper insight into less direct mechanisms impacting protein expression, said Maik Pietzner, a bioinformatician at the MRC Epidemiology Unit at the University of Cambridge School of Clinical Medicine.
"Previous studies have been very much powered to see [cis pQTLs], but what this paper really contributes is that we are able to see more of what is encoded elsewhere in the genome," he said. "I think that is probably the strongest contribution of this study compared to previous [pQTL] studies."
Pietzner, who was not involved in the UKB-PPP effort, was first author on a paper last year in Nature Communications in which he and his colleagues measured 4,775 unique proteins in 10,708 subjects using SomaLogic's SomaScan platform and 1,069 proteins in 485 subjects using Olink's platform. They identified 547 pQTLs, 108 of which were unique to Olink and 91 of which were unique to SomaLogic.
Christopher Whelan, associate director and head of translational genetics at Biogen and senior author on the preprint, said that he expects the dataset will prove useful for drug target discovery as well as biomarker discovery and basic inquiries into disease biology.
Benjamin Sun, associate scientific/medical director at Biogen and first author on the study, noted that the data presented in the preprint "is just the tip of the iceberg, really," adding that it is now available to the broader scientific community, which can bring its own specific questions and additional modes of analysis to it.
"The scope is massively beyond what we could represent in one preprint," he said.
While the preprint did look at some disease-linked pQTLs, including gene-protein relationships involved in COVID-19 and cardiovascular disease, Whelan said that, given the pre-competitive nature of the UKB-PPP effort, the researchers are leaving more in-depth exploration of such relationships to outside researchers and participating pharma firms.
He noted that "on Biogen's end we are certainly looking at disease associations involving the central nervous system, for example."
Stanford's Long said the UKB-PPP dataset provided a starting point for investigations into molecules of particular interest to his lab.
"We have single, high-priority plasma proteins that we are really interested in understanding," he said. "And so, now, if we look up something we are interested in, we are getting not only a genetic signature at that locus, but you are also getting all these trans loci that are somehow determining the levels of that molecule. And that becomes very fertile ground for us to do basic science investigations of the regulation of that pathway."
"Genetics captures the blueprint of health and disease, and proteomics captures the end products of that blueprint, so in many ways they are two sides of the same coin," Whelan said. "Proteomics is in many ways the technology that I think we need to help unlock the full potential of genomics."
Proteomics has only recently achieved the scale and throughput to make such population-scale proteogenomic studies possible, and, in fact, existing technologies are able to interrogate only a small sliver of the plasma proteome. Olink's current Explore platform measures around 3,000 proteins, while the current version of SomaLogic's SomaScan measures around 7,000 proteins. High-throughput mass spec workflows, meanwhile, typically top out at around 500 proteins, though studies indicate that Seer's Proteograph platform can raise that to the 1,500 to 2,000 range. Given that there are roughly 20,000 protein-coding genes in humans, even the broadest of these assays leaves most of the proteome unmeasured.
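For a sense of scale, the approximate figures quoted above can be turned into rough coverage percentages. The Python sketch below simply divides each platform's stated protein count by the roughly 20,000 protein-coding genes. The names `PLATFORM_PROTEINS` and `coverage_percent` are invented for this example, and the one-protein-per-gene denominator is a simplifying assumption that ignores proteoforms and overlap between platforms.

```python
# Approximate figures as quoted in the article; none are exact counts.
PROTEIN_CODING_GENES = 20_000

PLATFORM_PROTEINS = {
    "Olink Explore": 3_000,
    "SomaLogic SomaScan": 7_000,
    "High-throughput mass spec (typical)": 500,
    "Seer Proteograph (upper end of range)": 2_000,
}


def coverage_percent(proteins_measured: int,
                     total: int = PROTEIN_CODING_GENES) -> float:
    """Share of protein-coding genes represented, as a percentage.

    Simplifying assumption: one protein per gene.
    """
    return 100.0 * proteins_measured / total


if __name__ == "__main__":
    for name, count in PLATFORM_PROTEINS.items():
        print(f"{name}: about {coverage_percent(count):.1f}% of protein-coding genes")
```

Under these assumptions, even the largest panel covers roughly a third of protein-coding genes.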
Also significant is the fact that many proteins exist in multiple forms, exhibiting alterations like amino acid variants or truncations or post-translational modifications. Each of these so-called proteoforms may function differently in the cell, meaning that, ideally, researchers would know not only what protein they are looking at but also what specific proteoforms.
In the case of pQTL studies, the existence of different proteoforms may lead to situations where a pQTL mapped via two different platforms shows different effects depending on the platform used. For instance, in the Nature Communications study, Pietzner and his colleagues observed that a missense variant in the gene PILRA was inversely associated with PILRA protein expression as measured by the Olink platform, whereas the same genetic signal was positively correlated with two proteoforms of the same protein measured by the SomaScan platform and was not associated at all with the canonical version of PILRA measured by SomaScan. The authors suggested that this discrepancy stemmed from differences in the binding of the two platforms' reagents to the different PILRA proteoforms.
The Nature Communications study showed "that there is immense benefit to combining proteomics platforms," Whelan said, adding that he and his colleagues aim to perform a multi-platform study in a portion of the UK Biobank sample set. They plan to use the Olink Explore platform, SomaLogic's SomaScan platform, and three different mass spectrometry setups to analyze 1,250 samples.
This effort will help the researchers evaluate how much of the human proteome they can capture by using the multiple platforms and assess the performance of newer mass spec-based workflows.
Regarding mass spec, Whelan said that he and his colleagues are particularly interested in the technology's ability to identify specific proteoforms and post-translational modifications, "which can be very important for neurological conditions, in particular, which we are very interested in at Biogen."
Pietzner said he would also like to see the field expand beyond the mainly European population analyzed in the UK Biobank project (and similar large-scale efforts by Decode Genetics) into other ethnicities. He noted, though, that while other countries have similar biobanks, few if any offer the UK Biobank's combination of size and ease of access.
"That is why this resource is so fruitful and important for the community," he said.
The next phase of the UKB-PPP effort will use Olink's Explore 3072 assay, which will boost total protein measurements to 2,926 proteins. The researchers also plan to analyze 4,500 samples collected roughly 10 years after the initial sampling, which will allow them to collect longitudinal data and assess how proteins change over time due to aging, disease, and other factors.
Whelan said he expects data from this stage of the project to be available in late 2023.