NEW YORK – Researchers from the University of Cambridge and elsewhere have developed genetic scores to predict complex human traits from multiomic data and validated these scores across cohorts of individuals of European, Asian, and African American ancestries.
Multiomic tools capture a range of data — transcriptomic, proteomic, metabolomic, and more — and are key to understanding the etiology of diseases. However, such analysis is expensive and time-consuming.
"Many low-resource settings in low-income countries don't have any multiomics data," said co-corresponding author Michael Inouye, director of research at the department of public health and primary care at Cambridge. "Our findings are important as they democratize multiomics data and make it possible for everyone to benefit," he added.
For their study, published in Nature on Wednesday, Inouye and colleagues used data from the INTERVAL study, which collected serum or plasma samples from participants and assayed them on five omics platforms to generate proteomic, metabolomic, and transcriptomic data: SomaScan, Olink Target, Metabolon HD4, Nightingale, and whole-blood RNA sequencing on the Illumina NovaSeq 6000. These participants were also genotyped, and, after quality control, 10,572,788 genetic variants were available. Using machine learning, the researchers developed genetic scores for 17,227 biomolecular traits, 10,521 of which reached Bonferroni-adjusted significance.
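As a rough illustration of the two ideas in the paragraph above, a genetic score can be thought of as a weighted sum of a person's allele counts, and a Bonferroni adjustment divides the significance level by the number of tests performed. The Python sketch below assumes this simple additive form. The function names, weights, genotypes, and the 0.05 significance level are hypothetical and are not taken from the study, whose machine learning models and exact testing procedure are not described here.

```python
def genetic_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1, or 2 copies per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Hypothetical example: three variants with made-up effect weights.
weights = [0.5, -0.25, 1.0]
person = [2, 1, 0]  # allele counts for one individual
score = genetic_score(person, weights)  # 2*0.5 + 1*(-0.25) + 0*1.0

# Assumption: one test per trait score at alpha = 0.05, across 17,227 traits.
threshold = bonferroni_threshold(0.05, 17227)

print(score, threshold)
```

In this toy setup, a trait's score would count as significant only if its p-value fell below the much stricter per-test threshold rather than below 0.05.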
Next, the researchers validated these genetic scores in various cohorts of people of East Asian, South Asian, African American, and European ancestries.
"Overall, we found that genetic scores developed in INTERVAL could predict the levels of Nightingale and SomaScan traits in individuals of Asian or African American ancestry, but, as expected, the performances of these scores were significantly reduced relative to European-ancestry cohorts," the authors wrote in their paper.
The researchers used their approach to generate a synthetic multiomic dataset for the UK Biobank, which they then analyzed in a phenome-wide association study (PheWAS) based on PheCodes.
They identified 18,404 associations between genetic scores for the various traits and 18 categories of PheCodes. Circulatory, endocrine, metabolic, and digestive diseases yielded the largest number of associations across platforms, according to the researchers.
The PheWAS also detected many known blood biomarkers of disease as well as other notable associations. For example, total cholesterol was significantly associated with myocardial infarction, as were genetically predicted levels of IL-6R in both the Olink and SomaScan datasets, the researchers found.
The researchers noted that even genetic scores of apparently low predictive value may be powerful enough to detect true associations at the sample sizes of current and forthcoming biobanks.
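One way to see why sample size matters here is the standard t-statistic for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2), which grows with the square root of the sample size. The sketch below is only an illustration of that textbook formula, not the study's own power analysis. The correlation of 0.02 and both sample sizes are hypothetical values chosen for the example.

```python
import math


def correlation_t_statistic(r, n):
    """t-statistic for testing whether a Pearson correlation r differs from zero."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 2:
        raise ValueError("n must be greater than 2")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Hypothetical weak correlation between a genetic score and a phenotype.
weak_r = 0.02

t_small_cohort = correlation_t_statistic(weak_r, 1_000)    # modest cohort
t_biobank_scale = correlation_t_statistic(weak_r, 500_000)  # biobank-scale cohort

print(t_small_cohort, t_biobank_scale)
```

With these assumed numbers, the same weak correlation falls well short of conventional significance in the smaller cohort but gives a large test statistic at biobank scale.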
Highlighting the study's limitations, however, Inouye said that the training sets for the machine learning models need to include data from individuals of diverse demographics and ancestries. "Only this will lead to more equitable analysis and findings," he added.
The researchers have compiled their findings in an open resource portal called OmicsPred. "Although OmicsPred provides a key first step towards a better understanding of the distributions of clinically or therapeutically important biomarkers under high genetic control, more research is needed to understand to what extent genetic scores for multiomic traits may one day be of clinical use," the authors wrote.