NEW YORK – Driven by technological advances and the desire to characterize biological systems more fully, research combining multiple types of omics data has become increasingly common. Such multiomics efforts range from individual labs to major initiatives like the Clinical Proteomic Tumor Analysis Consortium (CPTAC) and diagnostics companies such as PrognomiQ.
Extracting scientific insights from multiomic experiments remains a challenge, however, and one that is likely to grow as new technologies enable deeper and more rapid analyses and generate larger and more complex datasets.
At a basic level, multiomics experiments run into the difficulty that the tissue being sampled may not be equally useful for all levels of analysis, said Maik Pietzner, a bioinformatician at the MRC Epidemiology Unit at the University of Cambridge School of Clinical Medicine, who conducts large-scale proteogenomic experiments.
"The work we do is mostly in blood, and when we do blood proteomics, blood metabolomics, blood gene expression, and these kinds of things, [the data collected] will all cover certain areas of body physiology, but not necessarily related to each other," he said. "This is something we spotted very quickly when we went beyond one omics layer."
Pietzner cited as an example experiments aiming to combine proteomic and metabolomic measurements in blood, where in many cases the proteins found in a blood sample were not actually the proteins involved in creating and breaking down the metabolites measured in that same blood sample.
Questions around which omics measurements can be effectively made in which sample types "make everything you want to do computational-wise very tricky," Pietzner said, noting that this limits the kinds of models that can be trained using the blood-based multiomic data his lab commonly works with.
Given this challenge, Pietzner and his colleagues typically start with genetic data as what he called a "causal anchor" when integrating multiple layers of omics data.
"If we establish that a genetic variant leads to higher protein levels and also higher metabolite levels, then it is more likely that the protein is involved in the metabolism of that metabolite and possibly even in a phenotype," he said. "It's a stepwise, incremental omics integration, layering each [type of omics data] focused on the same genetic region."
Pietzner said this approach can also help researchers combine omics data from different individuals and tissue sources. He cited the example of work he and his collaborators did looking at gallstone disease, in which they combined liver gene expression data from the GTEx Portal with blood-based proteomic and metabolomic measurements, connecting changes in gene expression to alterations in the blood proteome and metabolome.
However, such clear-cut examples of multiomic data that "really nicely stacks up" are relatively rare, Pietzner said. "We need to think more about how we bridge across these different data modalities."
Pei Wang, a professor of genetics and genomic sciences at the Icahn School of Medicine at Mount Sinai, said that in her early multiomics work, she found that looking at different types of omics data was primarily useful to assess whether particular research findings were meaningful.
"We know that individual omics [data] types can be subject to biological perturbation as well as technical noise, so obtaining concordant signals across multiple modalities of data is very helpful to ensure that certain biological signals are real and significant," she said.
More recently, she and her colleagues have been using multiomics as a way to deal with the large amount of genetic heterogeneity across patient tumors. Wang, who is part of the National Cancer Institute's CPTAC initiative, said that this heterogeneity presents challenges to drug development and clinical trials as it raises questions around how to put together large patient cohorts.
By layering additional levels of omics data like transcriptomic, proteomic, and phosphoproteomic data, researchers can better identify commonly dysregulated pathways, allowing them to group patients despite the underlying genetic heterogeneity of their tumors.
"We see more consistent patterns across more tumors, which I believe is probably the more promising direction for treatment," she said.
Wang said that over the last decade, the informatics community has made great progress in developing the multivariate network analysis tools needed for such work. However, she said, new data types will require new tools. For instance, the move toward more single-cell and spatial omics studies presents new opportunities and directions for omics research but also creates a need for new data analysis approaches. "A lot of methodological development is underway," she said.
One challenge facing the field from a commercial standpoint is the wide variety of data types being used and questions being asked, which makes it difficult to create software platforms that meet the needs of broad groups of users, said Daniela Hristova-Neeley, a partner at consulting firm Health Advances who covers the life science tools space.
This means much of the multiomics data analysis work is being done by bioinformatics groups within pharma and academia who frequently develop custom approaches depending on their specific needs, she said.
"I think it's fair to say that a lot of the cutting-edge stuff does [come] from academic developers or other areas in industry, where they will make toolkits in R or Python that will take our initial data and maybe take it a step further," said Nigel Delaney, VP of computational biology at 10x Genomics. "You might want to look at SNPs from our data, or infer isoforms — something that isn't in one of our vanilla analyses but where the potential is there in our data and somebody will write a tool to exploit [it]."
Diagnostics firm PrognomiQ, which has taken a multiomic approach to the development of its early detection test for lung cancer, is relying less on informatics than on the strength of the signals provided by the individual features of its test, said Philip Ma, the company's founder and CEO.
The company used a multiomic approach in part to enable a simpler informatics workflow, he said, adding that more complex AI approaches could run into trouble when moving from small controlled sample sets to larger prospective studies in patients.
"Your AI algorithm is so complex, so subtle, that it can pick up those subtle [preanalytical and analytical] differences in your study as opposed to the biology," Ma said.
PrognomiQ expects that casting a wide net in terms of different omics data types will give it a better chance of identifying analytes that produce strong diagnostic signals that are robust to non-biological sources of variation.
The company recently published a preprint providing results from a 2,513-subject case-control study evaluating the performance of its test. It developed the test using a combination of mass spec-based proteomics, RNA-seq, metabolomics, and immunoassays, identifying 6,109 peptides, 40,171 mRNA transcripts, 9,368 intronic regions, 241 metabolites, and four proteins that differed between individuals with and without cancer. The firm is now in the process of further refining an initial model that used 682 of these biomarkers to distinguish between lung cancer cases and controls. Ma said that ultimately, the company hopes to arrive at a model that uses on the order of a dozen to a few dozen markers.
"We are trying to get down to a robust set of features that are less dependent on super complicated AI algorithms," he said.
Mount Sinai's Wang said that accessing large enough sample cohorts is another major challenge for her multiomic work. She noted that even within the well-resourced CPTAC initiative, researchers typically have access to only 100 to 200 samples per tumor type and suggested that collections on the order of 1,000 samples would be desirable.
"We are pushing for precision medicine, but to do that, we have to be very greedy in terms of sample size in order to make meaningful inferences," she said.
Garry Nolan, professor of pathology at Stanford University School of Medicine, said, however, that he has shifted his focus from generating new data to better understanding the meaning of existing datasets.
Last year, Nolan and collaborators published a paper in Nature Biotechnology detailing their MaxFuse algorithm, which allows researchers to integrate different omics datasets even in the absence of strongly linked features between those sets.
Such "bridge elements" are features present in both datasets that can be used to establish a relationship between the two. "It's like, this element of CD8 protein in this cell links to that element of CD8 protein in another cell," Nolan said.
Not all datasets come with explicit bridge elements, however. A spatial proteomic dataset and a single-cell RNA-seq dataset, for instance, would have only weakly linked elements in common. MaxFuse does not try to bridge between individual elements but instead integrates datasets by looking for relationships and commonalities in their overall structure, Nolan said.
"If you have a dataset that has structure, as long as it is sufficiently similar [to that of another dataset], this algorithm can find that structure, and where [the two structures] can be overlaid with sufficient confidence, you can make an assignment of one thing to another," he said.
Given the proliferation of different types of omics datasets and the impossibility of performing every kind of omics analysis in the same sample, such approaches to data integration will be key to moving the field forward, Nolan said.
Better tools for integrating omics data will mean researchers can make better use of existing datasets, he added. "We've reached a point where we have assayed enough blood cells that I don't need to assay them again," Nolan said. "There's sufficient diagnostic information in any set of measurements that I can infer many other things [from] that are behind the scenes that I know are there."
He cited as an example the relationship between the B-cell protein CD79b and cell localization, noting that "very minor changes in the level of expression of that protein determine where in the lymph node that cell is."
Expanding beyond CD79b to other proteins, "we find that these other proteins are telling you many other things about where in the tissue [the cell] is, as well as who its neighbors are," Nolan said, adding that this information, in turn, provides insight into things like gene expression and metabolite levels. "You don't need to know everything to know enough," he said.
Jesse Meyer, an assistant professor of computational biomedicine at Cedars-Sinai, suggested, however, that in areas like proteomics and metabolomics, more data is still needed. Generation of these kinds of omics datasets has historically lagged behind that of nucleic acid datasets due to the throughput limitations of mass spectrometry-based discovery workflows.
The field is making progress in that regard but still suffers from "limited availability of large, harmonized multiomics datasets outside of just genomics," he said.
In 2022, Meyer's lab published a study in Bioinformatics detailing its MIMaL (Multiomic integration by machine learning) software, which uses machine learning and model interpretation to identify relationships between different levels of omics data.
Using the software, "we can find things that aren't just simple linear relationships," Meyer said. He noted, however, that the software doesn't pick up specifically causal relationships and suggested that the field could benefit from putting more emphasis on building models that try to make causal inferences.
Perhaps as big a challenge as integrating multiomic datasets is extracting biological insights from those datasets, Meyer said. "There are a lot of computational methods that you can run your multiomic datasets through, but then you are still stuck as a human being trying to look at tons of numbers and put them in the context of all the literature that has ever been published," he said. "That's a really hard task, no matter what."
Nolan agreed. "The next step for me is not about generating more data; it's about helping people understand the meaning of the data," he said.
To that end, he and his colleagues have begun training large language models to analyze data they have generated and place it within the context of the existing literature.
He gave the example of how such a tool might be used to analyze genes identified in a CRISPR screen: surveying the literature, creating hypotheses with references, and suggesting validation experiments, working through on the order of a hundred genes every half hour. "It would take a graduate student never getting distracted about six months to do this," he said.
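A bare-bones version of such a loop might look like the following sketch, with the model name, prompt, and gene list all placeholders; a production system would add retrieval over the actual literature and citation checking rather than relying on the model alone.

```python
# Minimal sketch of a per-gene LLM triage loop over CRISPR screen hits.
from openai import OpenAI

client = OpenAI()
hits = ["TP53", "CD79B", "MYC"]  # hypothetical screen hits

prompt = (
    "For the gene {gene}, summarize relevant published literature, propose a "
    "referenced hypothesis for its role in this screen, and suggest one "
    "validation experiment."
)

for gene in hits:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content": prompt.format(gene=gene)}],
    )
    print(gene, resp.choices[0].message.content, sep="\n")
```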