NEW YORK (GenomeWeb) – Researchers from the University of Cambridge and the Francis Crick Institute in the UK have used machine learning to predict the metabolomic profile of yeast strains based on their enzyme expression levels.
The study, published this week in Cell Systems, marks one of the first successful attempts to predict an organism's metabolomic status based on genomic or proteomic information, said Markus Ralser, a group leader in the department of biochemistry at Cambridge and the senior author on the paper.
Ralser said he and other researchers have been pursuing the goal of predicting cell metabolomes based on gene or protein expression for some time.
"Metabolism plays a role in so many things — metabolic diseases, cancer — and so if you have available a transcriptome or proteome, you would like to [be able to use that] to know what the metabolic state of the cell is," he said.
This has proved challenging, however, because there is typically no one-to-one relationship between the expression of enzymes and the levels of their metabolites, Ralser said.
"Every enzyme is connected to many metabolites and many metabolites are connected to many different enzymes, and so simple correlation statistics really struggle to find the correlation between gene expression and metabolism," he explained.
To address this challenge, the researchers developed an analysis method comprising 12 machine-learning algorithms, which they trained and tested on yeast enzyme expression data to build models for predicting metabolite levels.
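The difficulty with one-enzyme-at-a-time correlations can be illustrated with a small simulation. The sketch below is not the study's method or data: it uses synthetic values, and it assumes (purely for illustration) that ten hypothetical enzymes each make an equally small contribution to one metabolite. The combined pattern is built with those known weights, which a real analysis would have to learn from the data.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(1)          # fixed seed so the run is repeatable
n_samples, n_enzymes = 300, 10  # illustrative sizes, not the study's

# Synthetic, standardized "expression" values for each hypothetical enzyme.
enzymes = [[rng.gauss(0, 1) for _ in range(n_enzymes)]
           for _ in range(n_samples)]

# Assumption: every enzyme nudges the metabolite by the same small amount,
# and measurement noise is added on top.
metabolite = [0.3 * sum(row) + rng.gauss(0, 0.5) for row in enzymes]

# Variance explained by each enzyme on its own.
single_r2 = [pearson_r([row[j] for row in enzymes], metabolite) ** 2
             for j in range(n_enzymes)]

# Variance explained by the combined expression pattern (known weights).
combined_r2 = pearson_r([sum(row) for row in enzymes], metabolite) ** 2

print("best single-enzyme R^2:", round(max(single_r2), 3))
print("combined-pattern R^2:", round(combined_r2, 3))
```

Under these assumptions, no single enzyme explains much of the metabolite's variation, while the combined pattern explains most of it.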
They measured enzyme expression in 97 yeast strains, each with a different protein kinase knocked out. Ralser said he and his colleagues chose to perturb the yeast at the kinase level because kinases are known to influence metabolic processes, yet they are not themselves metabolic enzymes.
"There is a tight connection between kinases and metabolism," he said. "At the same time, we couldn't work with metabolic enzymes themselves, because if [the system] was perturbed [at the level of] metabolic enzymes, that would change the topological organization of the metabolic network."
Additionally, there are enough kinases in the yeast genome to provide the large amounts of data required for the machine-learning approaches used in the study, he said. The researchers measured the proteomes of the 97 yeast knockouts plus controls in triplicate, making for 397 total samples. From each of these, they collected expression data on 286 metabolic enzymes.
"You need a large dataset and the dataset needs to be very systematically [generated]. It needs to be very precise," Ralser said. "Because many of the concentration changes which we show to be important are not huge concentration changes, but they're significant, consistent, small or medium changes. And if you put all of them together, you start to capture the connections between enzyme expression and metabolism."
To generate highly reproducible proteomic data across the 397 samples, the researchers used a SWATH mass spec approach and Biognosys' Spectronaut data analysis software. One of the primary advantages of data-independent acquisition mass spec approaches like SWATH is their ability to produce consistent quantitative data across many samples with relatively high throughput.
"We knew from the start that we would need to have very precise measurements with very few missing values and very low [coefficients of variation]," Ralser said. "At the same time, our sample amounts are large. SWATH was a very good technology approach for achieving this."
The researchers also benefited from the fact that the metabolic enzymes they were most interested in measuring for their models are relatively highly abundant, Ralser said.
To collect metabolomic data, the researchers used SRM mass spec assays to measure 46 metabolites.
Using this data to model changes in the metabolome due to changes in enzyme expression, the researchers found that the metabolome predictions correlated with experimental values with a cross-validated R² of 0.55. This indicates that "more than half of metabolite concentration regulation … is attributable to changes in enzyme abundance," they wrote.
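For readers unfamiliar with the metric, a cross-validated coefficient of determination scores each sample using a model that never saw that sample during fitting. The following is a minimal sketch on synthetic data, assuming a single ridge-style linear model rather than the 12 algorithms used in the study; all names, sizes, and weights here are hypothetical.

```python
import random

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ridge(X, y, penalty=1e-3):
    """Least squares with an intercept and a small ridge penalty."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows)
            + (penalty if i == j and i > 0 else 0.0)  # intercept unpenalized
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def predict(w, X):
    return [w[0] + sum(wi * xi for wi, xi in zip(w[1:], r)) for r in X]

def cross_validated_r2(X, y, k=5):
    """R^2 computed from predictions made on held-out folds only."""
    n = len(y)
    preds = [0.0] * n
    for fold in range(k):
        test_idx = list(range(fold, n, k))
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        w = fit_ridge([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i, p in zip(test_idx, predict(w, [X[i] for i in test_idx])):
            preds[i] = p
    mean_y = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, preds))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Synthetic data: 120 samples, 6 hypothetical enzymes, one metabolite.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(120)]
y = [0.4 * sum(row) + rng.gauss(0, 0.8) for row in X]
r2 = cross_validated_r2(X, y)
print("cross-validated R^2:", round(r2, 2))
```

A value near 1 would mean the held-out metabolite levels are predicted almost perfectly, while a value near 0 would mean the model does no better than predicting the average.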
Ralser said the ability to link the metabolome to the proteome should be useful in unraveling the causes of various metabolomic phenomena.
"We can measure a metabolome, but we don't necessarily understand what the enzymes are, what regulatory steps are involved in achieving that metabolome," he said. "So the reason we created this model is to understand how we regulate the metabolome."
Using machine learning to model the relationship between enzymes and the metabolome, the researchers are also able to target a metabolic change of interest and work back to the enzymes that are likely regulating that change, he noted.
"We dissect, basically, the mechanisms that are between gene expression and the metabolism," he said. "And because all of those mechanisms are not working enzyme-by-enzyme, but are working over broad expression patterns that join together, there is no other way to do this. If you just look one enzyme at a time, you won't be able to capture those mechanisms. That's the great thing about the machine-learning approach.
Ralser said the researchers have begun using similar methods in human samples to see if they could be helpful for prediction or early detection of metabolic diseases, like diabetes.
They are also working to expand their dataset beyond kinase knockout strains to knockout strains for every gene in the yeast genome, and are collecting data on a larger set of metabolites as well, he said.
"We want to get a much more comprehensive picture of the regulation of metabolism, not just the [role of] kinases," he said, noting that they hope to have this expanded dataset completed by the end of the year. "There are many new things to be discovered about the regulation of the metabolome, and this kinase study was a crucial starting point because we showed that we can do it."