NEW YORK (GenomeWeb) – Researchers from the University of Connecticut, Harvard Medical School, and The Harker School have developed a method that borrows from the field of image processing to identify differentially expressed genes in array- and sequencing-based datasets from heterogeneous samples.
According to a recently published Bioinformatics paper describing the method, which was also the subject of a poster at this year's American Association for Cancer Research meeting, EMDomics uses the Earth mover's distance (EMD) to measure differences in the expression of a given gene in two sample groups, for example, a drug-sensitive versus drug-resistant tumor type. Specifically, the method computes what's called an EMD score for each gene based on differences in the distribution of the gene's expression between the sample groups, then determines the statistical significance of the score by computing a permutation-based estimate of the false discovery rate.
EMDomics, according to the paper, offers an alternative to existing statistical methods for differential gene expression that are based on parametric models, which don't work well when the data does not fit the underlying model's assumptions, as can be the case with heterogeneous datasets.
In evaluation tests that applied the tool to simulated and real datasets, the researchers report that in both cases EMDomics was better able than existing methods to capture differentially expressed genes and gene sets in heterogeneous sample groups. In one test in particular that involved array-based gene expression data from ovarian cancer samples, EMDomics found more statistically significant drug-resistance genes than methods based on traditional statistical approaches for measuring differential gene expression such as SAM and Limma. EMDomics also performed better on RNA-sequencing datasets, identifying more statistically significant drug-resistance genes than edgeR, a competing method.
The issue with existing statistical methods is that they are based on the assumption that there is a single expression distribution for resistant cases and a single distribution for sensitive cases, explained Andrew Beck, an assistant professor in the department of pathology at HMS' Beth Israel Deaconess Medical Center and one of the authors on the EMDomics paper. In other words, they assume that a single mechanism drives gene expression in the same way in sensitive cases and resistant cases. However, tumors have very high molecular heterogeneity, and since the structure of that heterogeneity isn't known up front, it can't be incorporated in existing standard supervised statistical models, the researchers wrote in Bioinformatics. As a result, traditional methods that try to assume a single distribution break down, Beck said.
The difference with EMDomics is that it does not depend on specific characteristics of the distribution, for example, mean and variance, Sheida Nabavi, an assistant professor in UConn's computer science and engineering department and a co-author on the paper, explained. Rather, it compares the entire distribution function of the gene's expression in two sample groups and measures the overall difference between the two distributions, she said. It computes what's called an EMD score, which is a measure of the distance between two normalized distributions that reflects the minimum cost of evening out the two distributions. A commonly used analogy for thinking about EMD involves two dirt mounds, one larger than the other; the EMD algorithm measures the amount of work needed to move enough dirt from the larger pile to the smaller pile to even things out.
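To make the dirt-moving analogy concrete, here is a minimal Python sketch of the Earth mover's distance between two one-dimensional histograms. It is an illustration only, not the EMDomics code: the function name `emd_1d` and the `bin_width` parameter are hypothetical, and it assumes both histograms are defined over the same equal-width bins.

```python
from itertools import accumulate


def emd_1d(hist_a, hist_b, bin_width=1.0):
    """Earth mover's distance between two histograms on identical, equal-width bins."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must share the same bins")
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each histogram needs positive total mass")

    # Normalize so both "dirt piles" have total mass 1.
    p = [x / total_a for x in hist_a]
    q = [x / total_b for x in hist_b]

    # Sweep left to right: the running surplus is the dirt that has to be
    # carried across each bin boundary, and each carry costs one bin width.
    carried = accumulate(a - b for a, b in zip(p, q))
    return bin_width * sum(abs(c) for c in carried)


print(emd_1d([1, 0, 0], [0, 0, 1]))  # 2.0: all of the mass moves two bins
print(emd_1d([2, 2], [1, 1]))        # 0.0: same shape once normalized
```

In one dimension the minimum-cost transport plan reduces to this running-sum form, which is why no optimization solver is needed here.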
The input to EMDomics is a matrix with genes arranged in rows and samples in columns; users also have to indicate which sample group each column belongs to. Within each sample group, the expression values of each gene are converted into a histogram, and the EMD score is then computed between the gene's histograms in the two groups. The output is a second table that lists the genes, the EMD score associated with each gene based on the difference in distribution between the sample groups, and the statistical significance value, or q-value, associated with the EMD score.
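The workflow described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the tool's implementation: the function names, the ten-bin histograms, and the toy data are all hypothetical, and the last column is a plain per-gene permutation p-value standing in for the permutation-based q-value that EMDomics reports.

```python
import random


def emd_score(values_a, values_b, n_bins=10):
    """Histogram both groups on shared bins, then take the 1-D EMD."""
    lo = min(min(values_a), min(values_b))
    hi = max(max(values_a), max(values_b))
    width = (hi - lo) / n_bins or 1.0  # guard against a constant gene

    def hist(values):
        counts = [0] * n_bins
        for v in values:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
        return [c / len(values) for c in counts]

    carried = total = 0.0
    for pa, pb in zip(hist(values_a), hist(values_b)):
        carried += pa - pb      # surplus mass pushed into the next bin
        total += abs(carried)
    return total * width


def score_genes(matrix, gene_names, labels, n_perm=200, seed=0):
    """Return rows of (gene, EMD score, permutation p-value), highest score first."""
    group_a, group_b = sorted(set(labels))  # expects exactly two groups
    rng = random.Random(seed)

    def split_and_score(row, lbls):
        a = [v for v, g in zip(row, lbls) if g == group_a]
        b = [v for v, g in zip(row, lbls) if g == group_b]
        return emd_score(a, b)

    table = []
    for gene, row in zip(gene_names, matrix):
        observed = split_and_score(row, labels)
        shuffled = list(labels)
        hits = 0
        for _ in range(n_perm):
            rng.shuffle(shuffled)  # break the link between sample and group
            hits += split_and_score(row, shuffled) >= observed
        table.append((gene, observed, (hits + 1) / (n_perm + 1)))
    return sorted(table, key=lambda r: r[1], reverse=True)


# Toy input: 2 genes (rows) x 40 samples (columns), 20 samples per group.
labels = ["sensitive"] * 20 + ["resistant"] * 20
matrix = [
    [float(i) for i in range(20)] + [float(100 + i) for i in range(20)],
    [float(i % 5) for i in range(40)],
]
for row in score_genes(matrix, ["GENE_A", "GENE_B"], labels):
    print(row)
```

Shuffling the group labels gives a null distribution of scores for each gene; turning those into false discovery rate estimates across all genes is a further step that this sketch leaves out.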
In terms of how this relates to identifying differential gene expression, if for example a gene shows the same expression distribution in each of the two sample groups, then it's ignored, Beck explained. But if the gene is involved in drug resistance or sensitivity, it might have a bimodal distribution in one sample group and a different profile, such as a unimodal or normal distribution, in the other. In this instance, EMDomics would score that difference and provide the q-value as well as fold change information, he said. Users can then go on to explore the genes prioritized by the method in greater detail.
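A toy example shows why a comparison of means can miss the bimodal case Beck describes. The expression values below are invented for illustration, and `emd_sorted` is a hypothetical helper that uses an unbinned shortcut (valid for two equal-sized one-dimensional samples) in place of the histogram-based score the tool computes.

```python
from statistics import fmean


def emd_sorted(sample_a, sample_b):
    """Unbinned 1-D EMD for equal-sized samples: mean gap between sorted values."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this shortcut needs equal group sizes")
    gaps = [abs(a - b) for a, b in zip(sorted(sample_a), sorted(sample_b))]
    return fmean(gaps)


# Hypothetical gene: bimodal in resistant tumors, unimodal in sensitive ones.
resistant = [2.0] * 10 + [8.0] * 10
sensitive = [5.0] * 20

print(fmean(resistant) - fmean(sensitive))  # 0.0: the group means are identical
print(emd_sorted(resistant, sensitive))     # 3.0: the distributions still differ
```

A test built on the difference in means would see nothing here, while the distance between the full distributions is clearly nonzero.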
The method does have some limitations. For example, it requires at least 20 samples per group, which is more than is required for conventional methods, Beck said. Another limitation of the method is that it only shows that a gene is expressed differently in one sample group versus another, he said. With conventional methods, a user can make inferences about whether the expression is higher or lower in one group versus another. So while "we prioritized a lot of really interesting genes that were functionally enriched with EMDomics, it's not as straightforward an interpretation as a T-test is, [for example]."
Although this first application of EMDomics focused on gene expression, the developers believe that they could use it to explore other kinds of data and they have begun looking at some of the possibilities. One of the applications the UConn and HMS researchers are considering is single-cell analysis where it could be used to compare the distribution of cell populations across patients, for example, Beck said. "We've already had some preliminary data and it's really exciting." At least one other researcher has used EMD for single-cell analysis. That approach used EMD along with two other algorithms to classify acute myeloid leukemia-positive patients and healthy donors using flow cytometry data — this was for one of the community challenges organized for the 6th Dialogue for Reverse Engineering Assessments and Methods contest.
The UConn and HMS team has also begun using EMDomics to explore associations between gene expression and risk factors such as body mass index and how these differ across groups, Beck said. The developers have also updated EMDomics so that it works with data from multiple classes at a time, he added.