Q&A: NYU’s Bud Mishra on a Data Mining Algorithm that Models Temporal Interactions among Genes


Complex biological processes like cell division, metabolism, and development occur in sequence and must be synchronized to ensure that cells function properly, but reconstructing these processes from genome-wide datasets is computationally challenging.

Researchers from Virginia Tech, New York University, and the University of Milan have developed an algorithm that borrows techniques from mathematical optimization, data mining, and computational biology in order to reconstruct temporal models of cellular processes from gene expression data.

In a recent paper describing the method, called Gene Ontology based Algorithmic Logic and Invariant Extractor, or GOALIE, the researchers said that the algorithm “aims to assist an experimentalist to track which genes are under coordinated temporal regulations, how their expression persists and dynamically varies over time, hence revealing insight into the progression of events constituting a given process.”

The paper, which was published online in June in the Proceedings of the National Academy of Sciences, describes how the researchers used GOALIE to model temporal metabolic and cell cycle relationships in time-course gene expression datasets from the common yeast Saccharomyces cerevisiae.

The team used GOALIE to analyze five yeast gene expression datasets related to cell division, metabolism, and the yeast response to various chemical stresses.

This week, BioInform spoke with Bud Mishra, a professor of cell biology at the NYU School of Medicine and one of the paper’s co-authors. Mishra is currently an investigator on a $10 million National Science Foundation Expeditions grant to develop novel computational reasoning tools for complex systems, focusing on systems ranging from biological organs to complex diseases, as well as engineered systems.

The following is an edited version of the interview.

In your paper you mention that GOALIE builds on previous research. Could you give me some background on the work you did leading up to developing this algorithm?

This work actually goes back to the very early days of systems biology. About 10 years ago, [the Defense Advanced Research Projects Agency] started a research project to study computational approaches to understanding biological processes. We were one of the first set of PIs [working on] creating biological models and checking the models using computational methods.

A year into the project, 9/11 happened and DARPA wanted us to get involved in understanding various [issues] related to biowarfare. In particular, they wanted us to be able to figure out what happens between host-pathogen interactions if human kidney cells are exposed to anthrax. So [DARPA researchers] created a dataset where at different time points, they took gene expressions from this host-pathogen interaction. The idea was to be able to tell what was happening in this process and of course they wanted to know how to intervene.

Various sorts of [approaches that] would be considered classical approaches now were tried. People tried to do reverse engineering, Bayesian networks, dynamic Bayesian networks, but none of those really worked. During that period, I came up with this idea to use what I call Kripke models, annotate them with gene ontologies, and through that, try to come up with models. Those models were very revealing. It actually broke up the whole process into several parts [showing] what happened in the anthrax and human interactions.

[In this paper] there are all these time course datasets that were collected by different groups not coordinated with each other. One was studying yeast cell cycle; another one was studying yeast metabolic cycle; and others were studying stress response to hydrogen peroxide and menadione. [We wanted] to see if we can take those models and combine them to create a much bigger model.

One big goal [of this research] is to use it to understand diseases. We are [currently] funded by an NSF expedition grant. The part of the project that we are involved in is to study pancreatic cancer. We know that there are multiple processes involved in the progression of pancreatic cancer [such as] inflammation, apoptosis, hypoxia, anaerobic glycolysis, and there are all sorts of signaling processes.

Lots of the groups have datasets and some understanding of each of these processes but nobody really understands how these all fit together. For example, clearly hypoxia and signaling in response to hypoxia and anaerobic glycolysis should be part of one story not three different stories. How does that happen?

So this paper really is a baby step toward understanding the technology and the methodology to go after these key [questions]. An even bigger goal would be to take electronic health records and time course data on patients over a long period and do very accurate phenotyping.

What’s the timeline for the NSF grant?

We are into our first year and we have four more years.

And are the funds primarily to work on GOALIE?

[The grant] has much broader goals. It is to work on model building and model checking but there are two different kinds of applications it focuses on. One is biological and the other is engineering. In biology, it’s focused on pancreatic cancer and atrial fibrillation. So it’s a fairly large set of goals but GOALIE is one of the key components.

What’s GOALIE’s role in the study?

We are focused on pancreatic cancer so the goal is to essentially do what we did in this paper but related to various biological processes involving cancer. For example, in pancreatic cancer we are going to study inflammation, apoptosis, hypoxia, glycolysis, signaling.

Do you anticipate that you will have to make significant changes to GOALIE to apply it to cancer data?

I think all the theoretical bases are there but I do expect a number of new things that have to be done because I have some experience with some data we had with breast cancer. I believe cancer to be a very heterogeneous process. A lot of these cells are giving birth and dying so their populations are changing, which was definitely not the case with [yeast].

In cancer communities, people talk about the cancer subway map. But if it is such a random process why are these things so determined? It’s mostly because the individual processes interact in very specific ways so that would be another interesting thing to understand. We have not really thought about how to take into account epigenetics data or interactions with gene isoforms so there are all sorts of things like that, that we believe will force us to change some of the basic components of GOALIE and also we will need to work with much more complex experimental data.

So I think there is a lot of work ahead of us but the basic framework is [there]. Many of these ideas have been developed for almost 20 years now. The basic understanding will remain the same; it’s the details that will change.

Could you elaborate a little bit on how GOALIE works?

GOALIE’s goal is to create a phenomenological model. That means it doesn’t pretend to tell you whether this gene causes these other genes to do something or this gene is a master regulator. It also doesn’t pretend that it knows all the actors, so there could be microRNA, or a genetic effect, there are all these invisible things in the cell that [GOALIE] doesn’t know about.

Phenomenologically, what the biological process is doing, is trying to take the system through many different states. Once it’s in a state, it does some complicated choreography and then moves out of that state to a different state. In each state I know that the system is trying to achieve a set of functions so I am going to be able to label each state in terms of the functions that are being done in that state. If you’re a logician, you would call this a Kripke model. If you are a computer scientist, you would call it a labeled finite state automaton model.
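As a rough illustration of the structure Mishra describes, a Kripke model can be written down as a set of states, a set of labels attached to each state, and a transition relation between states. The sketch below is not GOALIE code; the class name `KripkeModel`, its methods, and the state labels are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class KripkeModel:
    """A labeled state-transition structure (illustrative only, not GOALIE's)."""

    # state name -> labels (for example, functional annotations) true in that state
    labels: dict[str, set[str]] = field(default_factory=dict)
    # state name -> names of states reachable in one step
    transitions: dict[str, set[str]] = field(default_factory=dict)

    def add_state(self, name, labels):
        self.labels[name] = set(labels)
        self.transitions.setdefault(name, set())

    def add_transition(self, src, dst):
        if src not in self.labels or dst not in self.labels:
            raise KeyError("both states must exist before adding a transition")
        self.transitions[src].add(dst)

    def holds(self, state, label):
        """True if the given label is attached to the given state."""
        return label in self.labels[state]

    def successors(self, state):
        return self.transitions[state]


# A toy two-state cycle; the labels are placeholders, not results from the paper.
model = KripkeModel()
model.add_state("oxidative", {"respiration"})
model.add_state("reductive", {"glycolysis"})
model.add_transition("oxidative", "reductive")
model.add_transition("reductive", "oxidative")
```

Queries such as `model.holds("oxidative", "respiration")` or `model.successors("reductive")` are the kind of primitive that logical analysis of such a model builds on.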

So we want to create just that kind of cell cycle model, or in metabolic cycle, the reductive state and oxidative state and then some transitions. So how do we do that? The idea is that if you give me lots of data that observes RNA expression over time, I am going to take all of that and compress it into a small set of models. So if you give me cell cycle data, I should be able to create the cell cycle model.

So there are essentially three steps. One is to find the critical transition points where I am going from one state to another. I need to find those and then I can [break] them up into states and state transitions, and then, depending on which genes are active together, I can go into gene ontology and find the most important genes in those states.

But the data I am going to get will have all sorts of biological and instrumental noise. We use rate-distortion theory to remove the [noise] or the distortion. That says that I am going to break up into states so that when I go through my model and create a path, that path should be very similar to the experimental data. So the distortion between the experiment and what my model creates should be as small as possible or should be within some tolerable limit.

The simple thing to remember is that there are some statistical criteria that [are] determined by this information theory on distortion. And I can take those statistical criteria and turn this into a complex optimization problem. And then we solve the optimization problem using a very classical optimization theory.
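To make the idea of choosing transition points so that the model stays close to the data more concrete, here is a deliberately simplified stand-in: it splits a single time series into a fixed number of contiguous segments so that the total squared error between the data and each segment's mean is as small as possible, using dynamic programming. This is an assumption-laden sketch, not the rate-distortion formulation used in GOALIE, and the function names `segment_cost` and `best_segmentation` are hypothetical.

```python
def segment_cost(prefix, prefix_sq, i, j):
    """Squared-error distortion of approximating x[i:j] by its mean."""
    n = j - i
    s = prefix[j] - prefix[i]
    sq = prefix_sq[j] - prefix_sq[i]
    return sq - s * s / n


def best_segmentation(x, k):
    """Split x into k contiguous segments with minimal total squared error.

    Returns (total_distortion, boundaries), where boundaries lists the start
    index of every segment after the first.
    """
    n = len(x)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(x)")

    # Prefix sums let us evaluate any segment's distortion in constant time.
    prefix, prefix_sq = [0.0], [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    inf = float("inf")
    # cost[m][j]: minimal distortion when x[:j] is covered by m segments
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = cost[m - 1][i] + segment_cost(prefix, prefix_sq, i, j)
                if c < cost[m][j]:
                    cost[m][j] = c
                    back[m][j] = i

    # Walk the back-pointers to recover where each segment starts.
    starts, j = [], n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    starts.reverse()
    return cost[k][n], starts[1:]


# Toy expression profile with one obvious change in level.
profile = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
distortion, boundaries = best_segmentation(profile, 2)
```

In this toy profile the single boundary falls where the level changes. A real analysis would involve many genes at once and a principled way to pick the number of states, which this sketch leaves out.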

There are other data-mining tools out there. Did you compare GOALIE to any of these other tools?

Yes, in the paper there is some comparison. There are tools to do time course analysis. Within computer science, there are two strands of thought on how to think about time and computation. One strand uses logic and model theory to understand computational processes. Another is machine learning and artificial intelligence that tries to understand the statistical connections and correlations; these are things like graphical models. There are a lot of tools in systems biology that use Bayes nets and dynamic Bayes nets and do time analysis. There are also tools to understand basically how I take a set of genes or clusters and connect them to the ontologies.

As far as I know, there is no tool that puts you into this framework of trying to create a dynamic temporal model right out of the data that can then go feed into this logical analysis. This is the only tool that I know. And we are able to extend this so that you can also do causality analysis.

It’s more of a logical theory. In most of the analysis that you can do in Bayes models or graphical models, you can talk about causality, for example. You can say things like A causes B. But you can’t really say a complex set of activities causes another complex set of activities.

So, for example, if I want to say that if you repeatedly commit crimes in California and you are convicted at least three times, that will cause you to go to jail without a chance of parole. The cause and effects are fairly complex temporal statements in this. We can capture that.
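The three-strikes example can be read as a temporal statement over a sequence of events: once a third conviction has occurred, imprisonment without parole must eventually follow. A minimal sketch of checking that statement on one finite trace is shown below. The function name and the event names `convicted` and `jailed_without_parole` are made up for the illustration, and this is not how GOALIE itself expresses or checks such properties.

```python
def three_strikes_holds(trace):
    """Check one finite trace against the three-strikes statement.

    trace is a list of sets of event names, one set per time step.
    Returns True if every conviction that is the third or later is followed,
    at the same step or afterwards, by "jailed_without_parole".
    """
    convictions = 0
    pending = False  # True while a required jailing has not yet been seen
    for events in trace:
        if "convicted" in events:
            convictions += 1
            if convictions >= 3:
                pending = True
        if "jailed_without_parole" in events:
            pending = False
    return not pending


satisfied = three_strikes_holds(
    [{"convicted"}, set(), {"convicted"}, {"convicted"}, {"jailed_without_parole"}]
)
violated = three_strikes_holds([{"convicted"}, {"convicted"}, {"convicted"}, set()])
```

Here `satisfied` is `True` and `violated` is `False`. The point of the example is that both the cause (a count of repeated events) and the effect (something that must eventually happen) are temporal patterns rather than single variables.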

What are your next steps in terms of the pancreatic cancer research?

Well the next goal is to, with my biological collaborators, design really well thought-through experiments so that we can apply this [algorithm]. If the experiment is not well designed, if you don’t sample cells in the right way or synchronize them, the data is useless.

On the other hand, even with all the next-generation sequencing technology, it’s still an expensive experiment so you don’t want to waste your money trying to do too much. So the next goal is to design an experiment and try to keep the cost below $100,000 or $50,000.

Can other researchers access GOALIE for their projects?

Yes. As far as I know there are three different versions of GOALIE because there are several collaborators; one in Virginia and one at the University of Milan. They have their own versions of GOALIE. So it is available from these groups.
