A research team from the Howard Hughes Medical Institute and the University of California, Santa Cruz, is taking a pathway-centric approach to integrating genomic data in order to improve the classification of cancer patients into clinically relevant subtypes.
A paper in the June issue of Bioinformatics describes how the approach, called Pathway Recognition Algorithm using Data Integration on Genomic Models, or PARADIGM, identifies alterations in pathways that form the genetic roots of cancer.
“Analyses of current data sets find that genetic alterations between patients can differ but often involve common pathways. It is therefore critical to identify relevant pathways involved in cancer progression and detect how they are altered in different patients,” the authors wrote.
PARADIGM uses probabilistic inference to predict the degree to which a pathway’s activities are altered in a particular patient. The approach integrates different types of genomic measurements for a single patient — such as gene expression data, copy number data, and the like — in order to infer the activities of genes and products for a given pathway in the National Cancer Institute's Pathway Interaction Database.
The software produces a set of integrated pathway activities, or IPAs, for individual patients that can be used in place of the original genomic datasets to identify associations with clinical outcomes.
The authors claim that the pathway-based approach "improves our ability to classify samples into clinically relevant subtypes." In particular, clustering cancer patients with PARADIGM "revealed patient subtypes correlated with different survival profiles," whereas clustering the same samples using expression data or copy-number data alone "did not reveal any significant clusters in the dataset."
In the study, the researchers also compared PARADIGM with two other tools: Signaling Pathway Impact Analysis (SPIA), a software system developed by researchers at Wayne State University that also analyzes pathway activity; and Gene Set Enrichment Analysis (GSEA), developed by the Broad Institute. According to PARADIGM’s creators, the algorithm identified pathways involved in cancer better than SPIA and GSEA did and with fewer false positives.
BioInform spoke with David Haussler, director of the Center for Biomolecular Science and Engineering at UC Santa Cruz and a co-author on the paper, earlier this week. The following is an edited version of the interview.
There are a lot of diagnostic gene expression analysis/protein analysis tools on the market that are used to classify patients into cancer/non-cancer categories. Your work suggests that tools that analyze signaling pathways are better. Can you provide some background as to why that’s the case?
The current methodologies that are used by diagnostic applications have two limitations. One, they typically only use one type of data, normally gene expression data from a microarray; and secondly, the models that they use do not model the interactions between the genes in the pathways.
We know that there are a number of important pathways in cancer, the PI 3-kinase pathways, the pathways associated with p53, etc. The genes within these pathways have a pattern of interaction between them so that if one gene is inhibited, it has a particular effect on the other genes. If that gene is over-expressed, it’s going to have a different effect on the other genes. So with [these] pathways, you can actually model the biological logic underlying the interactions in the sample.
We were able to use more than one source of data and we were able to model the interactions. Those are the two main advantages of pathway modeling.
Using pathway data is a useful bioinformatics tool, but with your research you have taken it a step further and applied it to patient data. What did you have to do to get the tool to that point?
We don’t use [PARADIGM] for actually treating patients at this point; let me be clear on this, it’s strictly a research project at this time. We aren’t at that stage. But in the context of research, we can talk about some of the research projects [such as] the Cancer Genome Atlas and other projects that we’ve looked at. It took an enormous amount of work to get [PARADIGM] to the point where we could actually model the different pathways that were relevant to cancer and interpret what it was telling us.
Could you elaborate a little bit on what you had to do to get PARADIGM to model these pathways?
[We worked] with Laura Esserman at [University of California, San Francisco]. She is a brilliant cancer surgeon in the breast cancer area and she is running a very large national trial with data from many centers called I-SPY [Investigation of Serial Studies to Predict Your Therapeutic Response with Imaging and Molecular Analysis. See BioInform 04/03/2009].
The first thing we did was build a browser for the data. We have the UC Santa Cruz Genome Browser but it’s not set up to handle large numbers of expression datasets, copy number variation datasets, and so forth specifically in the cancer area, organized according to tumor samples. So we built a browser called the Cancer Genomics Browser specifically oriented towards these large-scale patient-specific or tumor-specific datasets with these types of data.
Then, as we were analyzing the data, once we could visualize it, we could start to look for particular ways that we could extract implications from the data. That led to the development of PARADIGM. The lead person involved in that was Joshua Stuart; he is a colleague here at Santa Cruz. Charles Vaske and Stephen Benz, who are graduate students, created a pathway-based tool that could then interpret all of these data that you see on the UCSC Cancer Genomics Browser.
What does PARADIGM do and how does it work?
PARADIGM is based on a probabilistic model called a factor graph. [It] is a generalization of Bayesian inference networks. It’s a probabilistic model; you can think about it as a graph. In this case, the nodes would represent activities within the cell. There’s a set of nodes for each particular gene. For example, if you look at the nodes for p53, there’s a node that says whether p53 is present or absent in the genome. There’s another node for whether it’s making a transcript or not. Then there’s a node for whether that transcript is being translated into a functional protein or not. Then there is a node that says whether that protein is modified appropriately or not.
So the nodes for each gene follow the central dogma: DNA makes RNA makes protein, and then we have post-translational modification. Since p53 interacts with other nodes, there are possibilities for then having edges in the graph that lead from these functional p53 proteins to other genes in the cell. Once you put these nodes and edges together, you have a whole graph that models the interactions and dependencies.
In particular, not only are there dependencies between genes but [also] between the nodes within a gene. Obviously, having the gene in the genome is a requirement for making the mRNA, so there’s a dependency edge between the node that represents whether the gene is there or not and the node that represents whether it’s making mRNA or not, and so forth.
So we have a bunch of nodes to represent biology. In addition to these biology-modeling nodes there are data ports. These are special kinds of input nodes where you can put in data. If we have genomic data for a gene, then there is a node that represents the value of that genomic data. It may be copy number data, for example, from a copy number chip. We can look at that data in the region of the p53 gene, for example, read that data, and we’ll get some value that reflects the copy number. That value might be close to zero or one or two or might be even bigger. So there’s a probabilistic dependency edge between the input for the copy number and the presence or absence of the gene in the tumor sample that we are looking at. Other nodes represent other kinds of data ports and edges from them provide the interactions between the input and the nodes that model the activity within the cell. For example, there’s an edge between the microarray value for the expression level of the gene and the node inside the model that represents the activity level in terms of mRNA transcripts for the gene.
In this way you have a large model, a large graph essentially with nodes and edges. The data comes in on a certain set of nodes and then it’s interpreted or influences the internal nodes of the graph, which then all interact with each other. And finally, there’s a set of readout nodes. For each pathway, there’s a kind of readout node that says what the overall activity of the pathway [is]. By doing it this way, you can put different types of data that are relevant to different aspects of the tissue, [such as] something that’s directly measuring gene expression, something that’s directly measuring what’s going on at the DNA level in the genome copy number. You can even put in mutation data that is interpreted as to whether the protein might be active or not or if it has a serious mutation that suggests that even though the gene is present and even though it's making mRNA, that mRNA may not be making a protein. All of that information is taken into account.
PARADIGM is basically a model, based on the central dogma of biology and known gene interactions, for taking measurements and converting them into pathway activities.
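The structure Haussler describes can be illustrated with a toy example. The sketch below is not PARADIGM's code: it models a single hypothetical gene with three hidden binary nodes (DNA present, mRNA made, protein active) and two "data port" observations (copy number and expression). Every probability is invented for illustration, and the marginal is computed by brute-force enumeration, which only works for a graph this small.

```python
from itertools import product

# Toy factor graph for ONE hypothetical gene. All numbers are made up
# for illustration; they are not parameters from PARADIGM.

def f_prior(dna):
    # Prior belief that the gene is present in the tumor genome.
    return 0.9 if dna else 0.1

def f_transcribe(dna, mrna):
    # Central dogma edge: mRNA is unlikely without the gene.
    p_on = 0.8 if dna else 0.01
    return p_on if mrna else 1 - p_on

def f_translate(mrna, protein):
    # Central dogma edge: active protein is unlikely without mRNA.
    p_on = 0.85 if mrna else 0.01
    return p_on if protein else 1 - p_on

def f_copy_number(dna, cn_obs):
    # Data port: observed copy-number call ("normal" or "loss").
    p_normal = 0.9 if dna else 0.2
    return p_normal if cn_obs == "normal" else 1 - p_normal

def f_expression(mrna, expr_obs):
    # Data port: observed expression call ("high" or "low").
    p_high = 0.9 if mrna else 0.1
    return p_high if expr_obs == "high" else 1 - p_high

def protein_marginal(cn_obs, expr_obs):
    """Return P(protein active | observations) by summing over all hidden states."""
    weights = {0: 0.0, 1: 0.0}
    for dna, mrna, protein in product((0, 1), repeat=3):
        w = (f_prior(dna)
             * f_transcribe(dna, mrna)
             * f_translate(mrna, protein)
             * f_copy_number(dna, cn_obs)
             * f_expression(mrna, expr_obs))
        weights[protein] += w
    return weights[1] / (weights[0] + weights[1])

if __name__ == "__main__":
    print(protein_marginal("normal", "high"))
    print(protein_marginal("loss", "low"))
```

In this toy setup, a sample with normal copy number and high expression yields a higher inferred protein activity than one with a copy-number loss and low expression, because both data ports push the hidden nodes in the same direction. A full pathway model would add edges between genes and a readout node per pathway.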
Is this the first time this type of research has been done, or have others attempted it?
There have been many attempts to model basic cellular activities with probabilistic models. There was some early work at Stanford University, in particular Daphne Koller’s work, and work by [Nir] Friedman and other notable groups that have built Bayesian inference networks and applied them to some cancer samples from the earlier days using purely expression data. These types of networks have been used in different fields as well. This is the first time, I think, that we’ve really done large-scale work on cancer genomic data with this model.
In your paper, you compare PARADIGM with another software tool called SPIA, and you report that PARADIGM outperforms SPIA. What does PARADIGM do differently?
We compared both Gene Set Enrichment Analysis (GSEA) and SPIA. SPIA uses kind of a Google PageRank method. SPIA does use pathway logic but its main limitation is that it only uses expression data, so it can’t incorporate other types of information. [GSEA] is a model that does not take into account interactions between genes; it’s looking for enrichment, so it's looking for groups of genes that contain many genes with altered expressions. And so the main difference is that PARADIGM actually models the biology that interconnects those genes, whereas GSEA just treats them as a bag of unrelated genes.
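The "bag of unrelated genes" idea can be made concrete with a small sketch. GSEA itself uses a rank-based running-sum statistic; the example below substitutes a simpler hypergeometric over-representation test, which shares the property Haussler highlights: it counts how many altered genes fall in a gene set and ignores how those genes interact. The function name and the numbers are hypothetical.

```python
from math import comb

def overrepresentation_pvalue(n_total, n_altered, set_size, set_altered):
    """Hypergeometric tail: chance of seeing at least `set_altered` altered
    genes in a set of `set_size` genes drawn at random from `n_total` genes,
    of which `n_altered` are altered. Gene-gene interactions play no role."""
    upper = min(set_size, n_altered)
    tail = sum(
        comb(n_altered, k) * comb(n_total - n_altered, set_size - k)
        for k in range(set_altered, upper + 1)
    )
    return tail / comb(n_total, set_size)

if __name__ == "__main__":
    # Hypothetical numbers: 1,000 genes measured, 100 altered,
    # and a 20-gene pathway containing 10 altered genes.
    print(overrepresentation_pvalue(1000, 100, 20, 10))
```

Because the test only sees counts, it cannot tell apart an altered activator from an altered inhibitor in the same pathway, which is the kind of distinction a model with interaction edges is meant to capture.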
In your experiment, you used data obtained on an Affymetrix platform. Have you used PARADIGM on data from any other platforms? Did you have to reconfigure the software to mesh with those platforms?
We have configured the software to work for the Illumina, Affymetrix, and Agilent platforms.
Did you have to do a lot of reconfiguring?
It was annoying. [PARADIGM's] very flexible but it’s annoying to have to adapt to different formats.
You said for now that the tool is only used for research, so when would it be available for hospitals and clinics to use?
That’s hard to say. I can’t predict at this point but I certainly hope that either this tool or a successor that we build will be useful for hospitals and clinics. I do think that this type of analysis is the way of the future for cancer diagnostics and therapy decisions. It is a very powerful way to think about how each tumor is unique but they share a common underlying biological pattern that is repeated across tumors. They all have unique mutations but the pathways that we see altered crop up again and again in different tumors.
It’s quite a large set of pathways. [Bert] Vogelstein [of the Johns Hopkins University School of Medicine] was arguing at the last AACR meeting that there were really only 12 major pathways that explained the cancers that we are currently analyzing, but there’s been a lot of discussion since and most of us disagree with that assessment. Our pathway library has thousands of pathways and we do see pathways that are significantly altered in cancers that are not always the expected pathways. Cancer does turn out to be a very complex disease, which is no surprise to most in the field.
Are you working with anyone who is currently using or planning to use the tool to analyze patient data?
We are working intensively with the Cancer Genome Atlas project, which will analyze 20 different cancers over the next two years. We’ve done glioblastoma with them [and] we’re just finishing up our analysis of ovarian cancer. We are moving on to lung cancer, colon cancer, breast cancer. So we expect to be analyzing cancer, but again at the research level.
This is an exciting era for cancer. This is the first time that we can actually go in and do a thorough genomic analysis of hundreds of tumors and find these underlying molecular commonalities. PARADIGM is a fundamental tool for understanding these molecular commonalities between cancers. This applies between samples of a particular cancer and even across different types of cancers.
What are the commercial implications of this approach?
I think in the long run, a tool like this will be potentially quite powerful for use in the diagnostic area, so I think that there are definitely long-term possibilities beyond the research but I want to emphasize, again, that it’s purely a research tool at this point.
What are next steps for you?
Well, we have a number of projects ongoing to improve the methodology and extend it to be more comprehensive, but primarily it’s trying to keep up with the increase in data and tackling all these dozens of new tumor types that we’ll be looking at in large datasets.