Head of Computational
Pathway informatics has come into its own in the past year as an effective tool for analyzing high-throughput data within the context of biological interactions. Berlex Biosciences is one of many companies that have added pathway and network analysis to their drug-discovery programs, and BioInform recently spoke to Hugh Salamon, head of computational biology at Berlex, to discuss his group's use of biological interaction networks to study groups of functionally related genes.
Salamon explained what researchers really mean when they say they're studying "pathways," the pros and cons of specific tools in the field, and how Berlex judges the effectiveness of the computational biology group.
Can you describe the general research goals for the department of computational biology at Berlex, and how pathway analysis supports those goals?
We function within systems biology to integrate modeling and analysis with experimentation. Our work in pathway analysis has focused for the past four years on gene-set detection in transcript expression data. We have developed some straightforward methods to use protein relationships in our analyses. This past year it's really come together nicely, proving to be useful in multiple areas of research.
When you say gene-set detection, does that refer to sets of genes that are found in similar pathways?
A gene set comprises genes that have similar functions or that belong to a common pathway; the former has helped us in one case. [This summer] I presented [some work] at Beyond Genome focused on how biological interaction networks implicitly define interesting gene sets.
Gene sets are interesting simply for the pragmatic reason that statistics can be developed for analysis of classes, that is, sets of data. Importantly, we are able to build statistical models, so that we can interpret analytical results on gene sets. A model-based approach contrasts with visual examination of large networks of relationships.
So that's a key point. We have built the gene-set testing into models, as opposed to pipeline bioinformatic information.
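As a rough illustration of what a statistical test on a gene set can look like, the sketch below computes a one-sided permutation p-value for the mean expression score of a set. It is not Berlex's method, which the interview does not spell out; the function name, the per-gene scores, and the gene identifiers are all hypothetical.

```python
import random
from statistics import mean


def gene_set_permutation_p(scores, gene_set, n_perm=2000, seed=0):
    """One-sided permutation p-value for the mean score of a gene set.

    scores   -- dict mapping gene id to a per-gene statistic (assumed input)
    gene_set -- iterable of gene ids forming the set under test
    """
    members = [g for g in gene_set if g in scores]
    if not members:
        raise ValueError("gene set contains no measured genes")
    observed = mean(scores[g] for g in members)

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    all_scores = list(scores.values())
    hits = 0
    for _ in range(n_perm):
        # Null model: a random set of the same size drawn from all genes.
        sample = rng.sample(all_scores, len(members))
        if mean(sample) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: 17 background genes near zero, 3 set members shifted up.
scores = {f"g{i}": ((i * 7) % 11 - 5) / 10 for i in range(17)}
scores.update({"a": 3.0, "b": 2.8, "c": 3.2})
p_value = gene_set_permutation_p(scores, ["a", "b", "c"])
```

Because the test uses every gene's score rather than a thresholded list of differentially expressed genes, the result is a single interpretable probability per gene set.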
Can you describe what you mean by models, then? These are statistical models?
At this point they're statistical models. This approach contrasts with, say, some of the work at Gene Network Sciences, Genstruct, or Entelos, where they are using different methods to provide more causative modeling. There's an interesting connection. Inference is at the heart of science, [so I] expect to see many approaches in this new, complex domain of knowledge and data.
Can you clarify what you mean by pathways? Do you differentiate between interaction networks and canonical signaling pathways?
There are three types of information mentioned in pathway analysis discussions: canonical pathways, networks, and gene classes or sets.
Rather than try to define 'pathway,' what we've done is ask what do people mean when they say pathway in a particular instance, and what are they really trying to understand? In drug discovery, we can ask, 'What can you infer about a protein's function in a disease or treatment situation that isn't a direct measurement of that protein itself?' So we asked in what situations could we do exactly that? Can we identify the potential role of a protein from measurement of those things that interact in some way with that protein? Whether they are dependent on it for regulation, or whether they simply bind to it, do they tell us a story about it?
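One simple way to picture how an interaction network implicitly defines gene sets is to treat each protein's interaction partners as a set. The sketch below does only that, under assumed inputs: the edge list, the protein names, and the `min_size` cutoff are hypothetical, and the interview does not say how Berlex actually derives its sets.

```python
from collections import defaultdict


def neighbor_gene_sets(edges, min_size=2):
    """Map each protein to the set of its direct interaction partners.

    edges    -- iterable of (protein_a, protein_b) pairs, treated as undirected
    min_size -- drop proteins with fewer partners than this (assumed cutoff)
    """
    partners = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        partners[a].add(b)
        partners[b].add(a)
    return {p: frozenset(n) for p, n in partners.items() if len(n) >= min_size}


# Hypothetical network: P1 interacts with A, B and C; A also interacts with B.
edges = [("P1", "A"), ("P1", "B"), ("P1", "C"), ("A", "B")]
sets_by_protein = neighbor_gene_sets(edges)
```

Measurements on the members of `sets_by_protein["P1"]` could then be tested as a group to ask about the role of P1 without measuring P1 itself.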
How successful would you say this approach has been so far?
I'd say that in the last six months we've been extremely successful at understanding some specific disease biology; consistent information that is not available through other tools has come out of our current computational biology approach to pathway analysis. But it's still early. For example, we're still working on how to integrate numerical results into good visualizations, because that matters. Also, we face the challenge of maintaining increasing amounts of information that must be drawn on for analysis: knowledge bases, experimental data, and analytics. That's where our interest in Oracle comes from, because we feel that the way data is typically managed moves it far from analytics, and that isn't necessarily what we want to see in the future.
Are you using Oracle's network data model for this work?
We're exploring it. We are at a stage where we will want to take a good concept and make it useful for hundreds of analyses in the future. In other words, can we use data analyses to create a good memory for drug-discovery efforts in the company? Or are they just these one-off analyses? Can we help the company remember what it's discovered about biology?
What is the breakdown at Berlex between in-house informatics tools and third-party tools?
I believe that companies like Berlex should simply bring in modular pieces of technology that help. No one vendor can provide you the solution for managing all information for your drug discovery in your indication of interest in your existing company; it just doesn't happen that way. Data integration is successful when it helps the data analyst, so data analysis planning in our department involves discussions on data models and deep data integration. Once we focus on our models, we find that technology can be incorporated in a modular fashion, is replaceable, and reduces our dependence on specific technology modules. Had we gone in the direction of someone else's thinking, rather than providing customized innovation to meet our therapeutic research collaborators' needs, we probably wouldn't have succeeded in providing the results that we needed to provide in a timely fashion. And we certainly wouldn't have discovered some important possibilities for the future.
How about for the pathway informatics tools? Are you finding that they plug into this modular approach that you've developed?
In some instances. For example, the data content from Ariadne Genomics is extremely interesting. However, we chose to do our own calculations on these data in order to use them systematically, by employing statistics we trust, and we're very encouraged by what we're seeing. We're interested in many other providers and types of content. We're interested in the visualization tools we could build with Cytoscape, whether that's a platform that we could work with, and perhaps one to which we could contribute.
The pathway databases are well developed, but what have you come across in terms of effective tools for mining and comparing that data?
We built our pathway-analysis tools, and recently we were challenged to compare them to Gene Set Enrichment Analysis, and I was actually surprised at how much more power we could show. We also have found through testing platforms that our hypothesis-testing approach provides results far easier to interpret than the output of generic tools composed of bioinformatic pipelines, which digest lists of differentially expressed genes. We implement methodology directly into analytical results shared with therapeutic researchers, de-emphasizing tools used by others without our involvement as data analysts.
Another outcome of developing effective analysis as a part of ongoing research is that we become good members of the scientific community and participate in computational biology as a science. This is especially important to me because the scientific emphasis facilitates bringing excellent people into the company.
So it sounds like any tools that you're developing would eventually be available to the broader community.
Certainly the methods are fully described; they are mathematically completely described. Whether the software itself gets out there, for example as a plug-in to Cytoscape, is a business decision.
Agilent has contributed a plug-in to Cytoscape, so you wouldn't be the only commercial contributor.
Oracle has a plug-in, as does Agilent, so I think this is a new direction. I think that the way that the [Cytoscape] software license was designed using the LGPL was very smart, because it says you can use it, and you can connect to it with things that are not open source. This is good, because if you try to force everything to be open source then the commercial world usually just ignores it.
How does Berlex determine that your group is making an impact? What are the criteria?
The interesting thing to consider is how should specific criteria for success be determined, when the real business outcome is 15 years away? Therapeutic researchers simply want to be able to judge whether they like working with us. But I need to measure some business impact to evaluate my team's contribution. So I try to say, 'How many new targets are going into confirmation because of something we helped with? Do we see new concept research, do we see new directions because of our presence?' It's very hard to predict [the] longer-term impact of computational biologists, even harder to measure what the future would be if we didn't bring both systematic data analysis and systems thinking to bear on our drug-discovery challenges.
It must be hard for computational biology groups to claim credit for specific compounds.
I can't even do it for target discovery because of the collaborative nature of this business. However, that might be the key difference between integrated interdisciplinary research and trying to set up a technical platform with the hope of avoiding so much research. I don't think that medicine and biology can be engineered or computed away, but I do think we can learn more quickly and also learn qualitatively different biology, biology that would be missed without deeper data analysis. I am sure that some degree of personalized medicine, better targeted medicine, will become a reality with the crucial help of innovative data analysis. When it comes to the data analytic approach, the way I like to measure its success currently is that we go back to systems that we understand, and we say, 'Do we find at least what we ought to find?' And the interesting thing is, you don't always. That is how we realized that collections such as [the] Gene Ontology were really limited. Gene Ontology and KEGG are very useful sometimes, but they're just too small by themselves, because there are many biological systems where, when we analyze them, the story isn't in a Gene Ontology group or a KEGG metabolic pathway. We realized that we must gain access to more data out of the literature, and that's when we started to really listen to the companies that are creating libraries of information.
A disease may be poorly understood because the biology's poorly understood, and therefore you're not going to find anything in the literature that's useful. Or it may be that the data is there, but the context has not been provided for a human to gain insight using the data. Until early this year, I don't think the systematically cataloged information was big enough to provide context for most of our data analysis challenges. Our assessment in early 2004 was that we weren't ready to license any of it because when we looked at data-content products, we found plenty of simple information, or 'findings,' just reading abstracts, that was missing. Now, in 2005, it's looking as if natural-language processing is more effective, and people are doing more hand-curation, or their teams are catching up. So I think we're only seeing the beginning of getting enough information.