Gil Alterovitz, research fellow, Harvard Medical School and the Massachusetts Institute of Technology
Gil Alterovitz, a research fellow at Harvard Medical School and the Massachusetts Institute of Technology, focuses on developing new computational methods for analyzing biological networks and signal-processing techniques for proteomics.
He is affiliated with the Harvard Medical School Partners' Center for Genetics and Genomics and the Children's Hospital Informatics Program, and his projects blend aspects of engineering, computer science, and medicine.
Alterovitz recently published a case study on how his lab is using The MathWorks' Matlab software to support its research into identifying biomarkers for ovarian cancer. He is also scheduled to present a paper on the analysis of biological networks at the annual meeting of the American Medical Informatics Association, Nov. 11-15 in Washington, DC.
BioInform spoke to Alterovitz this week to find out more about his work.
Tell me about your research. I understand that you’re working on identifying protein biomarkers for early detection of ovarian cancer?
That is one of our areas. The emphasis is on developing new computational methods, using Bayesian approaches and other techniques, to find biomarkers or areas related to disease. Ovarian cancer is one of the prototypical disease areas we have looked at because outcomes are so poor when it is not discovered early; yet if it is discovered early, we can make a much bigger difference.
If you discover ovarian cancer late, say in stage 4, there is a less than 50 percent chance of surviving five years. However, if you discover it early, in stage 1, there is around a 95 percent chance of surviving. So if we can detect it early, the outlook is much more promising for patients.
What are the primary computational and informatics challenges associated with this work?
There are a lot of challenges that we are looking at in this area of proteomics, and in this particular case, in the field of mass spectrometry. The advantage of the technology is that you are able to look at proteins in parallel: you can take a sample of blood, for example, and look at many proteins at the same time, as opposed to traditional approaches where you might study just one protein at a time.
The disadvantage of mass spectrometry, for example, is that the result comes back as peaks and valleys, and it is hard to tell which of those peaks is associated with which protein, and which valley is associated with noise or other issues. Also, sometimes more than one peak can be associated with the same protein. So resolving these issues, and trying to identify the relevant proteins — the ones that do not just exist everywhere but exist in the disease case and not in the normal case, as opposed to noise — that is one of the challenges in the field.
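The first step he describes, telling real peaks from noise, is classically done by picking local maxima that clear a noise threshold. Real pipelines (such as Matlab's bioinformatics toolbox, mentioned later in the interview) also smooth, baseline-correct, and deisotope the spectrum; this minimal Python sketch shows only the core idea, on made-up numbers.

```python
# Minimal sketch of mass-spectrum peak picking: a peak is a local
# maximum whose intensity clears a noise threshold. Illustrative only;
# real pipelines also smooth and baseline-correct the spectrum.

def pick_peaks(intensities, noise_threshold):
    """Return indices of local maxima above the noise threshold."""
    peaks = []
    for i in range(1, len(intensities) - 1):
        left, here, right = intensities[i - 1], intensities[i], intensities[i + 1]
        if here > left and here >= right and here > noise_threshold:
            peaks.append(i)
    return peaks

# Toy spectrum: two clear peaks plus low-level noise wiggles.
spectrum = [0.1, 0.3, 5.0, 0.4, 0.2, 0.5, 8.2, 0.6, 0.3, 0.2]
print(pick_peaks(spectrum, noise_threshold=1.0))  # -> [2, 6]
```

The hard part in practice is the threshold and the fact that one protein can produce several peaks, which is exactly the ambiguity he raises.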
How are you addressing these challenges?
The idea we had was to look for certain probabilistic patterns across many mass spectra together — across similar patient samples. That way, we are able to see what is noise and what is not, and to see which of these peaks move together across patients of the same type. Let's say we look at all the peaks that exist together in all the normal patients — and even within the normal patients there will be some diversity and some variability — if we're able to see which proteins are moving together, those might be affiliated with similar pathways, and then we can try to identify proteins using that information.
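A simple proxy for "peaks moving together" across patients is the correlation of their intensities over a set of samples. The lab's actual approach is Bayesian and works on full spectra; this is only an illustrative Python sketch with invented intensity values.

```python
# Sketch of "peaks moving together": Pearson correlation of two peaks'
# intensities across patient samples. Highly correlated peak pairs may
# sit in the same pathway. Illustrative only; the lab's actual method
# is Bayesian and operates on whole spectra.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Intensities of three peaks measured in five normal patients (made up).
peak_a = [1.0, 2.0, 3.0, 4.0, 5.0]
peak_b = [2.1, 3.9, 6.2, 8.0, 9.9]   # tracks peak_a closely
peak_c = [1.0, 2.0, 1.0, 2.0, 1.0]   # unrelated to peak_a

print(round(pearson(peak_a, peak_b), 2))  # close to 1.0
print(round(pearson(peak_a, peak_c), 2))  # close to 0.0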
So it sounds like you’re working directly with the raw spectra rather than identifying the proteins first. I know there’s some debate in the field of proteomic biomarker discovery regarding that issue.
That has definitely been an issue. Some people are of the camp where you can essentially think of it as a fingerprint. So in a fingerprint, you may not know exactly which ridges are defining the fingerprint, but if you look at one fingerprint and look at another fingerprint, you can compare the two and think that they are the same without actually identifying a particular feature.
Other people believe that that is looking at it like a black box. But if we look underneath, we may be able to identify fundamental biological mechanisms. So if we know, for example, that a particular ridge is causing the fingerprint to be identified as such, this information can lead to more optimized tests that only have to look at maybe one or two proteins. This can lead us to better understand the mechanisms involved. Then, we might be able to find other proteins that were not immediately obvious in the results we first observed, but are actually more important in giving us prognostic information for patients.
So which camp do you fall into?
I have taken both approaches, but I think I would fall into the camp of the latter approach, which involves trying to dig deeper than the black box and trying to identify the proteins. So the way we are trying to do this is by taking the initial fingerprint, then looking at many of these fingerprints across time to see how they change, and then mapping that onto different networks involving different proteins.
That way we can try to identify proteins and potential pathways that might be interesting for further investigation, which could involve other techniques. You could look at it with proteomics, or see which genes are involved and use other methods.
I understand that you have a collaboration with the Mathworks. Are you using their software to map this information onto networks?
We use their software for many of our projects, including signal processing for mass spectra and distributed computing for large datasets. Since a lot of people in the lab have an engineering background, we have always written our code in Matlab in the first place. A couple of years ago they came out with a bioinformatics toolbox, and that made it even easier for us. Before these tools came out, we would sometimes have to work with other programs and try to interface them with Matlab, which tended to be a little more cumbersome. Now we are able to work directly within Matlab, so the process is faster and more efficient.
So you’re also using Matlab as a development platform in addition to using the bioinformatics toolbox?
The way I see Matlab is as a meta-platform. For example, Java is underneath and even runs part of Matlab — they have a JVM, a Java virtual machine, that draws many of the components in Matlab — and one can call Java classes directly from Matlab. This is also the case with Perl, so if you have a Perl script you can call it from Matlab. There are a few other languages embedded in there, so you can integrate them directly.
For example, if I find that a student is really much more interested in programming in Java, then they can program in Java, and we can integrate that tool with the work of other people who like to program in Matlab. That is kind of unusual, because usually when you pick one language you have to stick with it, and here you can use different ones. If you like R, there is a conduit to talk to R. So there are many possibilities. This reduces the learning curve for students and increases the lab’s productivity by letting people focus on research rather than on learning a preferred programming language.
You mentioned that many of the people that you work with in the lab have an engineering background. What kind of interaction do you have with bench biologists?
We have that as well, but they are usually not the ones doing the programming. They do the wet lab biology and give us the data, and we do the analysis. They may also engage in high-level discussions with us to try to understand and predict which pathways might be important.
What role do you find that interaction plays in developing new computational methods?
They are usually the inspiration for why we will start on one method versus another. I have found in the past that if researchers were not interacting with biologists, they might try to solve a problem that no one really cares about.
Other times bioinformatics researchers have solved problems [where], yes, it was a problem that they were interested in solving, but they did not quite solve it the way the biologists would like it solved. When one has a solution, one might have input parameters and so forth, and those parameters might be numbers that are non-intuitive to a biologist. So if bioinformatics researchers do not translate the solution and its inputs into something a biologist can deal with, then the users will not be able to use that solution effectively. And they will always need a special interface to that solution, which is not as effective.
That is one of the other advantages of a meta-platform. For example, there [are] ways to build GUIs automatically and web servers, and to interact with other types of web applications and other programs. So you can build interfaces that might be more comfortable for someone who does not want to get into the code but wants to be able to use the tool.
So, for example, we have built interfaces with Java, we have built interfaces with Matlab’s interface designer, and also with the web — HTML interfaces to databases. All those are possible.
So it depends who the target user is.
Your web site mentions that you have a partnership with Microsoft as well. Is that related to this interface work?
That is a partnership that started last year. We are part of their academic alliance, so they give us almost all of their software to use for academic and research purposes, so that people in our lab can use it, along with some of the educational resources they have.
It is just starting, but there has been talk of possible collaborations in the future between members of one of their groups and parts of our group here.
It is helpful for the lab. I think we have almost a thousand CDs through this program. And we can use it in our research — database-creation programs and so forth — and in some sense, that makes them sort of competitive with a lot of the free solutions out there. By having this academic alliance, we are able to test out and use commercial software.
It seems like it’s still kind of unusual for bioinformatics developers to be open to commercial tools.
A few years ago, avoiding commercial tools was probably more of a trend, because there just were not that many tools out there and the tools were very generic. I think the tools are now starting to really mature, so there is a lot more customization available, there are much nicer user interfaces, and you are able to do scripting that you could not do before without customized solutions.
Basically, in the past people could not find tools that were useful for what they needed to do, so they had to start from scratch. But starting from scratch is, in some sense, a waste of effort. It can be much more efficient to start with a building block, because then you are able to get to the science much faster. So the more you can build on something and work at a higher level of abstraction as you design the experiment or program, the more fundamental questions you will be able to answer, rather than focusing on developing the tools.
And maintaining them.
Maintaining is also a big issue. When someone says, ‘Let’s make a database,’ I say, ‘OK, but who is going to curate it?’ You do not just build a database, you have to maintain a database. It is like a car — you do not just buy the car, you have to maintain the car. That is a large part of the effort in databases. In fact, some companies even make a living from maintaining databases. For example, Unleashed Informatics gives out many databases for free. Its business model is based, in part, on maintaining database servers.
What’s on your plate right now? What are your short-term goals in terms of methods development in this area?
In terms of methods, we are trying to focus on using Bayesian methods to look at the dynamics of networks — how networks are changing over time, and how they are evolving.
Are these mostly protein-protein interaction networks?
There are other types of networks that we are looking at, like gene regulation networks and metabolic networks.
Are you primarily using proteomic data?
There could be other sources of data, like microarray data for gene regulation. For the metabolic networks, there is a lot of data already out there — we just have to integrate it and put it all together.
We are working on a multi-modality visualization of protein interaction and other networks in 3D and 4D — across time. We are working on a few related projects as well.
It seems like, as this information grows ever more complex, visualization is going to become more of a challenge.
That is actually the topic of a conference presentation we just had accepted. The paper asks: how can we look at the core of what is really important in a complex network? If you just look at the network, it is what some people call a big hairball, and that is not really informative. It might be pretty if you color it nicely, but you cannot really figure out what is going on there. It just looks like a mess.
So, what we want to do is extract structure from that, and we have a method that we are going to be presenting at AMIA — the American Medical Informatics Association meeting in Washington, DC. Through this approach, we are able to focus in and compress the network in a way that maintains the most important variability within the network.
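The interview does not spell out the AMIA method, but a common generic way to pull a "core" out of a hairball is spectral: power iteration on the adjacency matrix converges to its leading eigenvector (eigenvector centrality), whose large entries mark tightly interconnected nodes. This Python sketch is that generic technique on a toy graph, not the paper's algorithm.

```python
# Generic spectral sketch for surfacing the dense core of a network:
# power iteration on the adjacency matrix converges to its leading
# eigenvector; large entries mark tightly interconnected nodes.
# This is plain eigenvector centrality, NOT the AMIA paper's method.
from math import sqrt

def leading_eigenvector(adj, iterations=100):
    """Power iteration on a (non-negative, connected) adjacency matrix."""
    n = len(adj)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Nodes 0-2 form a triangle (a dense core); node 3 hangs off node 2.
adj = [[0, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]
scores = leading_eigenvector(adj)
print([round(s, 2) for s in scores])  # core nodes score highest
```

On this toy graph the triangle nodes score highest and the pendant node lowest, which is the kind of structure a hairball drawing hides.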
So in a network transformed via our method, you are able to see core details that were not visible before. When we looked at those details, we saw that the clusters turn out to be functionally relevant. For example, the genes in one cluster were involved in glutamate metabolism significantly more than would be expected by chance alone.
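An enrichment claim like "more than expected by chance alone" is conventionally backed by a hypergeometric over-representation test: given how many category genes exist genome-wide, how likely is a cluster of this size to contain this many? The counts below are invented for illustration; the paper's actual numbers are not given in the interview.

```python
# Hypergeometric over-representation test: is a gene category (here,
# "glutamate metabolism") present in a cluster more often than chance
# would predict? All counts below are made up for illustration.
from math import comb

def enrichment_pvalue(N, K, n, k):
    """P(X >= k) when drawing n genes from N total, K of which are in the category."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Hypothetical: 4,000 genes total, 40 in glutamate metabolism,
# and a 20-gene cluster contains 5 of them (0.2 expected by chance).
p = enrichment_pvalue(N=4000, K=40, n=20, k=5)
print(f"p = {p:.2e}")  # a very small p-value: unlikely by chance
```

This is the standard calculation behind tools that annotate clusters with over-represented functional categories.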
These clusters are very obvious once you do the compression. Networks look very messy even in 3D, and a 2D hairball — which is usually what people have now — is very messy. So we need other methods to visualize and understand such networks, and that is what the paper is about.
We are expanding this work now, looking at other networks and seeing what information you can extract from one network versus another using methods like this.
In that paper, I think we only talk about E. coli gene regulation, but now we are looking at other organisms and other types of networks.
Do you make these methods available publicly?
Sometimes they are made publicly available; other times we describe how to implement the method in the paper. The code might be custom designed for what we do in-house, so other people would not be able to interface with it easily without a lot of other code. It is kind of like having the code for one little dialog box in Microsoft Word — it is not going to work until you have the rest of Word’s code.
Now, in some cases, we have a whole platform — as if it were a whole program like Word. We have a platform called OBOES — the Open Biomedical Ontology Enrichment and Search platform — that we are working on, and the whole platform is available for download. We put it on sourceforge.net [http://oboes.sourceforge.net], so it is open source. People can download it, modify it, and upload changes, and there is a forum where you can discuss or propose changes. In that case, the user may be a biologist or someone who knows a little bit of programming. So it really depends on the audience.