At A Glance:
Name: Robert Murphy
Title: Director, Center for Bioimage Informatics, Carnegie Mellon University, since 2004. Faculty, Departments of Biological Sciences and Biomedical Engineering, CMU, since 1983.
Background: Postdoctoral research associate, Charles Cantor’s laboratory, Columbia University Departments of Chemistry and Human Genetics, 1979-1983.
PhD, James Bonner’s laboratory, California Institute of Technology, 1980.
BA in biochemistry, Columbia University, 1974.
In September 2004, ProteoMonitor reported on Bob Murphy’s ongoing work in location proteomics, including his creation of an image database that allows scientists to determine at high resolution where proteins of interest are in cells (See ProteoMonitor 9/3/2004). Murphy’s work is now scheduled to be published in an upcoming special issue of the Journal of Biomedicine and Biotechnology. This week ProteoMonitor caught up with Murphy to find out more about his pioneering work in location proteomics.
How did you end up in the field of location proteomics? Do you have a background in computers or bioinformatics?
I actually am old enough that when I was an undergraduate, most schools didn’t have computer science degrees. I was very much interested in math, but ended up majoring in biochemistry and doing a PhD in biochemistry. While I was a graduate student, Caltech had a lot of computing facilities, so I started learning about what was going on in this relatively new field. I have maintained a strong interest and worked in computing, as well as in cell and molecular biology, ever since I was a graduate student. A good number of the things I did as a graduate student involved computational analysis.
I mostly focused on endocytic membrane traffic in my research from when I first came to Carnegie Mellon University, and used a lot of flow cytometry. One of my group’s focuses has been technology development for addressing cell biology questions, so we spent a lot of time working on methods for analyzing endocytic membrane traffic using flow cytometry. Flow cytometry is very quantitative, very well suited to computational and statistical analysis. And we did some work in that area, but I think more than anything else we developed a mindset that this is the way cell biology is supposed to be.
A confluence of things led to my deciding to move in the direction of applying similar types of methods to image data from cell biology, which included the fact that I was here at CMU in the Center for Light Microscope Imaging and Biotechnology, which Lance Taylor was the director of. There was an NSF-funded project to develop automated microscopes, and I signed on to that project to help develop automated ways of recognizing the patterns of subcellular organelles, so that one could have a microscope not only recognize those patterns but also do experiments and learn from the samples and so on, which was kind of the grand challenge of that particular grant.
This was maybe eight years ago. I [assigned] a graduate student to test the feasibility of doing automated recognition of subcellular patterns in microscope images – that was the first step. It was interesting because when I talked with a lot of my cell biology colleagues about it – about the future of cell biology, and the role of automated microscopy – I took an informal poll, and nine out of 10 of the cell biologists in that audience said that it wasn’t possible for computers to recognize organelle patterns – that was something you needed to train cell biologists to do. We already had some preliminary results suggesting that that wasn’t true – we had only done a small number of patterns, but enough to convince us that it was worth going forward. And we did. We did work on classifying patterns, and it certainly turned out to be the case that not only could the automated systems we developed recognize all major patterns, they could actually recognize them more accurately than humans could.
Did you have doubts that computers would be able to out-perform humans in pattern recognition?
Well, that’s a good question. I will say that when we designed the experiment that eventually showed that machines could do better than people, I had included two Golgi proteins with the expectation that the system wouldn’t be able to tell them apart – almost as a control, that it would only be able to see one pattern out of those two. So I wasn’t as optimistic as I should have been. But I was pretty optimistic that it was going to be able to recognize the basic patterns, and I think, again, that comes from awareness of the fairly significant advances that have been made in machine learning over the last 20 years, and also the concept that came from our flow cytometry work that there was a reasonable chance of being able to automate these kinds of things.
How does flow cytometry play into this research?
Well, there are so many examples in flow where people set out to distinguish two different populations of cells – stem cells, or two different types of lymphocytes – all sorts of examples where the proper combination of staining protocols, measurement techniques, and computational analysis allows you to distinguish things. Our mindset was that this should work. We had seen similar things work in non-biological domains and in flow cytometry, which is in some ways a similar area of research.
What direction has your research gone in since then?
The direction we’ve gone since then is trying to extend this to proteins whose patterns we don’t know anything about. There was the whole phase of supervised learning, where we knew what the pattern was and we were just trying to have the computer recognize it. Now we’re going to the unsupervised stage, where we say, ‘OK, here are a bunch of patterns – how many patterns are there in this?’ It’s very similar to what people who do clustering of microarray experiments do – they try to identify what kinds of proteins are present.
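The unsupervised stage – asking "how many patterns are there?" without labels – can be illustrated with a minimal k-means clustering of feature vectors. This is a generic sketch, not Murphy's actual method (his group used tree-based clustering, discussed later in the interview); the points and the choice of k are invented for illustration.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: partition feature vectors into k groups by
    repeatedly assigning each point to its nearest center and
    recomputing each center as its group's mean."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            [sum(p[d] for p in c) / len(c) for d in range(len(c[0]))]
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

# Two invented, well-separated "patterns" in a toy 2-D feature space.
points = [[0.1, 0.1], [0.12, 0.09], [0.9, 0.95], [0.88, 0.9]]
clusters = kmeans(points, 2)
```

On these toy points the algorithm recovers the two groups of two; with real image features the open question Murphy raises – how many clusters the data actually supports – is exactly what makes the unsupervised setting harder than the supervised one.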
What kind of proteins did you choose for these experiments?
We collaborated with two of our colleagues in the biological sciences department here – Jonathan Jarvik and Peter Berget. Jarvik had developed a technique for randomly inserting a GFP-encoding piece of DNA into cultured cells so that you would end up with a different tagged protein in each cell line, or clone. In many of those, nothing would happen, because the insert would land in non-coding DNA, but if it happened to sit down in an exon, and that gene was expressed in that cell type, then you would generate a GFP-tagged protein.
We had a project, the three of us, that involved tagging random proteins this way, then using RT-PCR to figure out what the tagged protein was, then collecting high-resolution 3D images and analyzing them using the same features we had shown were going to work for analyzing known patterns.
There was some effort there in identifying how best to choose the set of features for this unsupervised problem, and how to try to do the grouping appropriately for this particular case, and that’s what’s discussed in this recent paper [in the Journal of Biomedicine and Biotechnology].
We adopted an approach that is sometimes used in phylogenetics, which is to use what’s called a consensus tree to group the proteins. What you’re basically doing is clustering the mean of each protein, and the mean can move around depending on which particular images you have. One of the biggest challenges in this whole area is that there’s quite a bit of variation in cell shape, size, orientation, and so on. So when we say a pattern, that pattern is really a statistical construct, because no two cells are alike.
So that’s where consensus trees come in. What you do is take lots and lots of subdivisions of the images that you have, build trees, and try to find the tree that they can all agree on. And that’s what was done in this paper.
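The core idea behind the consensus approach – resampling the images, regrouping the proteins each time, and keeping only the groupings the resamples agree on – can be sketched in miniature. This is a heavily simplified stand-in for a real consensus tree: it only tallies which pair of proteins merges first across random image subsets, and all protein names and feature values are invented.

```python
import math
import random

# Invented per-image feature vectors for three hypothetical proteins.
# "protA" and "protB" share a similar location pattern; "protC" is distinct.
images = {
    "protA": [[0.10, 0.10], [0.15, 0.12], [0.09, 0.14], [0.12, 0.08]],
    "protB": [[0.13, 0.11], [0.10, 0.15], [0.16, 0.10], [0.11, 0.13]],
    "protC": [[0.90, 0.90], [0.85, 0.92], [0.92, 0.88], [0.88, 0.91]],
}

def mean(vectors):
    """Mean feature vector of one protein's images."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def closest_pair(means):
    """The two proteins whose mean patterns are nearest each other."""
    labels = list(means)
    pairs = [(a, b) for i, a in enumerate(labels) for b in labels[i + 1:]]
    return min(pairs, key=lambda p: math.dist(means[p[0]], means[p[1]]))

def consensus(images, rounds=100, seed=0):
    """Over many random image subsets, tally which protein pair merges
    first; the majority winner mimics a consensus grouping."""
    rng = random.Random(seed)
    votes = {}
    for _ in range(rounds):
        sub_means = {p: mean(rng.sample(vs, 3)) for p, vs in images.items()}
        pair = tuple(sorted(closest_pair(sub_means)))
        votes[pair] = votes.get(pair, 0) + 1
    return max(votes, key=votes.get)

winner = consensus(images)  # ("protA", "protB") on this toy data
```

Because each protein's mean shifts with the particular images drawn – the variability Murphy describes – only groupings that survive repeated resampling are trusted, which is the point of the consensus tree.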
We then built a web browser that allows you to click on any of the branches of the tree, and it’ll bring up the images that correspond to that particular protein.
What this does is, for the first time, give us an objective grouping of proteins by their location pattern. What people typically have done in the past is look at a picture and describe it in words, and then you can try to group the proteins based on the words. But that doesn’t really take into account that two proteins you both described as vesicular may not have anything to do with each other – they end up being grouped together just because of the words being in common. Then you could try to determine whether they are truly the same or not by doing some kind of co-localization experiment, but that’s a very labor-intensive thing to do.
The Gene Ontology consortium has developed a standard vocabulary for describing locations within cells, but when people are the ones assigning those words to proteins, and you then try to use those words to group the proteins, you are very heavily dependent on which particular words each person used to label each protein. Basically, what you end up with is only the major categories, because you can’t have any confidence in grouping them any finer than that.
What are you working on next?
What we’re trying to do now is work on ways to decompose a pattern into its sub-patterns. Right now, if we had one protein that was in the Golgi, another protein that was in the ER, and a third protein that was in both the Golgi and the ER, those would be considered three different patterns. We’re working on ways to recognize that the third pattern is actually composed of the other two as sub-patterns.
Closely related to that is to be able to build generative models of how the proteins are distributed. By taking a bunch of the images, and grouping them, now the question is, can you synthesize new images that show what the pattern looks like? For systems biology, we need to build models of how cells work, and location is crucial for how cells work, so we need to be able to have a way to generate a synthetic cell that has the proteins in the right places, so that’s a generative model. That’s one of our challenges.
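Murphy's generative-model goal is to synthesize realistic images of a location pattern. As a much simpler stand-in for that idea, the sketch below fits an independent Gaussian to each dimension of some invented feature vectors for one pattern and then samples new synthetic vectors from the fitted model; real generative models of cell images are far richer than this.

```python
import random
import statistics

# Invented feature vectors observed for one location pattern.
observed = [[0.30, 0.70], [0.34, 0.66], [0.28, 0.72], [0.32, 0.68]]

def fit_gaussians(vectors):
    """Fit an independent Gaussian (mean, stdev) to each feature dimension."""
    dims = list(zip(*vectors))
    return [(statistics.mean(d), statistics.stdev(d)) for d in dims]

def sample(params, rng):
    """Draw one synthetic feature vector from the fitted model."""
    return [rng.gauss(mu, sigma) for mu, sigma in params]

params = fit_gaussians(observed)
rng = random.Random(1)
synthetic = sample(params, rng)  # a new vector resembling the observed pattern
```

The systems-biology motivation is exactly what Murphy states: once a pattern is captured as a model rather than a pile of example images, a simulated cell can place proteins according to that model.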
Another challenge we’re working on is how to generalize about location from cell type to cell type. You train these systems with images of a particular cell type, but if you then want to ask whether this will work with a new cell type – of course it’ll work if you collect the same kinds of training images for the new cell type – you don’t want to have to keep retraining for every cell type. So one of the things we’re working on is ways to generalize from cell type to cell type.