Professor of biostatistics and computational biology
At A Glance
Name: John Quackenbush
Position: Professor of biostatistics and computational biology, Dana-Farber Cancer Institute, since March 2005.
Background: Investigator, the Institute for Genomic Research, 2002-2005. Associate, assistant investigator, 1997-2001.
Research associate, Stanford Human Genome Center, 1994-1996.
Staff scientist, the Salk Institute for Biological Studies, 1992-1994.
Postdoc in experimental particle physics, University of California, Los Angeles, 1990-1992.
PhD in theoretical particle physics, University of California, Los Angeles, 1990.
The Dana-Farber Cancer Institute said last week that it will be establishing a new proteomics center using a $16.5 million gift from John "Jack" Blais and Shelley Blais (see story). ProteoMonitor spoke to John Quackenbush, who will be heading up the computational arm of the new center, to find out about his background and his ideas for the new center.
What is your research background, and how did you end up heading up the computational arm of the new Dana-Farber proteomics center?
That's sort of an interesting trip. My background is actually not in biology. My background is not in computer science or statistics. My background is in theoretical physics. I did my PhD in theoretical particle physics. So I used to sit in a dark room all alone and do obscure mathematics.
I finished my PhD in 1990, and… [around that time] physicists had made this strategic marketing error. They'd sold physics as being necessary for fighting the cold war, even though nobody I knew did anything even vaguely related to defense. But it was really an easy way to get funding for physics.
In 1990, when the Soviet empire collapsed, and the Berlin Wall came down, funding for physics in the United States more or less evaporated overnight. You might remember something called the Superconducting Super Collider. It's this big particle accelerator they were going to build in Texas. They dug about a third of the 54-mile tunnel, and then stopped, because the money got cut. So really I was part of the lost generation of physicists, and I had to figure out what to do with my life.
So, about the same time all this was happening, it turns out I had a girlfriend who was a biologist. And since I was a theoretical physicist, I could sit in her office or anywhere else and do my work, while she was working in the lab. And so I learned that all this biology had been invented since I had taken my last high school biology class.
When I finished my PhD, I was having trouble getting a job because the job market was getting really tight. I did my postdoc in a field called phenomenology — a field that straddles the border between theory and experiment. But then in 1992, even that money disappeared, so I was really strapped for something to do.
I got very excited and very interested in all this biology — especially molecular biology, molecular genetics, and molecular evolution. I had audited a number of courses and thought this might be a good area to go into. I was encouraged by the people I had met in biology.
What happened was, about the same time, the Human Genome Project was getting off the ground. And people at the NIH, at what was then the National Center for Human Genome Research — now NHGRI — they realized that in order for the Human Genome Project to be successful, they would have to bring in people from engineering, computer science, mathematics, physics, chemistry — to contribute their expertise. And in particular, one of the things I realized about physicists was that physicists were very good at working on large-scale projects, and projects in which there were many people working together to achieve a scientific goal.
Why is that?
Well, look at experiments that are done at particle accelerators, where you have 300 authors, or more. Compare that to the way in which biology is usually done — if there were more than three or four authors on a paper, people used to look at it very skeptically and try to figure out if anybody actually did any work.
So it's sort of a different culture. But to do genome-scale science, one of the things you realize is you need people who have different areas of expertise to come together.
I applied for and received one of these fellowships — a career development award — from the Genome Institute at the NIH, and in my application, what I said I really wanted to do was bring the quantitative skills I'd developed in the physical sciences to bear on problems in biology. And I really made it explicit that I wanted to learn biology in the process. So I spent two years working with a group at the Salk Institute doing physical mapping on human chromosome 11. I did lab work, and I did a lot of work to build the tools we use to analyze the data.
Then I went to Stanford, where I worked with Rick Myers and Dave Cox. They were doing largely radiation hybrid mapping of the human genome, and I got involved with that. But my real project there was working on a pilot project to sequence regions of human chromosome 21. And again, my work spanned the laboratory and the laptop. I mean, we were working on trying to build the infrastructure to do sequencing, so we were using a novel approach to try to minimize the number of sequencing runs we did, and maximize the information per read.
I worked to develop the laboratory protocols, and to get the lab set up, but also to really deal with issues in managing the data. After spending two years there, with one year left in my fellowship, I realized it was time to sort out what I wanted to do with my life.
It was pretty clear the genome was going to get sequenced. It was pretty clear I wasn't one of the people that was going to get anointed to do that. So I sort of stood back and surveyed the field and asked where the next big challenges were. To me, it seemed like the challenges were really in the interpretation of the genome sequence.
About that time, microarrays were a technology that was really just getting its start. This was around 1996. So I started looking for jobs in different places. When I went and interviewed, I said really what I'd like to do is build a program to analyze patterns of gene expression.
I was offered and accepted a job at TIGR — the Institute for Genomic Research in Rockville, Md. I spent eight years there. What I did was I built a program around gene expression profiling. For me, that really meant a combination of laboratory and computational approaches. It meant doing the experiments, collecting the data effectively in the lab, but then also building the infrastructure to collect, analyze, and manage the data. And to tie it back to the biology.
I think one of the mistakes we often see in bioinformatics is that the analysis, collection, and management of the data is often divorced from the process that generates it. If you don't understand what the experiments are, what the questions the researchers are asking, how the instruments that generate the data function, I firmly believe that your ability to analyze that data is compromised. Because you don't know whether subtle signals you're seeing are really reflective of the biology, or reflective of some systematic problem of the technology.
My group has gotten really well known for building software and building databases that can be applied to analyzing gene expression. But in fact, I think a lot of the value in these tools comes from the fact that they were developed in close partnership between laboratory biologists and computational biologists.
In March of this year, I moved to the Dana-Farber. A big part of the reason is over the last eight years, my work has been increasingly focused on understanding mechanisms of human disease. And it's been increasingly focused on trying to apply our understanding to developing a better picture of what's happening in human cancers.
To me, when I was offered the job here at Dana-Farber, it was absolutely an outstanding opportunity to get involved in a place where the resources and the commitment existed to look at human diseases, and to look at cancer.
I joined the faculty here in March 2005. I think the interest in recruiting me here was to build capabilities here in computational biology. My appointment here is in the department of biostatistics and computational biology, but, in fact, I'm probably the only person in that department ever to ask for lab space. So my group is actively doing lab work.
What projects is your group working on?
We've continued largely to focus on gene expression profiling. Part of the reason for that is that the technology is much more mature than proteomic technology. Mass specs have been around for a long time, but the ability to look at thousands of proteins is still evolving.
One of the exciting things about coming here was the fact that the Dana-Farber was also very interested in building a program in proteomics. So they recruited Jarrod [Marto], who I think is one of the most talented young guys in proteomics. They were interested in soliciting funds to develop a proteomics center, which they did quite successfully. Here at the Dana-Farber we owe a great debt to Jack Blais and his family for contributing the funds.
But really this gives us the opportunity to take information gathering and data analysis to the next step. If you think about a cell, the cell is basically a machine made out of proteins. By looking at RNA, we're able to see a lot of changes and a lot of differences between different cell types or different disease states, but we're really looking at the surrogate for what we really want to find — the proteins.
In many ways, proteins are a much better window for the biology of the cell than RNA is. But, one of the things my group and I have been very successful at is integrating data from diverse domains.
So one of the things that's most exciting now about the growing capability we have here to do proteomics is that, yes, we can develop mechanisms for collecting this [proteomics] data, and doing it in an automated fashion, at least up to a stage, but in addition, we can start to look at bringing together information from gene expression profiling on the same patients, or the same samples. We can bring together clinical information, and information gleaned from the literature about the underlying biology of these processes that we're studying. We can really try to make out of all these diverse pieces something [that] is more useful and better integrated and more faithful to the biology than we could by looking at each individual piece on its own.
Has your group developed software to integrate the different types of data?
We've developed software to do that kind of integration, working from DNA microarrays towards proteomics, and what this center is going to allow us to do is now to move more fully into looking at proteomics, and integrating that with other information.
Really, the same types of techniques that we use to analyze the expression data — once we get down to which proteins are expressed in which situation — are really very similar. We have a big matrix, which [includes] samples by proteins, instead of samples by genes, and we have a quantitative measure for the expression of those proteins in each one of those samples.
So computationally, the data — once we get beyond the fact that it's mass spec, and not gene expression profiling data — when we're really doing the interesting and challenging meta-analysis, that's where the techniques that we've developed for expression arrays are really very similar to the techniques that you want to use with proteomics.
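[Editor's illustration: a minimal sketch of the "big matrix" idea, not code from Quackenbush's group. The sample names, protein labels, abundance values, and the function log2_fold_changes are all hypothetical. The point is that the same routine works whether the columns are proteins or genes.]

```python
from math import log2
from statistics import mean


def log2_fold_changes(matrix, features, group_a, group_b):
    """Return log2(mean of group_a / mean of group_b) for each feature.

    matrix maps a sample name to a list of abundance values, one per
    entry in features. The features can be proteins or genes; the
    calculation does not care which.
    """
    result = {}
    for j, feature in enumerate(features):
        a = mean(matrix[sample][j] for sample in group_a)
        b = mean(matrix[sample][j] for sample in group_b)
        result[feature] = log2(a / b)
    return result


# Hypothetical samples-by-proteins matrix (rows: samples, columns: proteins).
proteins = ["P1", "P2", "P3"]
abundance = {
    "tumor_1": [8.0, 2.0, 4.0],
    "tumor_2": [8.0, 2.0, 4.0],
    "normal_1": [2.0, 2.0, 8.0],
    "normal_2": [2.0, 2.0, 8.0],
}

changes = log2_fold_changes(
    abundance, proteins, ["tumor_1", "tumor_2"], ["normal_1", "normal_2"]
)
print(changes)
```

Swapping the protein labels for gene names and the abundances for array intensities would leave the function unchanged, which is the similarity described above.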
We've been very interested in trying to tease information about networks and pathways out of microarray data. One of the things that we've managed to do is to integrate microarray data with data that comes from proteomic studies.
Marc Vidal here at the Dana-Farber has been generating protein-protein interaction data. What we can do is we can look at the information that comes from these protein-protein interaction screens, and that gives us a list of which protein interacts with which other protein. So, again, we can build a big matrix of all the proteins and all the proteins that they interact with. It's a proteins-by-proteins matrix.
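[Editor's illustration: one way a proteins-by-proteins matrix could be built from a list of interacting pairs. The protein names and the function interaction_matrix are hypothetical, and the sketch assumes interactions are undirected.]

```python
def interaction_matrix(pairs):
    """Build a symmetric 0/1 proteins-by-proteins matrix from pairs.

    Returns the sorted list of protein names and a square matrix where
    matrix[i][j] == 1 means protein i was reported to interact with
    protein j.
    """
    proteins = sorted({name for pair in pairs for name in pair})
    index = {name: i for i, name in enumerate(proteins)}
    size = len(proteins)
    matrix = [[0] * size for _ in range(size)]
    for a, b in pairs:
        # Assumption: an interaction screen hit is treated as undirected.
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1
    return proteins, matrix


# Hypothetical screen results: A binds B, and B binds C.
pairs = [("A", "B"), ("B", "C")]
names, matrix = interaction_matrix(pairs)
for name, row in zip(names, matrix):
    print(name, row)
```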
We can use the skeleton of interactions that comes out of that, along with microarray expression data — and the approach we've been using is one called Bayesian networks — to try to learn or guess what the likely interactions are that are expressed both in the protein-protein interaction data, and the microarray data.
One of the things we've discovered in doing that is that the two datasets together, even though they are from very different domains, are much more powerful, and show us more real, known interactions, than any dataset by itself. If we bring the datasets together, we get a much better picture of what's happening, and what the real networks look like.
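[Editor's illustration: this is not the Bayesian network method described above. It is a much simpler stand-in that shows the general idea of combining two data types: candidate edges come only from the interaction skeleton, and an edge is kept only if the expression profiles of the two proteins are strongly correlated. The names, data, function names, and the 0.8 threshold are all hypothetical.]

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def supported_edges(skeleton, expression, threshold=0.8):
    """Keep skeleton edges whose expression profiles correlate strongly.

    skeleton: list of (protein, protein) pairs from an interaction screen.
    expression: protein name -> expression values across the same samples.
    threshold: arbitrary cutoff on |r|, chosen here for illustration only.
    """
    kept = []
    for a, b in skeleton:
        r = pearson(expression[a], expression[b])
        if abs(r) >= threshold:
            kept.append((a, b, r))
    return kept


# Hypothetical data: A-B and A-C were both reported by the screen,
# but only A and B move together across the four samples.
skeleton = [("A", "B"), ("A", "C")]
expression = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [1.0, -1.0, -1.0, 1.0],
}

edges = supported_edges(skeleton, expression)
print(edges)
```

A real Bayesian network approach would score whole network structures against the data rather than testing edges one at a time. The filter above only conveys why an edge supported by both datasets is more credible than one seen in a single dataset.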
So what we want to do is analyze the data that's going to be coming from this new proteomic center, and bring it together with other information in the same way.
One important question is, if we see proteins being differentially regulated, are they proteins that are known to interact? Do they form complexes in the cell? Can we map them back to certain functional classes? Are they genes involved in energy metabolism? Are they genes that are localized to the mitochondria?
The more of that type of information we can integrate into our analysis, the more complete picture we're going to have. So microarray data is one type of data we can layer on, but, in fact, we have hundreds of years of biological knowledge that we can use as a tool to home in on the real signal in the proteomics data.
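[Editor's illustration: a common way to ask whether differentially regulated proteins fall into a functional class more often than chance is a hypergeometric test. The interview does not say which test the group uses, so this is an assumption, and the protein identifiers and annotation sets are hypothetical.]

```python
from math import comb


def enrichment_p_value(total, annotated, selected, overlap):
    """Hypergeometric upper tail: P(at least `overlap` annotated hits).

    total: number of proteins measured
    annotated: how many of those belong to the functional class
    selected: how many proteins were called differentially regulated
    overlap: how many selected proteins belong to the class
    """
    upper = min(annotated, selected)
    tail = sum(
        comb(annotated, k) * comb(total - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(total, selected)


# Hypothetical example: 20 measured proteins, 5 annotated as mitochondrial,
# 5 called differentially regulated, 3 of which are mitochondrial.
measured = {f"P{i}" for i in range(20)}
mitochondrial = {"P0", "P1", "P2", "P3", "P4"}
changed = {"P0", "P1", "P2", "P10", "P11"}

p_value = enrichment_p_value(
    len(measured),
    len(mitochondrial),
    len(changed),
    len(mitochondrial & changed),
)
print(p_value)
```

A small p-value suggests the functional class is over-represented among the changed proteins. In practice many classes are tested at once, so a multiple-testing correction would also be needed.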
For the new proteomic center, will you be advising people on what kinds of experiments they might want to do to find a certain type of result?
Part of what my group is going to end up doing is to try to help people sort out the best way to do their experiment, and the best way to analyze data. But that piece of it is really a research goal.
With Tim [Yeatman at the Moffitt Cancer Center in Florida] we're working on breast cancer, but here we're working with a number of people on different diseases. This is going to be an opportunity to generate proteomic data on those diseases.
We're going to be in a position to try to develop new methods for analyzing this data. We have a lot of templates we can use to really try to understand what's happening. We have a draft genome sequence, and we can start to link the protein profiles back to the genome. We can tie it in with genetic information about development of the disease, and SNP data. There's just a whole host of data that's available.
Really one of the things everybody in my group is excited about is the idea that as we generate this data, we can work to develop new methods to learn new things that we couldn't otherwise.