John Quackenbush is a professor of Computational Biology and Bioinformatics at the Dana-Farber Cancer Institute. He also heads Dana-Farber's Center for Cancer Computational Biology, which opened earlier this year.
The CCCB has both a research focus and a support role. Quackenbush said that about 80 percent of the center's funding is dedicated to support, which he and his colleagues refer to as "collaborative consulting" because it is run very much like a professional consulting service.
The CCCB offers bioinformatics support for a range of applications but has a particular focus on microarray analysis. The center's key software tools include the MultiExperiment Viewer, or MeV, an application designed to find differentially expressed genes in microarray data sets; the Automated Microarray Pipeline, or AMP, which is used to perform normalization for microarray experiments; and GCOD, a collection of publicly available microarray gene expression data on Affymetrix GeneChip arrays related to human cancers.
BioArray News recently spoke to Quackenbush about his work at Dana-Farber and the CCCB. The following is an edited version of the interview.
Who has been using the CCCB?
It's been interesting. We initially thought we'd be overrun by laboratory biologists — people generating array and sequencing data. We thought we'd have to work to get clinical scientists in the door. We discovered that a lot of lab-based research people have found solutions over the past few years. We are seeing more physicians that are trying to integrate genomics into their clinical trials. Surprisingly, we see a lot of people doing molecular pathology; people analyzing tissue array data and not knowing how to analyze it. We have formed a partnership with Max Loda [a professor of pathology at Dana-Farber]. He's done a lot of tissue microarray analyses using software that gives you digital output. We get quantitative measures of different stains on these TMAs. We have developed new methods for analyzing those data that people who do molecular pathology are very excited about.
It will be interesting to see how this evolves, but I think going forward we'll be able to provide support to do their analysis. There's a wealth of software and it isn’t easy to use. So, it's not hardware or software that’s limiting. It's grayware — having the right minds to take advantage of data.
You developed the microarray software suite TM4. I noticed that has recently been updated. Can you give me an update on that resource?
We are funded by the National Library of Medicine to develop this suite of software tools. It's incredibly heavily used. We had 25,000 downloads last year, so it's really been a phenomenally successful tool because it addresses the challenges people face in developing software tools. Any really successful software tool has to be useful and it has to address important questions and use methods that give researchers insight into underlying biology. It should also be packaged so that people can use it easily. We have been doing regular updates on the software and incorporating methods that are becoming default standards for doing analysis.
For example, we have been using Bayesian networks to predict how a system will respond if you disturb it a particular way. I think part of the reason TM4 has been so successful is that when developing the tools, we ask the question, 'For a naïve user, someone sitting in a small lab in a state university in the Midwest or somewhere in India, will this be simple and intuitive enough for them to analyze data, get results back, and interpret it?' We are also thinking about how to maintain MeV and how to move it forward.
Another tool that has been enormously successful is Bioconductor, but those tools are idiosyncratic and hard to use. We are now trying to create tools to make MeV serve as a front end for Bioconductor. We are also trying to use [the Bioconductor module] LIMMA [Linear Models for Microarray Data] as a flagship to help people who don’t have analytical acumen to use the tools. That should be released in December. We are developing a way for people to plug MeV into Bioconductor as a front end.
Are you developing any other tools?
We're working to develop a data normalization front line called AMP. We are also working on [the gene expression signature database] SigDB with Aedin Culhane [a research associate at Dana-Farber]. It addresses an important problem in the community.
I have been involved with [the Microarray Gene Expression Database Society] since its inception working to establish standards for reporting array data. The success is mixed in how data is reported. We worked to establish standards. That’s done to a greater or lesser extent. The raw data is now being gathered in places like [the Gene Expression Omnibus] or ArrayExpress. But it has to be annotated and curated if we want to do a further analysis. One thing we discovered a while back is that a lot of published results from using these data, typically in the form of gene signatures, aren’t available in a standard format anywhere. The best example is a signature like MammaPrint, which evolved into a diagnostic application but doesn’t appear in any public databases. It's not in ArrayExpress or GEO. Those repositories aren’t set up to capture the results of an analysis. If you look for a signature like that, you'll discover those signatures exist in figures or tables or supplemental data, but they are never in any easily computable format. They are also presented in all sorts of non-standardized formats. You end up with a Tower of Babel in trying to understand what people have already discovered.
We started a project where we got high school students to come in and standardize the signatures and put them in a database where we'd be able to do searches and compare them. That database was released this month. We are only now beginning to understand the underlying biology. But maybe we could get a better understanding of relationships between common diseases in ways that we were unable to do before in a comprehensive way.
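To illustrate the kind of standardization, searching, and comparison described above, here is a minimal Python sketch. It is not the CCCB's implementation: the function names, the signature names, and the placeholder gene symbols are all hypothetical, and uppercasing symbols is only a stand-in for real curation, which would map each gene to a stable identifier.

```python
def normalize_signature(raw_genes):
    """Standardize a published gene list (hypothetical helper).

    Strips whitespace, uppercases symbols, and drops blanks and duplicates.
    Real curation would also map symbols to stable gene identifiers.
    """
    return frozenset(g.strip().upper() for g in raw_genes if g.strip())


def jaccard(sig_a, sig_b):
    """Overlap between two signatures: |intersection| / |union|."""
    union = sig_a | sig_b
    return len(sig_a & sig_b) / len(union) if union else 0.0


def search(database, gene):
    """Return the sorted names of all signatures that contain the gene."""
    gene = gene.strip().upper()
    return sorted(name for name, sig in database.items() if gene in sig)


# Placeholder data: these are not real published signatures.
database = {
    "signature_A": normalize_signature([" gene1", "Gene2", "GENE3 "]),
    "signature_B": normalize_signature(["gene2", "gene3", "gene4", ""]),
}

print(search(database, "gene2"))
print(jaccard(database["signature_A"], database["signature_B"]))
```

Once every signature is stored as a normalized set, both searching for a gene and comparing two signatures reduce to simple set operations, which is what a figure or a supplementary table cannot offer.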
So we build tools, but it's based on the interest we have in being able to address fundamental biological questions. Advancing the field requires that we make these tools available to the rest of the community.
Since you are doing consulting, what should the computational biology community make of all the free resources that become available almost on a weekly basis? I'm told some are good and some are not so good.
A lot of the tools that get published never have a big impact on the field and a lot of methods that seem hot for a few months end up being not that useful. There's a whole host of tools and whole host of methods, but we often fall back on very standard approaches. Almost any genomic technology will give you a samples-by-measurements matrix. At that point, one question is a statistical question – what are the measurements that are different between the phenotypic groups we have? At that stage, what you are doing is pretty standard statistical analysis. The next step is taking those measurements and trying to interpret them biologically. The interpretation of data is almost something that is project by project; it's rare that some existing set of tools is available. There are growing numbers of tools, but the process of aligning those things together is very labor intensive and it's a manual enterprise.
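The "standard statistical analysis" step on a samples-by-measurements matrix can be sketched as follows. This is a minimal illustration, assuming two phenotypic groups and using Welch's t-statistic as one common choice; it is not the method used by the CCCB or MeV, and the function names and toy values are hypothetical. A real analysis would also compute p-values and correct for multiple testing.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for two lists of values (unequal variances).

    Each group needs at least two values, and the two groups must not
    both have zero variance, or the standard error will be zero.
    """
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def rank_measurements(matrix, labels, names):
    """Rank measurements by how strongly they differ between two groups.

    matrix: list of rows, one per sample; columns are measurements.
    labels: one phenotype label per sample (exactly two distinct labels).
    names:  one name per measurement column.
    """
    groups = sorted(set(labels))
    if len(groups) != 2:
        raise ValueError("expected exactly two phenotypic groups")
    scores = []
    for j, name in enumerate(names):
        a = [row[j] for row, lab in zip(matrix, labels) if lab == groups[0]]
        b = [row[j] for row, lab in zip(matrix, labels) if lab == groups[1]]
        scores.append((name, welch_t(a, b)))
    # Largest absolute t-statistic first.
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)


# Toy data: six samples, two measurements (values are made up).
matrix = [[1.0, 5.0], [1.2, 5.1], [0.8, 4.9],
          [3.0, 5.0], [3.2, 5.2], [2.8, 4.8]]
labels = ["normal"] * 3 + ["tumor"] * 3
for name, t in rank_measurements(matrix, labels, ["geneA", "geneB"]):
    print(name, round(t, 2))
```

In this toy example, geneA separates the two groups and ranks first, while geneB does not. The biological interpretation of such a ranked list is the labor-intensive, project-by-project part described above.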
The most creative things come when you look at integrative analysis. How do you take microRNAs and RNAs and put them together? There's not a good consensus for how to do those kinds of things. So we are seeing a lot of queries that require adaptation of standard tools. Cutting-edge things are cutting edge because there's no consensus in the community about how to do certain things.
CCCB has a research component. We should be forward thinking and generate datasets that require us to look at new methods. We are trying to put the pipelines in and get a sense of the tools we need to proactively address the problems people will face when generating those data types.
You've been involved in this field now for some time. Are there still the issues with inter-platform data comparability that existed in previous years?
Victory has been declared multiple times. The last time I addressed this with my own work was in 2005. There were three papers that appeared [in an issue of Nature Methods] and they all came to the conclusion that arrays give you insight into fundamental biology across all platforms. I think part of the problem is that often when people talk about comparability between different platforms we conflate two different arguments. If you take two samples and profile them across different platforms, you can deduce the underlying biology across those platforms. They give you consistent results. The real challenge is not arrays but a more fundamental challenge when you look across platforms between studies. When we do that we are looking at different populations and different things. The big challenge comes down to the fact that any disease we've looked at is more heterogeneous than we recognized even a short time ago. There are definitely underlying mechanisms and causes behind diseases. The reason signatures are different is a reflection of the fact that we are looking at different populations, different subpopulations, and different cohorts where there are effects that create differences between different groups. Part of the reason we have a hard time looking between studies is a failed implementation of [the Minimum Information about a Microarray Experiment guidelines]. People don’t do a good job reporting what they are seeing. Datasets aren’t reported with the annotation that we should be seeing.
We are really running into the fact that disease is complex and there are a lot of factors that impact any disease. Diseases are very individual. To understand the core elements, we have to look at common threads that run through these diseases. Gene SigDB allows us to do that so we can look at commonalities between them.
Another issue is integration of gene expression, protein expression, and other data. Maybe you can give me an update on your research in this area?
There are a number of different people who are making some progress. A lot of studies have focused on a single dataset in a single cohort and there are a few large integrated datasets where data has been collected on the same individuals at the same time. What a lot of people have tried to do is take miRNA data from one study and look at mRNA data from another study; that doesn't work. In order to do this you have to be able to generate datasets on the same underlying group of patients. We are starting to see that done using binary combinations, such as copy number and gene expression data, or SNP profile data and gene expression data. The challenge is taking those and trying to link them together. I think it's still plagued by the gaps that exist in our understanding of the relationship between these things. When we look at the problem of data integration, we need not only more data and more integrative data but more types of data to integrate. We have to think of these data in a biological context. I hate the term systems biology because I don’t know what it is. It’s a term they use to make their research sound important or trendy and fundable. But at the end of the day, what we want to be able to do is apply prior knowledge about biological systems as a guide to figure out the problem of data integration.
We have been trying to take a more pragmatic approach, one motivated by my background in physics: phenomenology. We want to use experiments to generate data to generate observations. We want to use those data to build models. We want the model to capture the essence of experimental data. The question is not whether the model is wrong or right, but whether it is useful and can be used to make predictions that can be validated. If we can do that, we've got something that’s useful. That's where we've focused our research, to see if we can take existing datasets and use their insight to build the most useful models and use those models to make useful predictions.
What is your assessment of the expression-based diagnostics that have come out so far from firms like Roche, Pathwork Diagnostics, and Agendia? Will there be more chips like these, or is there a better way expression data could be applied clinically?
It's hard to tell. My answer depends on the day on which you ask me. I think there will be more genomic-based assays that come to market, whether they are developed for diagnostic and prognostic use or trying to stratify patients. It's been interesting to see what has happened on Harvard's medical campus. I would have thought that Agendia's MammaPrint 70-gene signature would become standard. But Genomic Health's Oncotype Dx has taken over as part of the standard of care for women who have node-negative breast cancer. It's turned out to be part of the criteria in managing healthcare.
There are definitely more of these tests on the horizon. Part of the challenge is moving from an initial cohort to validating them in a larger cohort. I wish that people who are trying to validate these tests in larger cohorts would use more comprehensive technologies. What I would love to see is clinical trials that are ongoing looking at MammaPrint where they will score 10,000 patients in overlapping trials in Europe and the US. If we had full-genome, gene-expression data on those patients, we could come up with a more robust predictor. But the way trials are run with something like that will be to confirm the signature. It's harder to get funding and approval to generate that kind of whole-genome data. I would almost bet that once you get those kinds of population sizes with data collected in a standardized way, you could overcome problems in generating those kinds of signatures.
So, today, I think these will become the standard of care in a large number of diseases. But on a different day, I also start to think about the next great technology. We all talk about next-generation sequencing and its future. I really see NGS being adopted, whether the current generation or the next one. As these technologies improve, the cost of generating whole-genome sequencing data is falling. I don’t think we are that far away from the $1,000 genome that so many people talk about. The growing consensus is that the $1,000 genome is only five or six years from now. That’s a game changer. I think these sequencing technologies can be used once proven robust and reliable in a clinical setting to resequence tumors or to profile gene expression. I think they will become ubiquitous medical research tools.
In the next few years we will have thousands, if not tens of thousands, of individuals sequenced. As we understand the relationships we see, understanding of normal and abnormal, as we start to build up that body of knowledge and understanding, we'll be in a position to exploit that data. Array-based technologies have a useful horizon of 10 years to 15 years, but, at least at the DNA level, these approaches will be supplanted by next-gen sequencing technologies.
Perhaps you could provide me with an update on your next-gen sequencing-related projects.
There are two answers to that. One is happening through the CCCB. We really wanted to have a strong research focus. The group is also moving more towards doing research on cancers that affect women. Two years ago we got a grant from Aid for Cancer Research to buy a next-gen sequencer. We worked out a lease agreement with Illumina for one of their [Genome Analyzers]. That sequencer was installed in June and we wanted to start offering sequencing as a service as we started the process of doing our own sequencing projects.
The interesting thing we've seen is that the largest number of projects where people have asked us to generate data has been in ChIP-seq applications. In hindsight, it makes perfect sense. It’s the one application where there is really no one good competing technology. To really get that information about where transcription factors bind, you have to use genome-wide tiling arrays. The cost of one of those arrays is comparable to a lane on one of those next-gen sequencers. I think the greatest number of papers that use NGS will be in ChIP-seq or epigenomics where array technologies are too costly.
At the same time, we are generating sequence data using the Illumina GA and Helicos Bioscience's HeliScope on an individual with ovarian cancer. The good news from our perspective is the patient on whom we are generating this data is a patient for whom we also have gene-expression data. So we can go back and look at the correlation between the two using RNA-seq, whole-genome sequencing, and expression-profiling data collected using array technology. We are currently in the process of generating sequence data using HeliScope and the GA.
What are some of the main challenges with NGS?
Well, we have seen people apply for sequencing instruments and not realize that the data-management aspects are the most challenging they will address. We are in a good position to do that. We haven’t faced it yet, but we have to think about whether we will store the images in the long term and how valuable they are.
It reminds me of when people were storing gel files using the Sanger technology. The truth is that I don’t think anybody ever did, but the trace files have been useful. For NGS, we'll have to reach some kind of middle ground. Saving all the images is out of the question unless they get dumped to archives or tape or DVD. But I don’t think anybody will go back and use primary data. It’s the intermediate data we'll have to store and that’s not a trivial problem. But this is why I get so excited about doing biology right now, because of the evolution of biology from being a laboratory science to an information science. Those who will move forward in making advancements in technology will be those who are the best at collecting and analyzing and storing data.
I think that the biological sciences are where physics was in the 1870s and 1880s. There is a huge transformation in our understanding of the physical world going on, all born out of scientists' ability to collect large datasets, as the generation of large bodies of data drives the acceleration of the field. We are now at that point with biology. We have technology that generates a large amount of data. The cost of generating that data is falling, so we can generate more. We are democratizing the generation of data and the mission of CCCB is to democratize the analysis of data. We want people to recognize that there is value in generating, sharing, and analyzing data effectively.