Peter Park, a biostatistician at the Harvard-Partners Center for Genetics and Genomics and the Children's Hospital Informatics Program, co-authored a paper last month that generated quite a bit of interest in the bioinformatics community.
The paper, "Comparative analysis of algorithms for identifying amplifications and deletions in array-CGH data," published in the Oct. 1 issue of Bioinformatics, was the first attempt to get a handle on a growing suite of methods for analyzing data from microarray-based comparative genomic hybridization. In the paper, Park and his colleagues evaluated 11 different methods for array-CGH analysis that have been published since 2003: CGHseg, Quantreg, CLAC, GLAD, DNAcopy, aCGH, Waveslim, Stats, Charm, aCGHSmooth, and CGH-Explorer.
While the study didn't identify any overwhelming favorites, Park and his co-authors wrote that CGHseg and DNAcopy performed "consistently well."
BioInform spoke to Park about the bioinformatics challenges of array-CGH data, similarities between array-CGH analysis and gene expression analysis, and what his plans are for studying other array-based experimental methods, such as ChIP-chip and tiling arrays.
What's unique or challenging about array-CGH data from a bioinformatics perspective as compared to, say, gene expression data?
In terms of bioinformatics, the central issue is that the data has some spatial structure, so the question is how to best use the spatial structure that's in the data. In gene expression arrays, we don't really care about where those genes are, and each of the probes is basically on its own. But for array-CGH, because we are mapping the probes to their physical locations in the genome, where each probe sits actually matters. So the algorithms are a little bit different, in order to make the best use of that information.
So the initial data processing steps are very similar to those for expression arrays; the difference is just incorporating that spatial information into the algorithm. The motivation for our study was that we had some array-CGH data from a collaborator that we had to analyze, and if you look at the literature, it's very confusing because there are many studies that have come up with new algorithms, and this is very recent. I would say in the past two years there have been 10 to 20 papers proposing new algorithms, but it's not clear which method performs the best because every paper compares its method against a very simple method. So we wanted to do a comprehensive study of all the available algorithms out there.
So we looked at 11 different algorithms, and a lot of them are very similar. A lot of the algorithms actually are sort of reformulations of problems from other fields — for example when people look at image analysis, they have to account for the spatial information, so these are things that were borrowed from different areas.
The paper seemed to divide these algorithms into two broad approaches — it mentioned a smoothing approach, and then estimation.
Actually, there are two components that should go into the algorithm. You have to smooth the data first, and then you have to run the right algorithm on the smoothed data. Some people concentrated on the smoothing step, and other people, without smoothing the data properly, just concentrated on the segmentation part. Because there are many choices for each of the two steps, you have to do the right thing at both steps to have the best method.
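The two-step idea described above (smooth first, then segment) can be illustrated with a minimal Python sketch. It is not any of the 11 published methods: the running-median smoother, the fixed threshold of 0.3, and the function names are assumptions chosen only to keep the example short.

```python
from statistics import median

def running_median(values, window=3):
    """Smooth log2 ratios with a centered running median (window should be odd)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed

def call_segments(smoothed, threshold=0.3):
    """Group consecutive probes with the same call into (start, end, call) tuples."""
    def call(value):
        if value > threshold:
            return "gain"
        if value < -threshold:
            return "loss"
        return "normal"

    segments = []
    for i, value in enumerate(smoothed):
        label = call(value)
        if segments and segments[-1][2] == label:
            # Extend the current segment to include this probe.
            segments[-1] = (segments[-1][0], i, label)
        else:
            segments.append((i, i, label))
    return segments

# Hypothetical log2 ratios for ten probes ordered by genomic position.
log2_ratios = [0.0, 0.1, -0.1, 0.9, 0.8, 1.0, 0.7, 0.0, -0.1, 0.1]
segments = call_segments(running_median(log2_ratios))
# segments == [(0, 2, "normal"), (3, 6, "gain"), (7, 9, "normal")]
```

Both the window size and the threshold are tuning parameters, which is the kind of choice the interview says has to be made well at each step.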
What were the main take-home points from your evaluation? The paper didn't seem to identify any clear winners.
There were clearly some methods that performed well, and others that didn't. We didn't want to be too explicit. One promising approach is the hidden Markov model [aCGH], and this has been used in a variety of other bioinformatics problems. It actually didn't work very well. It's a very natural approach for this problem, but there's not just one method — there are little subtle things, a lot of parameters to tune, so I think that method, if it is implemented differently, could do much better.
What advice would you give users of these algorithms, or even people developing these algorithms, based on what you saw in the study?
One problem is that most of the algorithms are not implemented in a way that's accessible to a biologist. Most of the packages were written in R, but that's very hard for a biologist to use. And for any method to be useful, its developers should first do a better job of comparing it against the best algorithms out there. This is the same for expression array analysis, where it was a big problem and still is a problem. The other thing is for them to implement the software. Otherwise, its use will be very limited. Currently, unlike with expression arrays, there really isn't user-friendly software that lets a biologist upload and analyze their data. And I think that's just because the field is relatively new.
What are people using now? Obviously, the field is in its infancy, but people are using these chips, and vendors are selling them, so it seems like there's a bit of a gap there for people who need to analyze the data.
Some people who happen to have good statisticians as collaborators are doing well, and others are not doing well. I think a lot of software tools are in development, but they just haven't gotten far enough yet. Certainly, those programs don't include the latest algorithms because those algorithms have just been published in the last year or two.
So it's a question of implementing them in some form that a biologist would be comfortable with using.
Yes, and actually it's the same thing with tiling arrays. We've been using some tiling arrays, and we've been getting some very exciting results, but the underlying mathematical problem is exactly the same for tiling arrays.
The spatial problem?
That's right, because we're looking for things that are binding to the DNA, and that usually covers several probes that are aligned in a row.
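As a rough illustration of "several probes that are aligned in a row," the following Python sketch reports runs of consecutive probes whose signal exceeds a threshold. The function name, the threshold, and the minimum run length are hypothetical, and this is a toy stand-in for the idea rather than a method from the study.

```python
def enriched_runs(signal, threshold=1.0, min_probes=3):
    """Return (start, end) index pairs for runs of at least `min_probes`
    consecutive probes with signal >= threshold."""
    runs = []
    start = None
    for i, value in enumerate(signal):
        if value >= threshold:
            if start is None:
                start = i          # a new candidate run begins here
        else:
            if start is not None and i - start >= min_probes:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last probe.
    if start is not None and len(signal) - start >= min_probes:
        runs.append((start, len(signal) - 1))
    return runs

# Hypothetical enrichment values for probes ordered along the genome.
signal = [0.1, 1.5, 0.2, 0.0, 1.2, 1.8, 1.4, 1.1, 0.3, 0.2]
runs = enriched_runs(signal)
# runs == [(4, 7)]; the isolated high probe at index 1 is ignored
```

Requiring several adjacent probes is what uses the spatial structure: a single high probe is treated as noise, while a run of them is treated as a candidate region.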
So would the same thing apply to ChIP-chip analysis?
So would these 11 algorithms that you looked at in this study be applicable to these other applications?
Kind of. We're actually experimenting with that. The problems are a little bit different because the regions found on ChIP-chip are much smaller than, say, the chromosomal aberrations in array-CGH.
There are some other issues in ChIP-chip analysis in the basic data processing. So we've actually made some progress in this area, but we haven't published anything yet. But this is a very new field and a very exciting area for bioinformaticians.
There are two issues for tiling arrays: one is to identify these regions, and the second is to look for sequences that are common in these regions. Even [some chip vendors] don't have any software to analyze the data. They have software to view the data, but you can only view one section of the genome at a time, and a biologist has to sit there and scroll through the genome. And the data set is so large that you can't open the file with Excel, because it has about 400,000 probes and Excel can't open a file with that many lines. So that's a real problem for collaborators.
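A file that is too large to open in a spreadsheet can still be read one line at a time. The Python sketch below streams a tab-delimited probe file and summarizes it per chromosome. The column names (`probe_id`, `chromosome`, `position`, `log2_ratio`) are assumptions for illustration, not a vendor format.

```python
import csv
import io
from collections import defaultdict

def summarize_by_chromosome(handle, value_column="log2_ratio"):
    """Stream a tab-delimited probe file and return, per chromosome,
    a (probe count, mean value) pair without loading the whole file."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for row in csv.DictReader(handle, delimiter="\t"):
        chrom = row["chromosome"]
        counts[chrom] += 1
        totals[chrom] += float(row[value_column])
    return {chrom: (counts[chrom], totals[chrom] / counts[chrom]) for chrom in counts}

# A tiny in-memory stand-in for a real file with hundreds of thousands of rows.
example = io.StringIO(
    "probe_id\tchromosome\tposition\tlog2_ratio\n"
    "p1\tchr1\t1000\t0.5\n"
    "p2\tchr1\t2000\t1.5\n"
    "p3\tchr2\t1500\t-0.25\n"
)
summary = summarize_by_chromosome(example)
# summary == {"chr1": (2, 1.0), "chr2": (1, -0.25)}
```

With a real file, the same function could be called on an open file handle, for example `open("probes.txt", newline="")`, where the file name is hypothetical.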
So, again, what are people doing?
Well, that's why a biologist approached me. Because they do these experiments, they get these huge files back, and they don't know what to do. So the biologists who have bioinformatics collaborators are doing very well, and others are not. It's a problem.
What is the impact of that on studies that have been published so far?
Even in the papers that have been published in Nature and such, the analysis methods are fairly rudimentary, so it's possible that someone going back and reanalyzing the data using more sophisticated methods could find some new things. But I think the papers that are published are really excellent. I'm thinking of one paper in Cell on yeast, but there have been a number of good papers using this ChIP-chip technology, and a collaborator of mine, a biologist, went to a meeting and came back and said everyone wants to do tiling arrays.
For example, we're hoping to develop a tool where biologists can upload their data and get some things back, but I think it will just take time. It's the same thing with expression data — it took some time before there were some accepted tools.
So back to your evaluation of the array-CGH algorithms, how did you end up analyzing the data that was presented to you by your collaborators?
There are a number of things we're working on. Some of these software packages have very nice visualization tools, and others don't, so we actually tried a few methods, and if maybe three or four methods that we trust find the same things, then we have more confidence in those regions. But some had very nice visualization tools, and that was very helpful. Otherwise, it's very hard to summarize the data — there are so many potential regions.
Because these algorithms are developed for different types of arrays, you actually have to understand the algorithm, and perhaps sit down with the biologist and go through the data. Each of these methods requires some tuning for optimal performance, which means that you can't blindly apply an algorithm. It requires some knowledge of the underlying method, and fine-tuning, to get the best performance.
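The consensus approach mentioned in this answer, trusting regions that three or four methods agree on, can be sketched as a simple per-probe vote in Python. The method names and probe indices below are made up, and the vote threshold of three is an assumption.

```python
from collections import Counter

def consensus_probes(calls_by_method, min_votes=3):
    """Return the sorted probe indices called aberrant by at least
    `min_votes` of the methods."""
    votes = Counter()
    for probes in calls_by_method.values():
        votes.update(set(probes))   # each method votes at most once per probe
    return sorted(probe for probe, n in votes.items() if n >= min_votes)

# Hypothetical per-method calls: sets of probe indices flagged as aberrant.
calls_by_method = {
    "method_a": {10, 11, 12, 13, 40},
    "method_b": {11, 12, 13, 14},
    "method_c": {11, 12, 13, 70, 71},
    "method_d": {12, 13, 14, 40},
}
consensus = consensus_probes(calls_by_method)
# consensus == [11, 12, 13]
```

Probes flagged by only one or two methods (such as 40, 70, and 71 here) are dropped, which trades some sensitivity for more confidence in what remains.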
Is that why you're thinking of creating a web-based service to analyze this data for people?
I've actually gotten a lot of e-mails about this article. Some people send me their data to analyze, and I've had to politely refuse; other people want code; some people ask for some kind of service where basically they can upload the data and all these algorithms would be run. But it's too much work because each algorithm is on a different platform, and basically we had to have the same platform to compare them. One was written as a web server, one was written as an R package, one was Java, so it's very difficult to prepare them. This is why people, when they publish, don't compare against other methods: it takes a long time to do this.
So I told them that we can't do that because it's too much work, and every package gets updated, so we can't keep up with updating 11 packages, but something like that would be very useful. And some kind of standard data set against which a known algorithm should be tested would be very helpful. And this is the case in many computer science fields, that there are some standard things that people use to measure their performance.
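To make the idea of a standard test set concrete, here is a minimal Python sketch that generates a synthetic profile with a known aberration and scores a caller against that truth. The generator, its parameters, and the naive threshold caller are illustrative assumptions and are not the synthetic data or the evaluation procedure used in the paper.

```python
import random

def make_synthetic_profile(n_probes=100, start=40, end=60,
                           shift=1.0, noise_sd=0.2, seed=0):
    """Return (values, truth): noisy log2 ratios with one known gained
    region covering probe indices start..end-1."""
    rng = random.Random(seed)
    truth = [start <= i < end for i in range(n_probes)]
    values = [(shift if t else 0.0) + rng.gauss(0.0, noise_sd) for t in truth]
    return values, truth

def score_calls(truth, called):
    """Return (true positive rate, false positive rate) for per-probe calls."""
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

# Hand-made example: one true probe missed and one false call.
truth = [False] * 4 + [True] * 4 + [False] * 2
called = [False, False, False, True, True, True, True, False, False, False]
tpr, fpr = score_calls(truth, called)   # 0.75 and 1/6

# Simulated example: score a naive threshold caller on synthetic data.
values, sim_truth = make_synthetic_profile()
sim_called = [v > 0.5 for v in values]
sim_tpr, sim_fpr = score_calls(sim_truth, sim_called)
```

Because the truth is fixed and shared, any two algorithms run on the same profile can be compared on the same two numbers.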
And I guess that's the same thing that happened in gene expression, where things like the Golub data set became the standard for comparisons for new algorithms.
So you used synthetic data in this particular evaluation. Is that something that you would be willing to share?
Yes. We have shared this with a bunch of people who wrote to us, and we placed it on our website so that people can download it.
For me, it was a very practical problem to begin with, and I think a lot of other people have the same problem. People wrote to us, basically asking, 'So, what should I use?' I give them maybe three or four things that we liked. Obviously a lot of people can dispute this because each algorithm requires some parameters, so someone can say, 'Hey, you didn't use our algorithm properly,' and our argument in the paper was that most biologists don't know how to tune the method, so if it requires tuning, then that's a minus for the method, right?
So it has to be relatively simple to use. But, again, people can always criticize the study.
What are the next steps for you? It sounds like there are many potential directions you could take now following this study.
There are two things, I guess. One is actually applying similar ideas to tiling arrays. The other is picking the best characteristics of some of these methods and trying to come up with a better method. Actually, a lot of these methods still do not use all the information from the data.
For example, in expression arrays, in the very beginning, a lot of studies only reported fold ratios. Five years ago, some big papers would simply list fold ratios, and then it took some time before people realized they should actually compute a p-value rather than a fold ratio. So the same thing is going to happen. There is more information from the data that is not being utilized.
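The fold-ratio point can be shown with a small Python sketch: two hypothetical genes have the same log2 fold ratio, but only one of them is convincing once the spread of the replicates is taken into account. The interview does not name a specific test; an exact permutation test is used here as one assumption-light way to get a p-value with only the standard library, and all values are invented.

```python
from itertools import combinations
from statistics import mean

def log2_fold_ratio(group_a, group_b):
    """Difference of mean log2 intensities, i.e. the log2 fold ratio."""
    return mean(group_a) - mean(group_b)

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for the difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    extreme = 0
    count = 0
    # Enumerate every way of relabeling n_a of the pooled values as group A.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:   # small tolerance for float rounding
            extreme += 1
        count += 1
    return extreme / count

# Two hypothetical genes, four replicates per condition, log2 scale.
tight_a, tight_b = [8.0, 8.25, 7.75, 8.0], [7.0, 7.25, 6.75, 7.0]
noisy_a, noisy_b = [10.0, 6.0, 9.0, 7.0], [9.0, 5.0, 8.0, 6.0]

fold_tight = log2_fold_ratio(tight_a, tight_b)   # 1.0
fold_noisy = log2_fold_ratio(noisy_a, noisy_b)   # 1.0
p_tight = permutation_p_value(tight_a, tight_b)  # 2/70, about 0.029
p_noisy = permutation_p_value(noisy_a, noisy_b)  # much larger than p_tight
```

The fold ratio alone cannot separate these two genes; the p-value can, because it uses the replicate-to-replicate variation that the fold ratio throws away.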