In May, organizers of the fifth annual Dialogue for Reverse Engineering Assessments and Methods opened participation for the annual evaluation of computational systems biology methods (BI 05/28/2010).
Since the first DREAM conference was held in 2006, the meeting's main objective has been to “catalyze the interaction between experiment and theory in the area of cellular network inference and quantitative model building in systems biology,” according to the project's website.
This year marks the fifth year of the conference and the fourth set of challenges. DREAM 5 will include four challenges: the Epitope-Antibody Recognition Challenge, the TF-DNA Motif Recognition Challenge, the Systems Genetics Challenge, and the Network Inference Challenge. Datasets for each challenge can be downloaded from the DREAM 5 website.
Winners of previous DREAM challenges have included teams from Yale University, the Genome Institute of Singapore, and Columbia University.
This week, BioInform spoke with Gustavo Stolovitzky, a scientist at IBM Research and one of the founders of the DREAM project, about past, present, and future challenges and the evolution of the reverse engineering field. Below is an edited transcript of the conversation.
How has the field of reverse engineering as a whole evolved since DREAM began? What have been the major improvements? What challenges still remain?
I think [the field] has been evolving. Because other datasets and other biotechnologies are available, it's very difficult to determine what role DREAM has played in that evolution. I think that right now we have created a robust set of gold standards that researchers check against each time they want to [know] how well their algorithms are doing. They can compare [their algorithms] with the best in a particular challenge from previous years.
The other thing that we have learned is that the combination of certain perturbations and certain algorithms appears to give a lot more intuition and more correct answers when it comes to determining which genes interact with which genes.
For example, each time we give systematic mutants as datasets in some of our network inference challenges, we have observed that the best teams sometimes are the ones who make the very simplest predictions, which is simply, 'If this changes, what else changes most?', and basically assemble that information. That seems to produce a lot more information than what we call multi-factorial perturbations.
One other thing that we have observed is that even when an algorithm is doing pretty well, when it's combined with other algorithms that are doing well, the aggregation of the algorithms produces results that are better than any of the algorithms individually. That's interesting because it means that rather than trying to see whether my algorithm is the best, what I should do is find partner algorithms that work best in complementarity with mine in order to get the most out of the data.
Give me a little background on how the project began and on past challenges.
The idea is to expose data, for which we know what the results of the analysis should be, [to] participants and [let them] make predictions that allow us to evaluate or assess the accuracy of the methods used to analyze the data. The data varies from year to year. In 2007, we started off with data that had to do with reconstructing either protein-protein interaction networks or gene regulatory networks. Since [then] we have explored other [types of] systems biology data. That includes not just big systems and the inference of networks but predictions of what would happen if a perturbation occurred in a system.
This year we have four challenges. One of the challenges is predicting the binding specificity of peptide-antibody interactions. Basically we are asking, of this set of several thousand peptides, which ones are going to be recognized by typical antibodies in our bloodstreams. The second challenge is something similar. We are trying to find out the extent to which we can predict the binding of a transcription factor to regulatory elements using protein-binding microarrays.
[The] third is a systems genetics challenge and we are trying to understand the extent to which we can leverage genetic information and gene expression to predict phenotypes. In this particular case, we have a dataset from soybean. We have a lot of [soybean] microarrays and a lot of recombinant inbred lines, which are lines that are basically homozygous in all loci. What we are asking is, 'Can you predict the phenotype?,' which in this case is how susceptible to some pathogens these plants are. The last challenge is on network inference. We are asking for the prediction of the gene regulatory networks of three organisms and a fourth in silico network.
These challenges are independent and people can participate independently. There are separate communities that work on protein and antibody interactions and on network inference so typically people that participate in one challenge do not participate in other challenges.
So far, participation has been very encouraging. In the network inference challenge, we have 142 downloads, in the systems genetics challenge we have [about] 50 downloads, [about] 130 in the challenge in the transcription factor-DNA motif challenge, and about a hundred or so in the epitope-antibody recognition challenge.
Have you seen an increase in the number of entries since you began?
The number of teams that participated in previous years has grown systematically. In DREAM 2 we had 36 teams, in DREAM 3 we had 40, and in DREAM 4 we had 53 teams.
We have been kind of putting some pressure on the community in the sense that as opposed to other challenges that occur every so often or every other year, we have been releasing challenges every year.
We have also been offered data. In the first DREAM, we had to [ask researchers] to provide data. This year we practically didn't have to ask for any dataset because we had more datasets than we could use.
Have winning entries from previous challenges been adopted by the larger research community?
These things percolate through the community slowly. I think that we are giving the winning entries a forum to [publish] their algorithms and results. For example, PLoS One has an online collection of articles pertaining to DREAM 3, and we are creating an equivalent for DREAM 4.
I would say that it's a little bit too soon to expect dramatic change because people tend to be very attached to the methods that they develop. If they see that their method is not working well they will try to improve it rather than adopt another one.
[One thing that's] on our radar but we haven't had the time to do it is to create a repository of algorithms from where users could pick and choose what algorithms would work best for their data. I believe that will facilitate the dissemination of algorithms that are doing better specifically in our challenges.
In the past there has been significant participation from researchers in academia. Has industry participation grown as well?
Not in sufficient numbers. I think that most of the participants are mainly academics that come from all over the world.
Have you made any changes to this year's DREAM?
Yes, this year's challenges are different from last year's, for example. We are trying to create a variety of challenges [but] we also try to keep some continuity. For example, the network inference challenges have changed a little bit: before, we were probing smaller networks of 10, 50, and 100 nodes at the most; now we have networks in the hundreds of nodes. That is a considerable change because some algorithms will not be able to run, since they only run for small networks. But overall the nature of the questions we are asking is similar.
How soon can teams start submitting entries?
The deadline for submission is Sept. 20 so we will probably be open for submissions about two weeks earlier.
There was some talk about whether you would release the names of teams that perform badly. What is your decision on that issue?
We prefer not to release the names of the [groups that] don't do well because it somehow stigmatizes those groups. This should be a community effort that helps the community create better ways of analyzing data. It might not serve that purpose to [point out which groups] didn't do well.
There are some reasons why we should. We should [let researchers know which] methods don't work. We try to describe those methods that don't seem to work very well in particular challenges in our overview articles without naming the specific researchers.
This is the fifth year of the DREAM project. The original NIH funding was for five years. Will there be a DREAM 6?
We are thinking of getting some funding. In general, we mostly used the funding for the curation of the website and for the conference. It's a very inexpensive operation. I think that we could continue without much funding from external sources if we continue to use the platforms from Columbia and the goodwill from data producers and the support from IBM.