Q&A: UC Berkeley's Steven Brenner on CAGI's Goal to Improve Phenotype Prediction


Researchers at the University of California, Berkeley, and the University of Maryland are organizing a new "critical assessment" exercise targeted at computational tools for genotype interpretation.

The Critical Assessment of Genome Interpretation, or CAGI, is a community experiment that aims to evaluate the effectiveness of computational methods used to make predictions about the impact of genomic variants on phenotypes.

CAGI is the latest in a series of similar activities in the bioinformatics community that have been modeled after the successful Critical Assessment of Techniques for Protein Structure Prediction, or CASP, project, which evaluates methods for predicting protein structure from sequence.

Other efforts in the same vein include CAMDA (Critical Assessment of Massive Data Analysis), CAFA (Critical Assessment of Function Annotations), CAPRI (Critical Assessment of Prediction of Interactions), CASD-NMR (Critical Assessment of Automated Structure Determination by NMR), and BioCreative (Critical Assessment of Information Extraction Systems in Biology).

CAGI participants will be expected to use a variety of methods to make computational predictions of phenotypes based on genotype data. A group of assessors will compare the predictions to the correct results for each of the datasets.

For the first CAGI challenge, six datasets were collected from experimental and clinical labs at UC Berkeley; Harvard Medical School; the University of California, Irvine; the University of Utah; the University of Maryland; and Lawrence Berkeley National Laboratory.

Participants are expected to predict how amino acid mutations affect the function of an enzyme, estimate the probability of an individual with a given mutation being in a cancer or control cohort based on variations in two genes associated with breast cancer, and to predict the effect of cancer rescue mutants — mutations that reactivate suppressed cancer-causing genes.
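To make the cohort-prediction task concrete, here is a minimal sketch of how probabilistic case/control predictions could be scored against known labels. The article does not say which metric the CAGI assessors use. The area under the ROC curve (AUC), the function name `roc_auc`, and the sample values are all assumptions chosen for illustration.

```python
def roc_auc(probabilities, labels):
    """Area under the ROC curve, computed by pairwise comparison.

    probabilities: predicted probability that each individual is a case
    labels: 1 for case, 0 for control (the known answers)
    """
    cases = [p for p, y in zip(probabilities, labels) if y == 1]
    controls = [p for p, y in zip(probabilities, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    # Count case/control pairs in which the case received the higher
    # predicted probability; ties count as half a win.
    wins = 0.0
    for case_p in cases:
        for control_p in controls:
            if case_p > control_p:
                wins += 1.0
            elif case_p == control_p:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical predictions for six individuals (not CAGI data).
predicted = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
actual = [1, 1, 1, 0, 0, 0]
auc = roc_auc(predicted, actual)
print(f"AUC = {auc:.3f}")
```

An AUC of 1.0 would mean every case was ranked above every control, while 0.5 is what random guessing would be expected to achieve.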

Another dataset provides a list of phenotypes; participants must predict the probability that each of ten individuals drawn from the Personal Genome Project has any of the listed phenotypes, as well as any additional ones.

A fifth dataset contains 54 breast cancer cell lines, and participants are expected to predict the response of each cell line to a group of drugs. In the final challenge, participants are provided with a list of SNPs and asked to make predictions about the molecular mechanisms underlying disease risk.

Initial datasets were posted on Sept. 10, predictions are due by Nov. 17, and the results will be presented at a workshop held at Berkeley on Dec. 10.

This week, Steven Brenner, one of the organizers of CAGI, spoke to BioInform about his vision for the project. Brenner is co-chairing the experiment with John Moult; Susanna Repo, a postdoctoral researcher in Brenner's lab, is the CAGI coordinator. Below is an edited version of the interview.

When did the idea for CAGI first come into being?

I can’t tell you exactly when the idea came up. It was probably something on the order of two years ago that the idea was first being formed. It has really come together almost entirely within the last three months or so, when we ramped up the effort to make it happen.

Why did you choose to set this up as a community experiment?

The model which it draws upon is CASP, which had a really dramatic impact in its field. In [protein structure prediction's] earlier years, people did not know how well each of the different methods worked and would assess them in their own idiosyncratic way. Equally important is that people didn’t know what the real problems in the field were. As it turned out, most thought that if only they could get their physical simulations to run longer — for a second's worth of time as opposed to a millisecond or a microsecond — then they would have really great protein structures. CASP really had an incredibly important impact in the field in terms of letting people understand how their methods work, understanding what the real challenges were, and understanding what were the real breakthroughs in the field.

My feeling was that the field of genome interpretation is in a similar state of uncertainty. There are [several] programs out there to look at a genotype and predict what the phenotype is going to be like, but generally people do not know how these methods compare against one another. There is little sense of what distinguishes them methodologically or in terms of how they operate in practice. Moreover, there are many cases where the actual problems that biologists need to solve are different from the ones that the methods are trying to address.

One part of CAGI is aimed at assessing these different methods and learning how they perform relative to each other. Another goal is to put out there some of the challenges that are different from what the methods have addressed so far and that are going to be increasingly relevant as we accumulate more and more genomic data.

A final role is in bringing the community together. We just had an informal gathering at the [American Society of Human Genetics annual meeting], and many of the people who came to talk about CAGI knew each other from their publications and their work but had not met in person. Bringing together people with diverse backgrounds and distinct approaches, and getting them to talk to each other, plays an important role in improving what's happening within the field.

One thing I want to emphasize is that this year in particular, we call it 'pre-pro-CAGI.' The term pre-pro is used to describe an early version of an enzyme before it is fully active. Our view is that this year's version of CAGI is almost a dry run to learn how we actually want to run the CAGI experiments of the future, because it's been done on very short notice. The first datasets were only made available a couple of months ago and the most recent ones only about a week ago, so people have very little time to make their predictions and the assessors will have very little time for the assessment. So we know we are not going to get it right the first time, but we are doing our best and then we are going to get feedback.

You mentioned that some tools in the field don't adequately address the biologist's needs. Can you give me some examples of this?

Let me note that [while] there are a handful of other methods which address other questions, most methods that exist in the field are developed to look at a single nucleotide variant and say, 'Does this have a deleterious impact either by causing a nonsynonymous mutation or a splicing change?'

All the challenges we have for the CAGI this year, other than one, address questions which are outside the realm of what people normally attempt to do with the largest fraction of the methods. My hope is that putting these datasets out there helps advertise to the community that these challenges do exist and will spur new developments for addressing those types of questions.

Are participants expected to use only methods that they have developed?

Predictors basically can use whatever methods they wish. In most cases the predictors are using their own methods while some predictors are applying their own methods alongside other methods as well. There is a possibility that some predictors will use a battery of methods they didn’t develop.


So it would be fair to say that part of the goal for CAGI is to serve as an incubator for new methods?

Absolutely. That’s certainly what we would love to see, and it will help indicate which of these methods really have something that is distinctive, novel, and useful. You get so many methods out there in the community, and with each additional one it's hard to know if it's even worth looking at. Their participation in this experiment ultimately will indicate what novelty they bring and whether they are able to provide useful results in areas where other methods may not have been successful.

What's been the response so far since submissions opened up?

It's hard to tell. We have had an enormous amount of enthusiasm from different groups that are very supportive. We have had quite a number of groups register on the website and say they want to participate, but in the end one never really knows, until the submissions come in, who is really going to participate and who won't.

We have just announced the formal date for the meeting [in December] and we have already had people from the East Coast, China, and Europe all say that they are coming. So there seems to be a lot of enthusiasm in the community for it as a viable way to figure out where we stand as a community in being able to do this interpretation of genotypes and how we might proceed in the future.

What was the meeting at ASHG for and what was the outcome?

That meeting was just an informal gathering of predictors, assessors, and some people who are just observers to basically discuss what the experiment involves and to get people's feedback and advice for how we might do things differently.

Last year at ASHG we had a lot of advice and feedback, which had a big impact on what we decided to include in CAGI and how we organized it. This year it was more informational, and people seem to be very excited about the experiment.

Is there a plan to disseminate the more effective methods to the larger community?

Virtually all the methods we know of are published already as methods papers or in a research paper where the methods have been described so I don't think we need to do anything to actually encourage the dissemination of methods themselves. What we do plan to do is have one or more papers which describe the outcome of CAGI.

What CASP has actually done historically is have an entire issue of the journal Proteins devoted to the outcome of the CASP experiment. We don’t know yet whether CAGI will have enough participants this year to merit that, but we certainly think there will be at least one article that summarizes the results and potentially a series of articles if we have a significant number of contributors.

Is there any funding for this year's challenge?

At this point, there is no funding involved. People are doing the predictions with their own resources, attending the meeting with their own resources. We have a nominal registration fee for the meeting, but really there is no funding at all.

Do you plan to apply for funding?

We are going to see how the experiment goes and whether it's worth continuing in the future. I expect it will be, but we will see what sort of support would be most appropriate both for the meeting itself and also for being able to do the collection of the datasets and dissemination to predictors and collection for assessors.

We would like to be able to provide for students and postdocs to come to the meeting and to ensure better outreach to groups which may not otherwise be likely to participate. The other thing is that for CASP, what ended up being important was having an actual group of people who worked year round organizing the datasets, building infrastructure to analyze the datasets, building tools to assist the assessors, and also providing support for the assessors to come to the meeting and making sure they have support in doing their assessment.

Generally speaking, what are some trends you are seeing in the genome interpretation field?

The main thing is that there has been a dramatically increased demand for accurate methods. On the research end, because it's possible to sequence large numbers of exomes and genomes, when you find variants, you want to know [which] one is likely to be phenotypically interesting.

There are also demands arising from clinical applications. Increasingly when an individual has what appears to be a genetic disease that is not properly diagnosed, people [turn] to sequencing the exomes or genomes of the individuals and finding variants and they want to know what to do with them.

In general, the variants that are found are of unknown significance and having these methods work better can give clues as to how to interpret them better, which ones may be more interesting to explore, and perhaps to be able to decide what direction to guide the patient.

I have been told by clinicians that they are routinely using these methods as part of their pathway for analyzing variants that they discover. I also know that the methods they are using are very unsatisfactory by and large and that they don’t tell them everything they need to know. They are either insufficiently sensitive or insufficiently specific. Our hope is that this community experiment will help us actually understand where the strengths lie and how they can be improved in the future.
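For readers unfamiliar with the two terms Brenner uses, the following minimal sketch shows how sensitivity and specificity are conventionally computed for a method that flags variants as deleterious. The function name `sensitivity_specificity` and the sample calls are hypothetical and are not drawn from CAGI or from any tool mentioned in the interview.

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for binary variant calls.

    predicted: True where the method flags a variant as deleterious
    actual: True where the variant is known to be deleterious
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # correctly flagged
    fn = sum(1 for p, a in pairs if not p and a)      # missed deleterious
    tn = sum(1 for p, a in pairs if not p and not a)  # correctly cleared
    fp = sum(1 for p, a in pairs if p and not a)      # false alarms
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need both deleterious and benign variants")
    sensitivity = tp / (tp + fn)  # share of deleterious variants caught
    specificity = tn / (tn + fp)  # share of benign variants cleared
    return sensitivity, specificity


# Hypothetical calls for eight variants (not real data).
predicted = [True, True, False, True, False, False, True, False]
actual = [True, True, True, False, False, False, False, False]
sens, spec = sensitivity_specificity(predicted, actual)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

A method that is insufficiently sensitive misses variants that matter, while one that is insufficiently specific flags too many harmless ones.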