NEW YORK (GenomeWeb) – A new community challenge is being organized to evaluate methods and software for analyzing data from metagenomics experiments.
The so-called Critical Assessment of Metagenome Interpretation is being organized by researchers at Heinrich Heine University Düsseldorf, the University of Vienna, and Bielefeld University. It is intended, according to the organizers, "to evaluate [existing] methods in metagenomics independently, comprehensively, and without bias," with an eye toward helping researchers select the best tools for their analysis and interpretation tasks. It's also intended to help the community define a set of standard metrics for measuring the performance and efficacy of software developed for the field.
CAMI is the latest in a long line of community challenges organized and run by the informatics community to evaluate the plethora of tools developed and made available for tasks such as sequence assembly and protein function prediction.
This week, BioInform spoke with Alice McHardy, chair of algorithmic bioinformatics at Heinrich Heine University Düsseldorf and co-organizer of the CAMI initiative, about plans for CAMI and where things currently stand in terms of preparing for the first challenge. What follows is an edited version of the conversation.
Let's start off with some background. How did this challenge come about and what are your goals for it?
My lab is involved in generating computational tools for the analysis of metagenomes. Thomas Rattei uses a lot of different tools in the field, and Alex Sczyrba also works on computational methods for metagenomics [Editor's note: Rattei and Sczyrba are the other CAMI co-organizers]. One of the obvious things to us and to others [who either] use or develop tools in the field is that it is currently very difficult for users to identify the best tool for a particular task. Once scientists have generated metagenomic sequence data, they have different things that they want to learn from this data, and they have a hard time finding out which tools are best for these tasks. A publication describing a computational method for metagenomics usually evaluates … a tool for one or two particular scenarios, which may not be representative of all scenarios. Furthermore, a lot of different performance measures are in use, which makes comparisons between studies even more difficult. It is very hard for [users] to work out from the literature which tool is best for a particular task.
Developers, on the other hand, [spend] a lot of effort benchmarking their tools against existing algorithms. You submit a paper and you include a comparison against two or three other methods, and then usually you get reviews that say 'this tool just appeared, can you include it' … and then you do that and then you get [another] review saying 'there is now a new version [of the same tool], how about comparing against that?' That takes a lot of resources, and the value in the end may still be limited. It also takes substantial expertise to design realistic and informative evaluation experiments and data sets, as microbial communities are very complex. There are many pitfalls that can make an experiment uninformative, such as including data that were used to infer model parameters in a test data set, reporting results for only a part of the test data set, or generating a simulated data set that is unrealistically similar to the sequenced isolate genomes available as reference data. There could be more that we may not even be aware of.
What we need is a general evaluation of the existing tools on data sets that represent different commonly used experimental setups and are as realistic as possible. That was the baseline. We wanted to [design] a competition in which all the method developers can take part, so we can do an extensive evaluation of all the methods that are out there and also decide which performance measures we should apply to our tools so that they really measure what is important for different user tasks.
What kinds of metagenome analysis packages are currently available?
It's a very active field. There are three tasks where a lot of tool development is going on. The first is taxonomic binning … where the taxon can be a species or a higher-ranking group of organisms. If you resolve it down to individual strains, this is equivalent to trying to recover all the sequence fragments from one strain from [the] mixed sequence sample … and [putting] all the sequences that come from just one microbe into one taxonomic bin.
Another purpose is to get an estimate of how abundant the different taxa are in the microbial community… to say, for instance, there is a large proportion of alphaproteobacteria in there. That’s called taxonomic profiling.
Then there is another task, which is to reconstruct longer contigs of sequence from the read data. For metagenomics, because it's a mix of sequence data from different strains, the algorithms need to be adapted to that. So algorithms developed for genome sequence assembly cannot be applied straightforwardly to metagenome datasets with very good results.
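[Editor's note: The minimal Python sketch below is purely illustrative — the read IDs and taxon labels are invented, and it is not any CAMI tool — but it shows how the first two tasks relate: binning assigns each sequence to a taxon, while profiling summarizes those assignments into relative abundances. A real profiler would additionally correct for genome size and sequencing depth; the read-count proportions here are only a toy.]

    from collections import Counter

    # Hypothetical per-read taxonomic bin assignments (read ID -> taxon).
    # A real binner would derive these labels from sequence composition
    # and/or similarity to reference genomes.
    bin_assignments = {
        "read_001": "Alphaproteobacteria",
        "read_002": "Alphaproteobacteria",
        "read_003": "Firmicutes",
        "read_004": "Alphaproteobacteria",
        "read_005": "Actinobacteria",
    }

    def taxonomic_profile(assignments):
        """Collapse per-read bins into relative abundances (a toy profile)."""
        counts = Counter(assignments.values())
        total = sum(counts.values())
        return {taxon: n / total for taxon, n in counts.items()}

    print(taxonomic_profile(bin_assignments))
    # {'Alphaproteobacteria': 0.6, 'Firmicutes': 0.2, 'Actinobacteria': 0.2}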
Since you proposed CAMI, has there been much interest from the community in participating in the challenge?
We started to discuss this with the community in March of this year. There was a metagenomics meeting at the Newton Institute for Mathematical Sciences in Cambridge, and one of the focuses of that program was the CAMI competition. We used this to get a lot of the participating method developers involved and to [get] their feedback.
People have been very responsive. Eddy Rubin, the head of the Joint Genome Institute (JGI), decided to support the initiative, and the JGI will now be contributing data for the challenge. Then we have another large contributor of genomes from an initiative led by Paul Schulze-Lefert at the Max Planck Institute in Cologne.
One of the tasks in our challenge is that we want to generate simulated metagenome datasets and … put a lot of effort into making them as realistic as possible. For that to be the case, they will be generated from sequence data that is not published yet. For that, the JGI has agreed to generate genome sequence data for us, and the Max Planck Institute will also be providing unpublished data. Overall, this is a commitment of around 1,200 microbial genome sequences that are being provided to us for the purpose of this competition. There are some other contributors currently looking into this too. To minimize the possibility that the data is leaked … we also decided that, once we have all the genome data, only a very small number of people will actually participate in generating the simulated datasets. I, the [co-organizers], and a few others are currently working on preparing software for generating these simulated datasets using other data, because we don't want to look at the real data beforehand.
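[Editor's note: As a rough illustration of what such a simulator does — the toy genomes and abundances below are invented, and the actual CAMI generator models realistic community structure and sequencing errors — one can sample reads from a set of genomes according to chosen community abundances; the true strain of origin is kept alongside each read so it can later serve as a gold standard for evaluation.]

    import random

    # Invented toy genomes; the real simulation will draw on the roughly
    # 1,200 unpublished genome sequences and model platform-specific errors.
    genomes = {
        "strain_A": "ACGT" * 250,  # 1 kb toy genome
        "strain_B": "GGCA" * 500,  # 2 kb toy genome
    }
    abundances = {"strain_A": 0.7, "strain_B": 0.3}  # chosen community profile

    def simulate_reads(genomes, abundances, n_reads=1000, read_len=100, seed=42):
        """Sample error-free, fixed-length reads in proportion to abundance."""
        rng = random.Random(seed)
        strains = list(genomes)
        weights = [abundances[s] for s in strains]
        reads = []
        for i in range(n_reads):
            strain = rng.choices(strains, weights=weights, k=1)[0]
            genome = genomes[strain]
            start = rng.randrange(len(genome) - read_len + 1)
            reads.append((f"read_{i:05d}", strain, genome[start:start + read_len]))
        return reads

    reads = simulate_reads(genomes, abundances)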
We also have a Google Plus group [that] method developers can join and contribute to. We have been regularly [holding] phone conferences where people from this group can … hear about the current state of the preparations and provide feedback.
How will the first CAMI contest be structured?
Three categories of tools will be tested: the taxonomic binners, the taxonomic profilers, and the assembly tools. We are also thinking about providing different kinds of simulated datasets, namely the kinds of datasets that are commonly generated by experimentalists. So we were thinking of samples with differential taxon abundances and a time-series data set. We have some initial plans on this, but we also want to discuss it at a roundtable held at the [International Society for Microbial Ecology] conference that's coming up in late August in Seoul. We will be asking there for more community feedback.
Have you settled on metrics for evaluating the results?
We are collecting scripts with existing metrics from method developers. Then we will decide on a set of metrics to use. There will probably be several metrics that measure similar things, and then we will select [those] that are relevant to highlight different aspects of the performance. This is currently in progress, [but] at the moment we are focusing strongly on the simulated metagenome generator and on generating the data for the competition.
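[Editor's note: As an example of the kind of metric script being collected — a minimal sketch with invented data, not one of the measures CAMI has settled on — precision and recall of a taxonomic binner can be computed against the gold-standard labels of a simulated dataset.]

    def binning_precision_recall(predicted, gold):
        """Precision and recall of taxon labels against a gold standard.

        Both arguments map read IDs to taxon labels; reads the binner
        leaves unassigned are simply absent from `predicted`.
        """
        assigned = [r for r in predicted if r in gold]
        correct = sum(1 for r in assigned if predicted[r] == gold[r])
        precision = correct / len(assigned) if assigned else 0.0
        recall = correct / len(gold) if gold else 0.0
        return precision, recall

    # Invented example: three of four reads assigned, two of them correctly.
    gold = {"r1": "Alphaproteobacteria", "r2": "Firmicutes",
            "r3": "Firmicutes", "r4": "Actinobacteria"}
    predicted = {"r1": "Alphaproteobacteria", "r2": "Firmicutes",
                 "r3": "Actinobacteria"}
    print(binning_precision_recall(predicted, gold))  # roughly (0.667, 0.5)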
Are you still on track to kick off the challenge in October of this year?
I can't tell you yet exactly whether we will open the competition in October. This depends a bit on the speed with which the genomes are generated. It will likely still be within this year.
One thing that I didn't mention is that another goal for us is to define output formats that could become a standard for the different tools in the field, such as for the taxonomic binning and profiling methods. This should also facilitate tool comparisons in the future for developers.
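[Editor's note: Purely as an illustration of what a shared profiling output could look like — the file name, header, and columns below are hypothetical, not the standard CAMI is defining — a taxonomic profile might be written as a small tab-separated file that any evaluation script could parse in the same way for every tool.]

    # Hypothetical tab-separated profiling output; layout invented for illustration.
    profile = {"Alphaproteobacteria": 0.6, "Firmicutes": 0.2, "Actinobacteria": 0.2}

    with open("sample1_profile.tsv", "w") as out:
        out.write("#sample_id\tsample1\n")
        out.write("rank\ttaxon\trelative_abundance\n")
        for taxon, abundance in sorted(profile.items(), key=lambda kv: -kv[1]):
            out.write(f"class\t{taxon}\t{abundance:.4f}\n")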
We welcome everybody who wants to participate and we especially encourage all the methods developers to participate. You can register on our website for the competition. It's worthwhile to do it; you will get access to very interesting simulated data sets. We are also planning to have a joint publication with all the developers that have taken part in the competition if they would like to disclose their results and be part of it. We can't say yet where this will be published, but the issue is certainly of great importance for the entire field.