Recent improvements in SRI International's PathoLogic pathway-prediction software have enabled the non-profit R&D organization to add 142 new pathway databases to its online BioCyc resource, bringing to 160 the total number of organisms in BioCyc.
Peter Karp, director of SRI's bioinformatics research group, said that the flood of new pathway information, which SRI produced in collaboration with the European Bioinformatics Institute, could help advance curation and annotation projects for many recently sequenced organisms. To encourage such projects, SRI has developed a file-sharing system that researchers can use to register and distribute curated databases.
SRI plans to automatically regenerate the collection twice a year, and add new genomes as they are sequenced, "but we can't curate all these databases," Karp said. Under the adoption model, he said, "we're hoping that scientists will take over ongoing updating and curation of these databases … because these are the people who know the organisms best, and I think they're the people who are best positioned to update them on an ongoing basis."
Karp said that "about a dozen" databases have been adopted in this manner so far, including those for Mycobacterium tuberculosis, Shewanella oneidensis, and several Bordetella and Prochlorococcus species.
"The key idea is to have one group updating each database as opposed to having one group updating hundreds of databases," Karp said. He contrasted the effort with that of KEGG, which also offers several hundred databases, "but they're in charge of all the updating," he said. "I would argue that it's just not scalable to have one group trying to update that many databases. They're not experts in any of them, and I don't see how any one group can have the manpower to update hundreds of genomes."
Minoru Kanehisa, director of the KEGG project at Japan's Kyoto University, was unable to respond to BioInform's request for comment before press time. Several other experts on microbial genome annotation were also unable to comment before BioInform's deadline.
The 10-Minute Pathway Database
Karp said that SRI updated its PathoLogic software about a year ago so that it takes only 10 minutes to predict the metabolic pathways for an organism starting with its annotated genome. The SRI team can update the entire set of 160 databases in about a day and a half, Karp said.
PathoLogic predicts new pathways by matching enzymes in an annotated genome against enzymes in SRI's MetaCyc database of literature-derived metabolic pathways. PathoLogic also includes algorithms for predicting operons in bacteria.
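The matching idea described above can be illustrated with a toy sketch. This is not PathoLogic's actual algorithm, which the article does not detail; it assumes a simple rule in which a reference pathway is predicted when a large enough fraction of its enzymes appears in the annotated genome. The function name, the enzyme and pathway identifiers, and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def predict_pathways(
    genome_enzymes: Set[str],
    reference_pathways: Dict[str, Set[str]],
    min_fraction: float = 0.5,  # hypothetical cutoff, not a PathoLogic setting
) -> List[Tuple[str, float]]:
    """Return (pathway, fraction of its enzymes found in the genome)
    for every reference pathway at or above min_fraction."""
    predictions = []
    for name, enzymes in reference_pathways.items():
        if not enzymes:
            continue  # skip empty pathway definitions
        fraction = len(enzymes & genome_enzymes) / len(enzymes)
        if fraction >= min_fraction:
            predictions.append((name, fraction))
    # Best-supported pathways first; ties broken by name
    return sorted(predictions, key=lambda item: (-item[1], item[0]))


# Made-up reference pathways and genome annotation (placeholder identifiers)
reference = {
    "pathway_A": {"enz1", "enz2", "enz3"},
    "pathway_B": {"enz4", "enz5"},
    "pathway_C": {"enz6"},
}
genome = {"enz1", "enz2", "enz5"}

result = predict_pathways(genome, reference)
for pathway, fraction in result:
    print(f"{pathway}: {fraction:.2f} of enzymes found")
```

In this toy data, pathway_A and pathway_B clear the cutoff and pathway_C does not. Lowering the cutoff predicts more pathways at the cost of more false positives.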
The 142 new pathway databases are categorized as "Tier 3" databases under SRI's ranking system, which rates databases according to their level of manual curation. SRI's EcoCyc Escherichia coli database is a Tier 1 resource, for example, because it has been manually curated for more than a year, while Tier 2 databases, such as HumanCyc and others, have undergone up to four months of curation.
SRI warns on its website that the new computationally predicted databases "should be treated with due caution" because the software "is tuned to err on the side of over-predicting pathways to bring them to scientists' attention, rather than under-predicting pathways."
Karp said that the SRI team and Christos Ouzounis' team at EBI split up the work for the initial creation of the new databases. "His group tackled the eukaryotes and we tackled the non-eukaryotes," he said. SRI also created a new batch-processing mode to automate the database creation. "Previously, the only way to run the software was interactively and it took a lot of clicking to get through the creation of a single database," he said.
New Data, New Opportunities
Karp said that SRI decided to expand the number of databases in its BioCyc collection for several reasons. First of all, he said, SRI conducted a survey of its EcoCyc users, "and they told us that they wanted to see ortholog links to many more organisms."
However, he said, "We also saw it as a way to generate pathway information to extract more knowledge from sequenced genomes by making predicted pathways available for many more genomes. … I think even without curation, the resource that we're providing will illuminate aspects of these organisms that were previously unknown, but when you couple that with the adoption model, and the ability then of people to share the databases using the peer-to-peer database sharing techniques that we're making available, that will add a whole new level."
Karp said that the new data should also help advance the nascent field of pathway informatics, particularly in the area of comparative pathway analysis. "I think that is a whole unexplored frontier of bioinformatics: comparative analysis of pathway maps of many different organisms," he said. "I'd like to see a whole number of people enter the area and start studying that issue."
Karp said that SRI plans to release its own comparative pathway analysis software tool in August or September, but added that "that's just the tip of the iceberg, and we and others need to develop new tools."
The new data also serves as "a challenge to the experimentalists" to validate the accuracy of computationally predicted pathways. "There's relatively little information out there to assess the accuracy of these things," Karp said.
For example, metabolomics experiments will be required to confirm the existence of predicted metabolites in predicted pathways. In addition, Karp suggested, experimentalists will need to take a closer look at the correlation between gene expression experiments and pathways. "Despite all the hype about using gene expression data to recognize pathways, I don't think anyone has yet studied how well-known pathways can be even recognized in gene expression experiments," he said.
Despite advances in computational methods such as PathoLogic, Karp said that experimental evidence remains a crucial component of high-throughput biology, and that SRI's database-adoption initiative is one step toward bringing the computational and experimental communities together. "I still believe that scientists can add a lot of value on top of any automated pipeline, and that that's really critical," he said. "We really need to combine automated processing with scientific expertise."
Bernadette Toner