Trey Ideker of the University of California, San Diego, is revving up development of a database of molecular network models, and is encouraging the community to submit to the resource so researchers have plenty of network information to mine.
Called Cell Circuits, the open-access database is designed to “bridge the gap between databases of individual pair-wise molecular interactions and databases of validated pathways,” according to the project website. The resource is envisioned as a “clearinghouse in which theorists may distribute or revise models in need of validation and experimentalists may search for models or specific hypotheses relevant to their interests.”
Ideker told BioInform that Cell Circuits is not “quite ready for prime-time, but it’s getting there.” Ideker estimated that there is still six months to a year of development left on the database.
At the moment, Cell Circuits only holds information from around 20 publications, due to “our limited ability to curate publications from our end,” Ideker said. “We don’t have the curation firepower.”
Ultimately, he said, he wants to encourage scientists to submit their own data. One option for encouraging more submissions is to take advantage of the large user base for Cytoscape, the open source network visualization tool that Ideker helped originate and that now boasts more than 50,000 users.
“I can do a Bill Gates move and leverage Cytoscape to get Cell Circuits up and running just like Bill Gates uses Windows to advertise Microsoft Office … and Internet Explorer,” he said. “I can provide a hard-coded link from Cytoscape to Cell Circuits” — a capability his team already has developed, but not yet released with Cytoscape.
Ideker said that the database will fill an important niche. “We think there are good databases of interactions, there are good databases of pathways, but what is needed is a database of working pathways and working network models that aren’t yet super-well validated,” he said.
Well-curated pathways are only a fraction of all pathways that will end up being mapped, he said. “We publish lots of papers, other people publish lots of papers where supplemental table 20A will be a list of a ton of all these pathways and network modules we think are really present in the cell,” he said. “Those pathways are not yet in canonical lists of pathways, but nevertheless you would like to get them out there in the community and vetted and that is what Cell Circuits is.”
Of his own practices in submitting protein network predictions, he readily admits, “we have [not] been as good as we should be getting our predictions into databases.”
In order to foster more interaction, Ideker’s team recently developed a new function for Cell Circuits called “submit my data,” which lets users upload complex files such as network models and the graphics that go along with those models. This feature is still going through preliminary user feedback and testing, he said.
In addition, full integration with Cytoscape is currently being circulated to a few users and debugged.
Scientists with network data currently need to contact model organism databases, tell them about their dataset, and about the validation of the data through experiments. “For fly predictions that would be FlyBase, in the case of yeast that would be Saccharomyces Genome Database and others, [but] the problem is that these people are over-worked and under-funded,” he said.
The challenge lies not with the database curators, who are usually “very amenable to putting the data in,” he said. “It is just that they have a million other things to do.”
Ideker said that broader availability of protein network data should help advance the development of new computational methods. For example, in a recent paper in Nucleic Acids Research he and his colleagues used publicly available protein interaction data to computationally predict protein localizations.
Ideker and his colleagues relied on several molecular interaction databases, such as BioGrid, the Database of Interacting Proteins, and the Saccharomyces Genome Database, as training sets for the machine-learning approach they used.
“The paper brings into focus that you can do that much better if you have knowledge of protein networks,” he said. “It was a little surprising to us that no one had extensively published on that before.”
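As a rough illustration of why network knowledge can help with localization, the sketch below assigns an unannotated protein the compartment most common among its annotated interaction partners. This is a simple majority-vote stand-in, not the machine-learning method described in the paper, and the protein names, interactions, and function names are all hypothetical.

```python
from collections import Counter, defaultdict


def build_adjacency(interactions):
    """Turn a list of (protein_a, protein_b) pairs into an undirected adjacency map."""
    adjacency = defaultdict(set)
    for a, b in interactions:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def predict_localization(protein, adjacency, known):
    """Return the compartment most common among annotated neighbors, or None.

    Ties are broken alphabetically so the result is deterministic.
    """
    votes = Counter(
        known[neighbor]
        for neighbor in adjacency.get(protein, ())
        if neighbor in known
    )
    if not votes:
        return None  # no annotated interaction partners to learn from
    return max(sorted(votes), key=lambda compartment: votes[compartment])


# Hypothetical toy data: P1 and P5 have no known localization.
interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P5", "P4")]
known = {"P2": "nucleus", "P3": "nucleus", "P4": "cytoplasm"}

adjacency = build_adjacency(interactions)
prediction_p1 = predict_localization("P1", adjacency, known)
prediction_p5 = predict_localization("P5", adjacency, known)
prediction_p6 = predict_localization("P6", adjacency, known)
```

In this toy network, two of P1's three partners are nuclear, so the vote favors the nucleus. A protein with no annotated partners gets no prediction at all, which is one reason broader availability of interaction data matters for approaches of this kind.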
In the study, his team predicted new localizations for 7,058 fly and 4,366 proteins. While Ideker and his colleagues did not validate all their predictions experimentally, they did use fluorescence microscopy to study several predictions in yeast and determined that “in some cases … it appears that network-based predictions can correct or complement the image readouts of high-throughput experiments.”
“Both journals and database curators understand that properly validated predictions are extremely useful,” he said. “That is why we included in our paper experimental evidence showing that what we are saying is true.”
Computational methods such as the one Ideker and colleagues developed may help define confidence values for data from high throughput experiments, Sandra Orchard, senior scientific database curator for IntAct and UniProt/Swiss-Prot, told BioInform.
IntAct, hosted by the European Bioinformatics Institute, is an open source database for protein interaction data and contains over 50,000 proteins and over 170,000 interactions.
Orchard said that it might be possible to divide the data in the database, concentrating, for example, on low-throughput research that focuses on single interactions and for which much data has been amassed, which could be considered “high-confidence data.”
That data could serve as a training set for high-throughput datasets from yeast two-hybrid library screens, she said, noting one could “use the algorithm as one method of predicting how confident you are in the individual reactions within these high-throughput sets.”
Orchard said that most authors who submit high-throughput data do not perform confidence analysis, “but there are many, many ways of doing confidence analyses and this could be yet another one that you could [apply] across the whole database.”
In addition, she noted, Ideker’s method is “technique independent,” so “you could apply it to almost all large data sets across the database.”