The pathway database universe is constantly expanding, but it appears that there’s always room for one more. Later this month, the National Cancer Institute will launch the Pathway Interaction Database, an offshoot of its current Cancer Molecular Analysis Project that aims to improve the quality of currently available pathway data.
The database, developed in collaboration with the Nature Publishing Group, is built upon KEGG and BioCarta pathways. A prototype version, available online, currently contains around 5,000 interactions, 2,200 proteins, 350 complexes, and 2,300 small molecules derived from 99 KEGG metabolic pathways and 122 BioCarta signaling pathways.
The complete version of the database, scheduled to launch before the end of November, will be available at a dedicated website.
BioInform this week spoke to Carl Schaefer, NCICB bioinformatics scientist and project lead, to get a bit more detail about the project.
Can you tell me about the Pathway Interaction Database and the motivation for creating it?
For a long time, the NCI has had the Cancer Molecular Analysis Project (CMAP) website, and one of the things we did on that site was to take predefined pathways from the KEGG metabolic pathways and the BioCarta signaling pathway diagrams and decompose them into individual interactions. This allows people to access the interactions that underlie the predefined pathways and to create novel networks out of them.
So we created the database that holds these as well as some web applications that allow users to query them and produce graphic representations of either the predefined pathways or novel pathways, and also to do some sorts of analysis on these.
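As a rough illustration of what decomposing predefined pathways into individual interactions might look like, the Python sketch below pools the interactions from two toy pathways and records which pathways each one came from. The pathway names, molecule names, and function names are all hypothetical and are not taken from the NCI database or its schema.

```python
from collections import defaultdict

# Hypothetical toy data: each predefined pathway is a list of
# (source, target) interactions between placeholder molecules.
pathways = {
    "pathway_A": [("P1", "P2"), ("P2", "P3")],
    "pathway_B": [("P2", "P3"), ("P3", "P4")],
}


def decompose(pathways):
    """Map each individual interaction to the set of pathways it appears in."""
    interactions = defaultdict(set)
    for name, edges in pathways.items():
        for edge in edges:
            interactions[edge].add(name)
    return dict(interactions)


def interactions_involving(interactions, molecule):
    """Return every pooled interaction that mentions the given molecule."""
    return sorted(edge for edge in interactions if molecule in edge)


pool = decompose(pathways)
shared = {edge for edge, sources in pool.items() if len(sources) > 1}
```

Once the interactions are pooled this way, a query no longer has to respect the boundaries of the original diagrams, and an interaction that appears in more than one pathway (here, the one collected in `shared`) is stored only once.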
One of the issues with the BioCarta diagrams is that they are not carefully curated. They were posted on BioCarta’s website by experts who were sort of self-appointed. That doesn’t mean they weren’t good. It just means that they were self-nominated and the diagrams were not subjected to formal review of any kind.
And then there were some other limitations. There were very few citations to the literature backing up the diagrams. And there were no evidence annotations — the BioCarta diagrams did not say for each interaction what sort of evidence there was that allowed us to claim this.
In general, they did not indicate post-translational modifications on proteins, and the entities that represented proteins were tied to gene identifiers rather than to specific protein identifiers.
So NCI has contracted with Nature Publishing Group to curate some signaling pathways to correct for all these defects. They are being reviewed by experts. There will be citations to the literature at the level of the individual interactions. There will be evidence codes put on the various interactions. We’ll be indicating physical post-translational modifications, and we’ll be tying the protein entities to protein IDs rather than simply to gene IDs.
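To make the list of planned annotations concrete, here is a minimal Python sketch of how a single curated interaction could be modeled, with protein-level identifiers, post-translational modifications, evidence codes, and citations attached to the interaction itself. The class names, field names, and placeholder values are assumptions made for illustration only; they do not describe the database's actual data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Participant:
    """A protein taking part in an interaction (hypothetical model)."""
    protein_id: str            # protein-level identifier, not a gene identifier
    modifications: tuple = ()  # post-translational modifications, if any


@dataclass
class Interaction:
    """One interaction carrying its own evidence codes and citations."""
    participants: list
    evidence_codes: list = field(default_factory=list)
    citations: list = field(default_factory=list)

    def is_documented(self):
        """True only if the interaction has both evidence and a citation."""
        return bool(self.evidence_codes) and bool(self.citations)


# Placeholder values only; none of these are real identifiers or codes.
curated = Interaction(
    participants=[
        Participant("PROTEIN_ID_1", ("phosphorylation",)),
        Participant("PROTEIN_ID_2"),
    ],
    evidence_codes=["EVIDENCE_CODE_1"],
    citations=["CITATION_1"],
)
uncurated = Interaction(participants=[Participant("PROTEIN_ID_3")])
```

Attaching evidence and citations to each interaction, rather than to a whole diagram, is what would let a user check the support for any single step in a network.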
In addition to that, the website is going to be made considerably more user-friendly. The original site was put up there without a great deal of care to how easy it was for folks to use.
Will this curation effort be limited to the information that’s already in BioCarta?
Not really. What we’re going to do is keep the original BioCarta data as it is, and that will be available on the new site, but we will be adding the curated data — some of which will overlap with the existing BioCarta data, and some of which will be new and different.
We are not going to go about recreating each individual BioCarta diagram, and there’s a good reason for this. The original set of diagrams actually has a lot of overlap. If you go into the set of BioCarta diagrams you can find maybe 20 different pathways, all of which have to do with some aspect of apoptosis, programmed cell death. So there wasn’t a lot of sense in trying to just stay with the original structure imposed by these diagrams, which, as I said, were not put up there representing any one single overview of all biological processes. They were just put up by people who had an interest in this aspect, an interest in that aspect. And the result of that is that there is a certain amount of overlap among those diagrams.
So we aren’t feeling compelled to just fix up that set. We are keeping that set because in fact some people have found it useful, but we are adding a whole new set of more highly curated and, we hope, more valuable data.
Why did you select Nature Publishing Group to do the curation work?
Given their very well-respected place in scientific publications, their high standard of accuracy in their publications, their heavy involvement in signaling data, for example — they’re also involved in the Alliance for Cell Signaling database — it seemed like a very logical partnership.
What are your plans for updating the database after it launches?
We’ll be updating it periodically, but the exact period has not been determined yet. It may be as infrequent as once a quarter, it may be more frequent than that. But the curation effort will continue. The data that’s available at launch will definitely be expanded on in the future.
Are there other databases available that contain similar information, or would you consider this to be unique among available pathway databases?
The database that it’s probably closest in spirit to would be Reactome. There are a number of differences. Reactome is attempting to cover both metabolic and signaling pathways, while we are concentrating at this point on signaling pathways, although there’s no reason that we couldn’t include metabolic. In general, a lot of cancer research has found the signaling pathways to be a more fruitful area of investigation. That’s one reason why we’re doing that.
We are not including data from organisms other than human. Reactome says that they’re basically human biology, but they include a great deal of biology from other organisms and the reason they do that is because a lot that we know in human is inferred from orthologous data in other organisms.
A large difference is that Reactome is basically set up to give you views into predefined pathways, and as I mentioned earlier, our database is actually structured as a set of interactions and we can sort of assemble representations of novel pathways.
For example, you might be a cancer researcher who has a list of genes that have been found to be mutated in a particular cancer phenotype, and you might wish to know, ‘Is there any sort of functional coherence or connectivity among this set of mutated proteins?’ So using the Pathway Interaction Database, you can feed in this list of interesting proteins and ask to have networks constructed, which attempt to connect these proteins with each other.
So the resulting network may not — in fact, probably won’t — be exactly any of the predefined pathways, but it does give you a view into the functional connectivity among these proteins, and it may therefore give you a clue to what sort of biological mechanisms the cancer phenotype is using.
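The interview does not say how the database builds these networks. One simple way to picture the idea, sketched below in Python, is to treat the pooled interactions as an undirected graph and join every pair of query proteins by a shortest path found with breadth-first search. The function names, the toy interactions, and the choice of shortest paths are all assumptions made for illustration.

```python
from collections import deque
from itertools import combinations


def build_adjacency(interactions):
    """Build an undirected adjacency map from (a, b) interaction pairs."""
    adj = {}
    for a, b in interactions:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def shortest_path(adj, start, goal):
    """Breadth-first search; returns a list of nodes, or None if unconnected."""
    if start not in adj or goal not in adj:
        return None
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(adj[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def connect(interactions, query):
    """Union of shortest-path edges between every pair of query proteins."""
    adj = build_adjacency(interactions)
    edges = set()
    for a, b in combinations(sorted(set(query)), 2):
        path = shortest_path(adj, a, b)
        if path:
            edges.update(frozenset(pair) for pair in zip(path, path[1:]))
    return edges


# Hypothetical interactions pooled from several predefined pathways.
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P5", "P6")]
network = connect(interactions, ["P1", "P4", "P5"])
```

In this toy case P1 and P4 are joined through P2 and P3, while P5 stays unconnected because no interaction links it to the others. The resulting network need not match any one predefined pathway.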
Does this database support any available pathway data standards?
We’ll be exporting BioPAX Level 2.