The National Cancer Institute Center for Bioinformatics wants you to use its data — along with its bioinformatics tools, middleware, ontologies, vocabularies, and other resources. With last month’s 1.0 release of its caBIO set of APIs, the NCICB effectively dropped a welcome mat in front of the already-open door to its bioinformatics infrastructure toolkit.
caBIO (Cancer Bioinformatics Infrastructure Objects) serves as the primary programming interface to caCORE, a broader bioinformatics platform that the NCICB has been developing for over four years. While a Java-based beta version of caBIO has been available since October of last year, the 1.0 release offers a robust set of three APIs that bioinformatics programmers of varying skill levels can use to suit their needs, said Peter Covitz, director of the NCI’s bioinformatics core infrastructure.
caBIO acts as an abstraction layer that developers can use to retrieve data “in a programmatic way” from the NCICB’s Cancer Genome Anatomy Project and Genetic Annotation Initiative, as well as 14 other sources including GenBank, Unigene, Homologene, LocusLink, Ensembl, RefSeq, BioCarta, GoldenPath, and DAS servers. The NCICB aggregates these different data sources into a single database hosted at the NCI, and supports public access through the three APIs. This approach, which offers the choice of a J2EE, SOAP, or HTTP API, “makes bioinformatics developers extremely happy,” according to Covitz, because they can easily plug their own programs into the different data sources. The result is a degree of flexibility that far surpasses that of data resources from the NCBI and other data providers, which offer only a single web interface to their data, he said.
Covitz noted, however, that the caBIO data sources are not yet portable for in-house installation.
Over 40 caBIO objects in the 1.0 release represent key bioinformatics entities, such as genes, chromosomes, sequences, agents, trials, and ontologies. Developers can use the APIs to obtain information on specific objects, such as sequences affiliated with a specific gene, or related groups of objects, such as genes and proteins associated with a cellular pathway.
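As a rough illustration of the kind of object navigation described above, the following Python sketch models a gene, its sequences, and a pathway as linked objects. It is not the caBIO API: the class names, fields, helper functions, and placeholder identifiers are all hypothetical, chosen only to show how a developer might move from one object to its related objects.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical stand-ins for domain objects; not the actual caBIO classes.
@dataclass
class Sequence:
    accession: str


@dataclass
class Gene:
    symbol: str
    sequences: List[Sequence] = field(default_factory=list)


@dataclass
class Pathway:
    name: str
    genes: List[Gene] = field(default_factory=list)


def sequences_for_gene(gene: Gene) -> List[str]:
    """Return the accessions of the sequences linked to one gene."""
    return [seq.accession for seq in gene.sequences]


def genes_in_pathway(pathway: Pathway) -> List[str]:
    """Return the symbols of the genes linked to one pathway."""
    return [gene.symbol for gene in pathway.genes]


# Placeholder data, invented for the example only.
gene_a = Gene("GENE_A", [Sequence("SEQ_1"), Sequence("SEQ_2")])
gene_b = Gene("GENE_B", [Sequence("SEQ_3")])
pathway = Pathway("EXAMPLE_PATHWAY", [gene_a, gene_b])

print(sequences_for_gene(gene_a))
print(genes_in_pathway(pathway))
```

In this sketch, a query about a specific object (the sequences for one gene) and a query about a related group (the genes in a pathway) are both simple traversals of the object graph, which is the general idea behind exposing the data as objects instead of flat records.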
In addition to caBIO, the caCORE infrastructure encompasses a set of controlled vocabularies for cancer research called Enterprise Vocabulary Services (EVS) and a set of common data elements for clinical cancer research stored in the Cancer Data Standards Repository (caDSR). Covitz noted that caDSR metadata does not describe clinical trials data itself, but rather the terms used in the forms patients must fill out when enrolling in the trials. The caDSR database was migrated to a new production server in July, and more sophisticated user interfaces and tools are planned for future releases.
The caBIO interfaces are available through the NCICB’s public servers, and the underlying software is available for use at local sites. caBIO 1.0 is released under a “homebrew” open source license from the NCI and SAIC, Covitz said, which permits redistribution and incorporation into commercial products, but prohibits users from adding the software to third-party tools and reselling the package as a new product.
Covitz said the NCICB welcomes contributions to caBIO from the broader bioinformatics community, and would cooperate with commercial entities interested in releasing a commercial version of the software. He added that the NCICB is also developing a flexible data wrapper object for the object model that will allow users to bring their own data into the caBIO environment.
More information on caBIO, along with full technical documentation, is available at http://ncicb.nci.nih.gov/core.
— BT