WASHINGTON — As the National Cancer Institute’s Cancer Biomedical Informatics Grid leaves infancy behind, users and developers taking part in the annual caBIG meeting here this week shared frank experiences about the technical challenges of adapting legacy tools to the caBIG infrastructure, and described an evolving culture as users and developers explore software engineering routes toward best practices, including new ways of doing business.
The caBIG project is creating a “pathway for a new model in biomedicine,” said Kenneth Buetow, director of the NCI’s Center for Bioinformatics and project leader for caBIG, in his opening speech. He explained that the goal of the project is to build an information network for the cancer research community and to create standards and methods that enable interoperable components and resources.
With more than 300 software applications, more than 40 end-user applications, a “standing infrastructure” of 46 NCI-designated cancer centers and 10 NCI community cancer centers actively deploying caBIG, and a plan to move toward integration with the broader National Health Information Network, there are rich “resources on tap” for scientists to work with data from their own labs and leverage data held elsewhere, said Buetow.
caBIG is sponsored by the NCI and is administered by the NCICB. Launched as a pilot in 2004, it began its “enterprise phase” in 2007 — an effort to more broadly deploy caBIG tools, applications, and infrastructure. The federated structure now reaches beyond the US to the UK, and more recently China, India, and Latin America.
Buetow said that caBIG has achieved “tremendous accomplishments” and created an active community of interlinked individuals and institutions, but he described barriers as well, such as tight funding, fundamental disconnects in the biomedical enterprise involving communicating data across disciplinary “silos,” and the need to widen the group of stakeholders to make caBIG more accessible to a broader community.
Keynote speaker Peter Traber, president and CEO of Baylor College of Medicine, said that high-throughput science places “new pressures” on the system, especially as it must draw closer to healthcare delivery. He noted that “the bioinformatics world is quite complicated and confused in regard to standards,” which can be “challenging” for an institution.
Fellow keynoter Louis Weiner, director of Georgetown University’s Lombardi Comprehensive Cancer Center, said that cultural shifts are necessary in order to energize the broader scientific community about large-scale projects such as caBIG. For example, he said, scientists tend “to be rather jealous and proprietary about knowledge” because it “is what drives academic success.”
Weiner told BioInform that for caBIG and similar projects to make inroads, “we have to have success stories, major discoveries, where the winners won because they collaborated, teaching the field about the importance of working together.”
Also, there “needs to be a change in the incentive structures in the academic world,” he said, such that substantive team collaboration is recognized and rewarded.
“A brilliant biostatistician or bioinformaticist who is working in a large group needs to get the same level of recognition for academic accomplishment as a lab scientist does for identifying a new gene,” he said.
This cultural shift will not spare the informatics community. The caBIG venture is about “empowering researchers to query increasingly complex data” without needing to enlist the help of software engineers or database managers for every task, said Mark Adams, program manager for caBIG and senior associate at Booz Allen Hamilton. The consulting firm has been a caBIG contractor since the project’s launch.
“In the informatics field we have been used to converting data types, writing XML, hacking Perl. However much we love to do that, if we are to include our research and clinical colleagues, we have to come up with simpler, more effective, and more straightforward ways of getting access to that data,” he said.
A New Way for the Government to Do Business
In the last several years, NCI has taken a number of steps to recruit more commercial organizations into the caBIG community.
Recently, caBIG announced the Support Service Providers Program, which gives informatics vendors an opportunity to become licensed caBIG support service providers in four areas: help desk support, adaptation or enhancement of caBIG software, deployment support for software applications, and creation and distribution of documentation and training materials [BioInform 05-16-08].
NCI also just announced five “Knowledge Centers,” each designed to support caBIG tools and infrastructure in an area in which it has expertise [BioInform 06-20-08].
One of these Knowledge Centers, managed by the Mayo Clinic and Herndon, Va.-based SemanticBits, has a focus on vocabularies. The Mayo Clinic team has technical knowledge of tools in the caBIG vocabulary domain and expertise with software such as LexBIG, Semantic Media Wiki, and Protégé. SemanticBits, meantime, will develop online resources, including the knowledge base that delivers these vocabulary resources to users seeking to build their own interoperable tools and systems.
Vinay Kumar, COO of SemanticBits, told BioInform that the three-year-old company has worked on several distributed computing grid infrastructure projects for NCI and NIH, and has developed quantitative biological tools, tools for clinical trials management, and applications involving semantic interoperability in the areas of vocabularies and metadata.
One tool that SemanticBits has developed for caBIG is caTRIP, an application that lets users post a single query across several caBIG data services, relying on the common data elements used to design those services in order to link them.
The Knowledge Centers bring experts together, he said. “If people have questions they can come to us and we will work together to address that question.”
In the next few weeks, caBIG will announce the vendors who qualify as licensed caBIG service providers under the caBIG Support Service Providers Program. Among those waiting is Fremont, Calif.-based BioPhase Systems.
“Many scientists have their data in Excel spreadsheets; we can help them migrate their data into the tools supported by caBIG,” BioPhase founder and CEO Meena Vora told BioInform. BioPhase has developed software to integrate genomic and proteomic data analysis and has the potential to integrate caBIG applications, she said.
Where Space Ends
During a session on caBIG’s Enterprise Support Network, however, some participants expressed confusion about the delineation between the Knowledge Centers and the vendor-based support program.
As Leslie Derr, caBIG’s director of community alliances, explained to BioInform, the Knowledge Centers will provide web-based assistance while support service providers will work on a fee-for-service basis to, for example, tailor an installation or perform data migration for caBIG users. “To me there are very clear distinctions there,” she said.
Knowledge Centers give researchers with expertise in a particular area of biomedicine and IT the ability to administer and monitor the caBIG infrastructure, she said. “Instead of the government maintaining expertise, [and] having that all focused within the government, we have empowered the community to provide that domain expertise.”
Miguel Buddle, an associate at Booz Allen Hamilton, told BioInform after the session that the confusion may arise from the fact that this was the first public discussion of these two new entities. “As a new concept, people are having trouble seeing that line and maybe there is more communication we have to do on that,” he said.
For high-level support, such as calling a help desk or obtaining customized training or documentation, institutions “absolutely should turn to support service providers,” he said. “The Knowledge Centers provide only a limited amount of support that is entirely web-based,” he added.
“The government doesn’t want to be in the position of competing with private enterprise in this area,” he said.
At the same time, he said he believes that this endeavor, with the government sponsoring “truly open development” of projects that are then turned over to the community, “is a pretty unique way of doing business.” “It’s hard for us to make the transition … but it’s certainly critical for the success of caBIG for it to grow,” he said.
The Day-to-Day of caBIG
When it comes to software, adaptation and adoption are quite different beasts. While many scientists realize the value of adopting caBIG tools, and describe adoption as a fairly straightforward process, the rubber does not hit a smooth road when it comes to adaptation, which calls for software engineering so that caBIG tools link to legacy systems.
Sometimes small solutions can make a big difference. Northwestern University Biomedical Informatics Center’s Gilbert Feng outlined to BioInform a “bridge” he developed called caBIO2BioC, laughingly adding that it urgently needs a shorter name.
This tool builds a connection between caBIG and BioConductor such that a query in R syntax leads to a reply from the caBIO database in XML, which, through an XML parsing library, is returned in R.
At his institution, as at many others, researchers struggle to organize, integrate, and analyze their data. “Therefore, that is very important to connect BioConductor to caBIG,” Feng said.
While there are packages that claim this connection is already possible, Feng explained that universal data retrieval between caBIG and BioConductor was previously not available.
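The round trip Feng describes, in which a query goes out, XML comes back, and a parsing library turns it into native data structures, can be sketched roughly as follows. This is a minimal illustration in Python rather than R, and it is not caBIO2BioC’s code: the XML payload, its element and attribute names, and the `parse_gene_records` helper are all hypothetical stand-ins, and no real caBIO service is contacted.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML reply, invented for illustration.
# Real caBIO responses follow their own schema.
SAMPLE_REPLY = """
<results>
  <gene symbol="TP53" taxon="human"><chromosome>17</chromosome></gene>
  <gene symbol="BRCA1" taxon="human"><chromosome>17</chromosome></gene>
</results>
"""


def parse_gene_records(xml_text):
    """Turn an XML reply into a list of plain dictionaries."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for gene in root.findall("gene"):
        records.append({
            "symbol": gene.get("symbol"),               # XML attribute
            "taxon": gene.get("taxon"),                 # XML attribute
            "chromosome": gene.findtext("chromosome"),  # child element text
        })
    return records


records = parse_gene_records(SAMPLE_REPLY)
```

In a bridge of this kind, the parsed records would then be handed to the analysis environment as a native table, so the researcher never has to touch the XML directly.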
Some adaptations require more than a software tool. As Booz Allen Hamilton’s Adams indicated in a session on adaptation, the caBIG way of doing things begins with a well-established data model annotated with standardized vocabularies. This annotated information model is converted into common data elements, and then the information model can generate the application programming interface.
Outlining various design patterns of adaptation to connect a legacy tool to a caBIG tool, he explained that these patterns entail varying degrees of software engineering. Some design patterns might apply wrappers, while others can involve a message broker to transfer a message between the tools, which may be “really good” at institutions that already have a tradition of messaging with a robust HL7 V2 messaging architecture. Others, meantime, may include the use of extract, transform, and load scripts and data warehousing.
As Adams’ colleague Reechik Chatterjee outlined, different design patterns are associated with different costs. For example, generating an API takes “a lot of effort” and users should keep the relationship between the API and the database in mind.
Doing the data mapping between the caBIG API and tables in a legacy database is “a considerable cost,” said Chatterjee, and requires experts to be on hand for the task.
That is the situation Jackson Laboratory faced when it became one of the first institutions to adopt caArray, a microarray data management system now in version 2.0. And that is why “mapping” was, for a while, not exactly Grace Stafford’s favorite term.
Stafford, senior bioinformatics specialist at the Jackson Lab, was responsible for the mapping project, which took 220 hours, she said.
The end result has been that the Jackson Lab’s internal database appears unchanged for researchers who use the system for tracking gene expression or genetic aberrations. Users can request their data be exported to caArray anytime.
“We were very eager to get our data exposed to the grid and build a state-of-the-art data analysis environment for our cancer center investigators,” said Charles Donnelly in a presentation. He directs the Jackson Lab’s computational sciences group, which helps scientists with the development of scientific applications, statistics and other kinds of analysis, laboratory management, and also caBIG deployment.
“We went to the proverbial caBIG hardware store,” he said, to find the tools needed for the caArray adaptation, but found that they were lacking.
As it turned out, the project required domain experts, software engineers, biostatisticians, bioinformaticists, research scientists, and project managers. It began with a look at how much this adaptation was going to cost and, more importantly, an analysis for the scientists at the lab to ensure that the project was going to help with research and have scientific impact.
“That is why we do this, and not just because I am a propeller head, which I definitely am, and it is really cool computer science, but it actually needs to have scientific impact,” Donnelly said.
Jackson Lab’s internally developed database tracks investigators submitting requests, stores data, tracks the tissues, manages workflow, and presents the data to researchers. “You really don’t want to disrupt that process,” he said. CaArray, on the other hand, is more of a repository with a really strong querying capability, he told BioInform.
“We chose a data standard that is accepted by the caArray database called MAGE-TAB,” said Donnelly. The Jackson Lab adapted a prior version of caArray that had an auxiliary program that would take MAGE-TAB files and transfer them into the MAGE object model on which caArray is built.
The key was to get the mapping right, said Stafford, and to gain an intricate understanding of both the Jackson Lab’s legacy model and the MAGE-TAB format.
“MAGE-TAB is very powerful and extensible, [but] that also means it is very complex,” she said. As part of the project, she said she needed to understand, for each field, “What do they need here, [and] how does that correspond to what we have?”
“It was one of the most interesting parts of the project, most challenging, and the most scary,” said Donnelly. “The technology we understand, the data mapping is one of the hardest things,” he said.
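The field-by-field question Stafford describes can be pictured as an explicit mapping table from legacy columns to MAGE-TAB columns. The sketch below is only an illustration under stated assumptions and is not the Jackson Lab’s code: the legacy field names (`sample_id`, `organism`, `hyb_id`, `cel_file`) and both helper functions are hypothetical, and the four column headers are written in the style of a MAGE-TAB sample and data relationship (SDRF) sheet. A real mapping would cover many more fields.

```python
import csv
import io

# Hypothetical mapping: invented legacy column -> SDRF-style column header.
FIELD_MAP = {
    "sample_id": "Source Name",
    "organism": "Characteristics[Organism]",
    "hyb_id": "Hybridization Name",
    "cel_file": "Array Data File",
}


def to_sdrf_rows(legacy_rows):
    """Rename legacy fields to SDRF-style headers, failing loudly on gaps."""
    rows = []
    for legacy in legacy_rows:
        missing = [name for name in FIELD_MAP if name not in legacy]
        if missing:
            raise KeyError(f"legacy record lacks fields: {missing}")
        rows.append({sdrf: legacy[src] for src, sdrf in FIELD_MAP.items()})
    return rows


def write_sdrf(rows):
    """Serialize the mapped rows as tab-delimited text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=list(FIELD_MAP.values()),
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Keeping the mapping in one table, and refusing to export a record with a missing field, makes each decision about what a target field needs visible and reviewable by the domain experts rather than buried in export code.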
CaArray has since done away with the auxiliary program and directly accepts MAGE-TAB files. According to a caBIG release, caArray’s key developers include NCICB, 5AM Solutions, Science Applications International Corporation, NARTech, and TerpSys. Early adopters include the Jackson Lab, Columbia and Washington Universities, and Lawrence Berkeley National Laboratory.
Jackson Lab scientists are extremely interested in caBIG’s tools “if we can get them tools that offer them enhanced scientific results,” Donnelly said. What scientists are beginning to realize is that “we can integrate disparate data types based on semantically annotated information models, which is what the grid is all about, and also cross-correlate that data with human data.”
“That is where the high impact is going to come from,” he said. “Right now the scientists are very interested in the tools we have brought into the Jackson Lab, but I don’t think they have reaped the benefits of the grid yet, but it’s really coming soon.”
Among the tools that the lab’s scientists are keen on is caIntegrator, a framework that lets researchers access a variety of data types from SNP analysis to clinical trials data, said Stafford.
“Obviously they are very interested in that to see what data [are] out there,” she said. “It would be even nicer if it were on the grid and we had the tools to say, ‘Here you go, you can suck the data down that you want.’” It is a few more steps than that for scientists now, but “at least it’s a beginning,” she said.
Better access to clinical data could inform researchers’ hypotheses to work on animal models, which could lead to results that would potentially inform clinicians, she said. “It’s the entire loop — that is what caBIG is about,” said Stafford.