Web services entered the scene a few years back, accompanied by the IT industry's typical hype, but it appears that the approach may actually deliver on its promise for integrating disparate bioinformatics resources.
An informatics team at Bristol-Myers Squibb is using the approach to address some of the shortcomings of an increasingly popular method for integrating informatics applications: so-called workflow or pipelining applications, from vendors like SciTegic, InforSense, TurboWorx, and Incogen. These systems can effectively link multiple software applications into complex research workflows, but when it comes to tying them into a high-performance computing architecture, most of them fall short.
But the BMS team has found that web services may be just the thing to merge workflow and HPC. "What we're working on is actually larger in scope than just the advanced workflow pipelining tools that are now available," said Nathan Siemers, director of R&D at BMS. "I guess you could say that web services infrastructures for high-performance computing are really getting up to speed."
The BMS team is creating a multi-tiered architecture (see figure, below) that allows bio- and cheminformatics applications to run together in complex pipelines on a compute cluster, while keeping the whole integration process invisible to end-users.
BMS has enlisted the help of several vendors to develop the system. SciTegic's Pipeline Pilot is the workflow tool of choice, at the top of the stack. This sits on top of a web services layer based on the BioTeam's iNquiry software, which itself is on top of Platform's LSF cluster-management software.
Web services — XML-based standards like SOAP, WSDL, and UDDI — allow different applications to communicate with each other regardless of their operating system or programming language. "It's a way to do distributed computing and remote computing, and it's platform and language neutral, which means you can write Java, you can write Perl, you can use a pipelining tool, what have you, to solve your problems," Siemers said.
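To make the "platform and language neutral" point concrete, the sketch below builds and reads back a SOAP 1.1 request using only Python's standard library. It is an illustration, not the BMS or BioTeam interface: the operation name `runAlignment`, its parameters, and the `urn:example:sequence-tools` namespace are all invented for the example. Only the SOAP 1.1 envelope namespace is standard.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:sequence-tools"  # hypothetical service namespace


def build_soap_request(operation, params):
    """Return a SOAP 1.1 request envelope as a string."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_soap_request(xml_text):
    """Recover (operation, params) from an envelope built by build_soap_request."""
    envelope = ET.fromstring(xml_text)
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    call = body[0]  # the single operation element inside the body
    operation = call.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    return operation, {child.tag: child.text for child in call}


# Hypothetical operation and parameters, for illustration only.
request = build_soap_request("runAlignment", {"sequence": "ACGT", "database": "nr"})
print(request)
```

Because the message is plain XML, the receiving side could just as well be written in Java or Perl, or sit behind a pipelining tool, which is the neutrality Siemers describes.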
However, he noted, web services still have a number of "missing pieces." One of these is job scheduling, which is where the LSF system comes in. "You don't want the scientist sitting at the top of all of this using their favorite language or one of these pipelining tools to have to manage all [the computing resources] — you want that to be hidden," he said.
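One way to picture this hiding of the scheduler is a thin service function that accepts only a tool name and an input file, and assembles the LSF submission itself. The sketch below is an assumption about how such a layer could look, not a description of the BMS code: `run_analysis`, the queue name `normal`, and the job and file names are hypothetical. The `bsub` options used (`-q` for queue, `-J` for job name, `-o` for output file) are standard LSF flags. The command is only built and printed, never executed.

```python
import shlex


def build_bsub_command(tool_argv, queue="normal", job_name=None, output_file=None):
    """Wrap a tool's command line in an LSF `bsub` submission (not executed here)."""
    cmd = ["bsub", "-q", queue]  # queue name "normal" is an assumed default
    if job_name:
        cmd += ["-J", job_name]
    if output_file:
        cmd += ["-o", output_file]
    return cmd + list(tool_argv)


def run_analysis(tool, input_path):
    """What the scientist-facing layer exposes: no queue, host, or scheduler flags."""
    tool_argv = [tool, input_path]
    return build_bsub_command(
        tool_argv,
        job_name=f"{tool}-job",
        output_file=f"{input_path}.out",
    )


# Hypothetical tool and file names, for illustration only.
print(shlex.join(run_analysis("blastall", "query.fasta")))
```

The caller never sees the queue or the scheduler options, so those could change on the cluster side without touching any workflow built on top.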
This ease of use is expected to expand the user base for the informatics system across the BMS discovery enterprise from a current level of "dozens" to somewhere "in the hundreds," Siemers said. "The whole process of interacting with the cluster, with sending a data file to it, running a job, getting an output file back, getting your data back, your results, can all be very easily hidden," he said. "If you couldn't do that, one might say, 'Why do you want to use web services?' because it's just more complex than running something on a command line. So the power here is that you can hide that complexity with most of these tools."
The BMS team was able to take a bit of a shortcut with BioTeam's iNquiry, which has "embraced and extended some open source components and turned them into a practical and useful system for doing this," Siemers said. The core of iNquiry is a software package called PISE (Pasteur Institute Software Environment), which was developed to create web interfaces for command-line informatics packages.
"What BioTeam has done is extended that PISE infrastructure so that in addition to creating web pages that people can use, they've also created web services for every single application that's ever been wrapped in PISE," Siemers said.
The upshot, he said, is that "essentially, with no extra work, once that PISE definition is made, the web service is available" and the applications can be plugged into the workflow software and then run easily on the cluster.
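The idea of getting a service "with no extra work" from a single tool definition can be sketched as follows. This is a simplified illustration, not PISE's actual definition format or iNquiry's generated interface: the dictionary layout, the tool `align_tool`, and its flags are invented for the example. The point is only that one declarative description can drive every interface to the tool.

```python
# Hypothetical tool description; real PISE definitions use their own format.
TOOL_DEFINITIONS = {
    "align": {
        "executable": "align_tool",
        "parameters": [
            {"name": "input", "flag": "-i", "required": True},
            {"name": "gap_penalty", "flag": "-g", "required": False},
        ],
    },
}


def build_argv(definition, values):
    """Turn named parameter values into the tool's command line."""
    argv = [definition["executable"]]
    for param in definition["parameters"]:
        name = param["name"]
        if name in values:
            argv += [param["flag"], str(values[name])]
        elif param["required"]:
            raise ValueError(f"missing required parameter: {name}")
    return argv


def make_service(definitions):
    """Expose every defined tool as a callable operation, with no per-tool code."""
    def make_operation(definition):
        return lambda **values: build_argv(definition, values)
    return {tool: make_operation(d) for tool, d in definitions.items()}


service = make_service(TOOL_DEFINITIONS)
print(service["align"](input="seqs.fasta", gap_penalty=10))
```

Adding a second tool here means adding one more entry to `TOOL_DEFINITIONS`; the operation appears in `service` automatically, which mirrors the behavior Siemers describes.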
The BMS team is currently working to extend the platform — both on the application side and the computational side. Siemers said that the team has used PISE to port several cheminformatics applications to the platform, and that it next plans to add more statistical analysis tools.
One challenge the system doesn't address is semantic integration: ensuring that a "gene" from one database or application means the same thing as a "gene" from another resource. "That's tricky," Siemers acknowledged. "The current infrastructure that we have doesn't directly deal with that issue … so it's left up to the scientists to manage that."
Siemers said that he is keeping his eye on web services-based data integration efforts like BioMoby, but BMS hasn't implemented any systems of that type yet. "It's something we'll consider in the future," he said.
So far, Siemers said, BMS considers the system to be a success, using a set of "simple metrics" based on usage patterns, such as the amount of CPU time used for particular tasks, and the total number of users tied into the system.
"It's met expectations and there's a tremendous potential for growth here," he said.
Siemers will provide further details on the BMS system at the Bio-IT World Conference in Boston this week.
— Bernadette Toner