Bioinformaticists faced with a growing number of data types and formats have struggled for years to bring them all together, and vendors have been quick to jump in to meet the obvious demand for effective integration strategies.
But demand, no matter how strong, is rarely an indicator of success in the bioinformatics marketplace. Entigen’s Adaapt data integration technology didn’t keep the company in business. Other integration solutions, such as Synomics’ Alliance technology and NetGenics’ DiscoveryCenter, still live on under the auspices of Accelrys and Lion Bioscience, respectively, but sales of these tools were slow enough to force their original developers into less-than-favorable acquisition deals.
Over the last year, a new generation of data integration tools has emerged from a field of vendors claiming to have solved the problem. But as their predecessors demonstrated, the success of these companies will depend on far more than the capabilities of their technology. A wary user community, burnt by unsuccessful integration solutions, has proven to be one of the biggest challenges vendors face today. Furthermore, as the landscape of integration tools continues to grow, users are finding it difficult to distinguish among the solutions available to them: a full spectrum of "n-tier" solutions has emerged, each claiming to offer a widening "middle layer" that links an underlying layer of hardware and data to a top-level application layer. But while their PowerPoint slides may look the same, vendors in the bioinformatics data integration arena say there is far more stratification in that fuzzy middle layer than meets the eye, and that these subtle differences are what set them apart from one another.
Integrating from the Bottom Up
The workhorse of most integration approaches is the middleware — generally depicted just above the data and hardware layer in the familiar stack architecture — that does the dirty work of bringing disparate data sources to one place. But although many vendors call their products middleware, Brian Donnelly, CEO of GeneticXchange, said this term can often be misleading. True middleware, according to Donnelly, is flexible and can run “between any data source and any application,” unlike some solutions that offer “predefined route maps” between selected databases and applications or “pre-canned” integrated databases.
Furthermore, according to Donnelly, middleware is not user-friendly. “It’s designed for true programmers,” he said. His company’s product, K1, isn’t marketed to end-users. “We’re plumbers,” explained Donnelly. “We allow you to have the tools if you want to build your own building. If you want to buy a whole building you go to a realtor.”
K1 is based on a mediator-wrapper approach that permits data to remain within its source and eliminates the need for data warehousing. Data federation approaches like K1 “wrap” data sources with a single language, which a mediator then uses to coordinate SQL queries against a global model. Ideally, the federated approach is invisible to the end user, who doesn’t even need to know what data sources were used to respond to a query.
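The mediator-wrapper pattern can be sketched in a few lines. This is an illustrative toy, not K1's actual API: the `Wrapper` and `Mediator` class names, and the in-memory "sources," are hypothetical stand-ins for wrapped flat files or relational databases.

```python
# Hypothetical sketch of the mediator-wrapper (data federation) pattern.
# Class names and data are illustrative, not drawn from K1 or Kleisli.

class Wrapper:
    """Presents a uniform query interface over one data source,
    hiding its native format (flat file, RDBMS, web service, etc.)."""
    def __init__(self, name, records):
        self.name = name
        self._records = records  # stand-in for the underlying source

    def query(self, field, value):
        return [r for r in self._records if r.get(field) == value]


class Mediator:
    """Decomposes a query over the global model into per-source
    queries against each wrapper, then merges the results."""
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, field, value):
        results = []
        for w in self.wrappers:
            for rec in w.query(field, value):
                results.append({**rec, "_source": w.name})
        return results


# Two "sources" with different native record shapes, wrapped uniformly
swiss = Wrapper("swissprot", [{"gene": "TP53", "acc": "P04637"}])
embl = Wrapper("embl", [{"gene": "TP53", "id": "X02469"}])

med = Mediator([swiss, embl])
hits = med.query("gene", "TP53")  # caller never touches the sources directly
```

The point of the pattern is visible in the last line: the caller issues one query and never needs to know which sources answered it, which is the "invisible to the end user" property described above.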
K1 is the commercial offspring of the Kleisli project at the University of Pennsylvania, which was developed by GeneticXchange founder Lim Soon Wong and his colleagues when he was a graduate student there. Penn’s project has since been rechristened “K2,” although development has slowed because GeneticXchange offers an enhanced version of the software to academic groups for free.
GeneticXchange claims that its bare-bones technology doesn’t compete with other vendors’ products, which are for the most part more fully developed solutions. For example, most of the company’s customers also use Lion’s SRS for flat file querying, Donnelly said, but “use us for ad hoc querying.” He added that K1 could be used as the foundation for more fully developed integration systems such as Acero’s Genomics Knowledge platform or Lion’s DiscoveryCenter as well as for in-house integration projects. The company is currently in discussions with instrument and chip vendors regarding plans to embed K1 within their software products as well.
But while its direct competitors may be few, GeneticXchange does face one formidable challenger: IBM's DiscoveryLink middleware, which the company launched last year as part of its life sciences business initiative. However, Donnelly doesn't perceive the computer giant as a threat, largely because IBM is primarily targeting large pharma for its product while GeneticXchange is sticking to biotech. In addition, Donnelly noted that more than 70 wrappers have been written for K1 so far, and that additional wrappers can be written in "half a day."
IBM agreed that there's room for multiple vendors in the middleware space. Although DiscoveryLink is marketed as a standalone product, Sharon Nunes, IBM's director of life sciences solution development, said the company is now pushing its recently launched Life Sciences Framework, which encompasses DiscoveryLink as part of a broader life science IT architecture, as the best approach for "integration at the enterprise level." Even a standalone DiscoveryLink purchase, which comprises the DB2 database, Relational Connect, and Life Sciences Data Connect, requires a full-scale services contract with the company.
Nunes agreed with Donnelly that IBM has so far been targeting pharma for DiscoveryLink, but noted that “a fair number” of biotechs and universities are also interested in the product. And despite some concerns that the DB2 requirement may put off the estimated 90 percent of biotechs who are currently running Oracle, Nunes noted that the total ownership costs of DB2 are “one-third the cost of Oracle,” and that the query optimizer in DiscoveryLink can even speed queries on Oracle databases.
And while IBM has yet to catch up to GeneticXchange’s speed in writing new wrappers for DiscoveryLink, the company’s partnership with Lion provided instant access to more than 500 key biological data sources available through a single SRS wrapper.
Moving up the Stack
For its part, Lion is expanding the capabilities of SRS beyond the flat file domain for which the technology is best known into relational databases, XML integration, and application integration. Simon Beaulah, product manager for SRS, said a new release of the product is expected within the next few months that will support Oracle and MySQL. In addition, the company is integrating SRS into the broader DiscoveryCenter technology picked up through its acquisition of NetGenics. A complete SRS/DiscoveryCenter solution should be available by the fall of this year, according to Lion, but the company was unable to disclose further details about the combined product.
NetGenics positioned DiscoveryCenter as more of an enterprise-scale “collaborative solution” than an integration product, according to product manager Mike Bush, and Lion’s approach is the same for the time being. DiscoveryCenter’s strength lies in its ability to allow workgroups separated geographically to share data and research through a desktop client, according to Bush. Users can identify “favorite” genes or proteins and receive updates on any changes or new findings made on that data, either in the public domain or elsewhere in the enterprise. IBM’s DiscoveryLink can be used as a data integration layer beneath DiscoveryCenter, but is not an essential component of the system, Bush said.
Another differentiator of DiscoveryCenter, according to Bush, is its ability to integrate public and third-party applications in addition to data, which moves the company’s solution a step north in the middle-tier hierarchy of tools. So far, Spotfire’s applications have been integrated into DiscoveryCenter and other third-party tools have been added on a custom basis.
Another middle-tier dweller, Acero, also offers a solution to integrate data and applications, but adds compute farm management and text searching into the mix as well. Acero CEO Jim Holt said his company’s solution, the Genomics Knowledge Platform, is unique in its use of a biological object model, which maps disparate data, applications, and computational components to one another using “hundreds of scientific objects that understand their relationships and connections and inheritance with one another.” The object-oriented approach, Holt said, permits easy modification “when you update or change or plug in something new or when you add mass quantities of researchers to the system.”
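An object-oriented model of this kind can be sketched briefly. The class names and relationships below are hypothetical illustrations of the general approach Holt describes, not Acero's actual schema: each scientific object carries its own links to related objects, and shared behavior comes through inheritance, so new object types can be plugged in without rewiring the rest of the model.

```python
# Illustrative sketch of a biological object model in which objects
# "understand their relationships and connections and inheritance."
# All names here are hypothetical, not drawn from Acero's GKP.

class BioObject:
    """Base class: every scientific object tracks what it links to."""
    def __init__(self, name):
        self.name = name
        self.links = []  # relationships to other BioObjects

    def link(self, other):
        # Relationships are symmetric: both objects record the link
        self.links.append(other)
        other.links.append(self)

    def related(self, cls):
        # Navigate to related objects of a given type
        return [o for o in self.links if isinstance(o, cls)]


class Gene(BioObject):
    pass


class Protein(BioObject):
    pass


tp53 = Gene("TP53")
p53 = Protein("p53")
tp53.link(p53)

# Traverse relationships without knowing how either object is stored
print([p.name for p in tp53.related(Protein)])  # ['p53']
```

Because a new type such as a `Pathway` subclass would inherit the same linking machinery, extending the model is a matter of adding a class rather than rewriting existing integrations, which is the modifiability Holt points to.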
The object model was developed by Incyte Genomics researchers in an effort to improve the company's database offerings. The key, according to Holt, was Incyte's decision to "model science correctly." If the company had chosen to build the model around its own data, "then I don't think this would be the powerful platform that it is," said Holt.
Acero doesn't stress wrapper writing; instead, it intends to partner with content and application providers to link their resources into the system, either by writing applications within GKP's desktop environment or by writing to the company's API.
What Do Customers Want?
But as the turbulent bioinformatics market continues to demonstrate, even the best technology is no guarantee of a successful business, and the data integration field is a prime example of the biggest obstacle most vendors face today — in-house development.
Both GeneticXchange's Donnelly and Acero's Holt concede that their biggest competitors are in-house integrators writing their own Perl- or Java-based scripts. In many cases, even potential customers who are attracted to a vendor's technology won't buy because they've invested too much time and money in their own solution. On top of that, according to Donnelly, vendors face the challenge of "guilt by association" with the failed integration companies of yesteryear.
Richard Scott, head of cheminformatics at De Novo Pharmaceuticals, told BioInform that "build" won out over "buy" for his company because "we wanted flexibility for our choices for third-party applications for each type of data domain." Scott put together an integration team at the young company to build a "virtual warehouse" based on Oracle, an off-the-shelf application server, and Java.
Scott said De Novo shopped around for some commercial solutions, and based its purchasing decisions on technology that adhered to open standards. Tools built on proprietary formats were not an option, he said.
Scott summed up the demands that helped him construct De Novo’s system: “Users want unfettered access to their data that they’re interested in. They don’t want complex GUIs. They want their tools to be robust, and they want to get to where they want very quickly.”
Success in the marketplace for new data integration vendors will depend not only on their ability to meet these needs, but also on their ability to do so better and cheaper than companies can on their own. Holt said that Acero is ready for that challenge, noting that it's only a matter of time before customers paying highly skilled bioinformaticists to write Java-based integrations "look at our costs versus what they will spend in maintenance and realize they can put these people back into their core strengths of doing science."