The rapidly growing volume of data arising from recent advances in bioinformatics and genomics has quickly made data storage a key issue for many biotech and pharmaceutical companies.
According to a recent survey conducted by Silico Research, many of these companies are looking toward data warehousing as the best approach to storing, integrating, and managing that data.
While similar to large databases in many aspects, data warehouses possess a number of unique attributes. First of all, data warehouses organize information by subject rather than function. Thus, while data from a genomic database is organized around sequence information and data from a clinical database is organized around samples and tests, a data warehouse could include the same information organized under general subjects such as disease or compound.
Data warehouses also use consistent naming conventions and encoding instructions in order to ensure integration. Unlike operational databases, the data in a warehouse generally is not subject to updates or revisions.
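The subject-oriented organization described above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the record layouts, the field names, and the `normalize_subject` and `build_subject_view` helpers are assumptions made for illustration, not the schema of any product mentioned in this article.

```python
from collections import defaultdict

# Hypothetical rows from two function-oriented source systems.
# Note the inconsistent field names and capitalization between them.
genomic_rows = [
    {"seq_id": "SEQ-001", "gene": "BRCA1", "disease": "breast cancer"},
    {"seq_id": "SEQ-002", "gene": "CFTR", "disease": "Cystic Fibrosis"},
]
clinical_rows = [
    {"sample": "S-17", "test": "biopsy", "Disease": "Breast Cancer"},
    {"sample": "S-22", "test": "sweat chloride", "Disease": "cystic  fibrosis"},
]


def normalize_subject(name: str) -> str:
    """Apply one naming convention to subject keys: lowercase, single spaces."""
    return " ".join(name.lower().split())


def build_subject_view(genomic, clinical):
    """Regroup records from both sources under a shared 'disease' subject."""
    view = defaultdict(lambda: {"sequences": [], "samples": []})
    for row in genomic:
        view[normalize_subject(row["disease"])]["sequences"].append(row["seq_id"])
    for row in clinical:
        view[normalize_subject(row["Disease"])]["samples"].append(row["sample"])
    return dict(view)


subject_view = build_subject_view(genomic_rows, clinical_rows)
```

Looking up `subject_view["breast cancer"]` then returns the sequence and the sample together, even though they came from systems organized around different functions. Without the shared naming convention, the differently capitalized disease names would have landed under separate keys.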
Silico Research found that over 77 percent of the biotechnology and pharmaceutical companies it surveyed are currently deploying at least one data warehouse. The consulting firm expects that close to 100 percent of these companies will be deploying the technology by 2003.
However, use of data warehouses on an enterprise-wide scale is still very limited, according to the study, which found that most companies are deploying the technology on opposite ends of the drug discovery pipeline while leaving the development stage largely out of the picture.
“What I think we’re finding is that the new computer technologies are less applicable to the intermediate stage between pure, huge number-crunching aspects of bioinformatics and the pure number-crunching aspects of clinical trials,” said Emmett Power, CEO of Silico Research and co-author of the report. “So the guys in the middle are still shaking their test tubes.”
Steve Gardner, vice president and CTO of Viaken Systems, based in Gaithersburg, Md., said that extending data warehousing to the entire organization would be a key component of future drug discovery success.
“When you’re validating leads, for example, you want to know about gene expression studies, metabolic pathways, genomic sequence and annotation, polymorphisms in your potential patient population, market need and medical information, patents, scientific literature — a whole bunch of data sources to build a picture about whether this is likely to be a successful target to take forward into the research pipeline,” Gardner said.
Gardner noted that the ability to put that information in the hands of decision makers “is going to be not only important for the drug discovery process, it’s going to be the drug discovery process.”
Viaken substantially increased the storage capacity it can offer its customers in a recent deal with EMC that will eventually enable the storage of up to 100 terabytes of data.
So far, Viaken has deployed eight terabytes of storage for its customers, which consist of three biotech companies, one genomics company, and one life sciences company.
As part of the deal, Viaken also expanded its relationship with Exodus Communications, which will house the increased volumes of data.
Sean Kinney, marketing manager for the biotechnology and pharmaceutical industry at EMC, agreed that the industry’s current practice of maintaining several departmental warehouses is not an effective way of maximizing the value of the information.
“The industry is starting to move toward massive scalable data warehouses,” said Kinney, “but it really needs to continue to go there if it wants to accelerate the drug development timeline.”
A number of companies link independent data warehouses and databases together in what is known as a “virtual data warehouse” through some type of middleware such as IBM’s DiscoveryLink. While this approach is less costly than a large-scale dedicated data repository, it does have its disadvantages. These include markedly slower access than data warehouses provide, since warehouse data is cleaned, sorted, and organized in advance to make it more easily accessible to end users.
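The trade-off between the two approaches can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of how DiscoveryLink or any other middleware works: the `VirtualWarehouse` and `PhysicalWarehouse` classes, the source layouts, and the `clean` helper are all hypothetical.

```python
# Hypothetical source systems, each with its own field names and formats.
SOURCE_A = [{"compound": " cpd-7 ", "assay": "binding"}]
SOURCE_B = [{"Compound_ID": "CPD-7", "trial_phase": "I"}]


def clean(name: str) -> str:
    """Normalize a compound identifier (trim whitespace, uppercase)."""
    return name.strip().upper()


class VirtualWarehouse:
    """Leaves data in the sources; every query scans and cleans them again."""

    def __init__(self, source_a, source_b):
        self.source_a, self.source_b = source_a, source_b
        self.rows_scanned = 0  # running count of per-query work

    def lookup(self, compound):
        key = clean(compound)
        hits = []
        for row in self.source_a:
            self.rows_scanned += 1
            if clean(row["compound"]) == key:
                hits.append(("assay", row["assay"]))
        for row in self.source_b:
            self.rows_scanned += 1
            if clean(row["Compound_ID"]) == key:
                hits.append(("trial_phase", row["trial_phase"]))
        return hits


class PhysicalWarehouse:
    """Cleans and indexes a copy of the data once, at load time."""

    def __init__(self, source_a, source_b):
        self.index = {}
        for row in source_a:
            self.index.setdefault(clean(row["compound"]), []).append(
                ("assay", row["assay"]))
        for row in source_b:
            self.index.setdefault(clean(row["Compound_ID"]), []).append(
                ("trial_phase", row["trial_phase"]))

    def lookup(self, compound):
        # A single dictionary lookup; no source is touched at query time.
        return self.index.get(clean(compound), [])
```

In this sketch both classes return the same answer for the same data, but the virtual version repeats the scanning and cleaning on every query, while the physical version pays that cost once at load time. The flip side is that a row appended to `SOURCE_A` after loading is visible to the virtual warehouse immediately and invisible to the physical one until it is reloaded.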
Virtual data warehouses are a low-cost temporary solution, according to Power, who added, “I wouldn’t wager on virtual data warehousing as the solution to anyone’s problems.”
But James Nelson, vice president of product marketing at Entigen of Sunnyvale, Calif., countered that a virtual data warehouse is the best solution in many cases.
Nelson said that Entigen’s Java-based ADAAPT (Advanced Dynamic Access, Analysis, and Personalization) technology, which enables access to disparate databases through a browser-based graphical user interface, ensures that the data being accessed is current.
“Warehouses are often being pushed into things that they’re not really good at,” Nelson said. “Warehouses are extremely good if the data you are dealing with is essentially stable. But if you’ve got data that changes daily, because it’s being updated or because new data is being generated routinely, then a warehouse is not a very good alternative.”
While Nelson said that data warehousing “has its place,” he noted that many of Entigen’s customers turned to his company’s solution after installing warehouses that did not solve all the problems they were expected to.
Gardner, however, said that Viaken offers complete database updating facilities as part of its storage solution. Using EMC’s storage area network, Gardner said, “we can [update] it once and very quickly replicate that off to all of our customers.”
And while high installation costs are often cited as a deterrent for deploying data warehouses, Kinney said that the total cost of ownership is far lower than that of distributed storage. He cited a Red Herring study conducted last year that estimated a single storage manager would be capable of managing 200 gigabytes of distributed storage, with the human aspect of that information storage approach running up to 66 percent of the total storage budget. The same individual could manage up to 1,900 gigabytes of information in a data warehouse, with the human aspect of the total cost of storage dropping to 9 percent.
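To put the per-manager figures in perspective, the short Python sketch below works through the staffing arithmetic. The 200 and 1,900 gigabyte capacities come from the study as cited; applying them to an eight-terabyte deployment, treating a terabyte as 1,000 gigabytes, and the `managers_needed` helper are assumptions made purely for illustration.

```python
import math

# Per-manager capacities from the study cited above.
GB_PER_MANAGER_DISTRIBUTED = 200
GB_PER_MANAGER_WAREHOUSE = 1_900


def managers_needed(total_gb: float, gb_per_manager: float) -> int:
    """Whole storage managers required to cover total_gb."""
    return math.ceil(total_gb / gb_per_manager)


# How much more data one manager covers in a warehouse.
capacity_ratio = GB_PER_MANAGER_WAREHOUSE / GB_PER_MANAGER_DISTRIBUTED  # 9.5

# Assumption: an 8 TB deployment, with 1 TB taken as 1,000 GB.
total_gb = 8 * 1000
distributed_staff = managers_needed(total_gb, GB_PER_MANAGER_DISTRIBUTED)  # 40
warehouse_staff = managers_needed(total_gb, GB_PER_MANAGER_WAREHOUSE)      # 5
```

Under these assumptions, one manager covers 9.5 times as much data in a warehouse, so the same eight terabytes would need roughly 40 managers as distributed storage and 5 in a warehouse. The study's budget percentages are not modeled here, because the underlying cost figures are not given.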
However, according to Martin Sumner-Smith, president of Base4, companies that are focusing on database integration and data warehousing are missing the mark. “Access to all of the information is not the biggest roadblock for most people, at least at the raw data level,” he said.
Since most individual workers know which sources of information they need and know where they’re putting the information that they generate, “a molecular biologist really couldn’t care less about seeing the clinical trial information,” said Sumner-Smith.
Instead, Base4 focuses on project-level management of summary information resulting from researchers’ analysis so that project managers and senior management can access it.
Yet this approach would not address a situation Gardner cited in which two distributed research groups at a global pharmaceutical company worked on the same molecule for years, with neither group aware of the other project’s existence, let alone its results. In this case a shared repository for fundamental data would have prevented the duplication of years of research.
“Information is the crown jewels of your R&D. You need to respect it,” Gardner said.
Largely due to the functional and cultural differences that serve as the primary inhibitor to cross-departmental integration in the biotech and pharmaceutical industry, Silico Research estimates that the average number of users per warehouse will remain the same, but the total number of warehouses deployed will grow by 45 percent a year through 2004.
Thus, users and vendors of data warehousing technology will have to contend with interconnectivity issues rather than scalability issues.
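A rough Python sketch shows why growth in the number of warehouses points to interconnectivity as the pressure point. The 45 percent rate is Silico Research's estimate; the starting count of 20 warehouses, the three-year horizon, and the assumption that every pair of warehouses might need its own link are hypothetical and chosen only to make the arithmetic concrete.

```python
ANNUAL_GROWTH = 0.45  # Silico Research's estimated yearly growth in warehouse count


def projected_warehouses(start_count: int, years: int) -> list[int]:
    """Compound the count once per year, rounded to whole warehouses."""
    return [round(start_count * (1 + ANNUAL_GROWTH) ** y) for y in range(years + 1)]


def pairwise_links(n: int) -> int:
    """Point-to-point connections if every warehouse talks to every other one."""
    return n * (n - 1) // 2


# Hypothetical starting population of 20 warehouses, projected over three years.
counts = projected_warehouses(20, 3)          # [20, 29, 42, 61]
links = [pairwise_links(n) for n in counts]   # [190, 406, 861, 1830]
```

In this illustration the warehouse count roughly triples over three years while the number of possible point-to-point links grows almost tenfold. With users per warehouse holding steady, the load on each system stays flat and the integration burden is what climbs.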
In order to move toward an interconnected enterprise-level data warehouse, Silico Research recommends that biotech and pharmaceutical companies engineer their local warehouses for eventual integration into a single system that links the discovery, development, and clinical trials teams, even if such an integrated system is not an immediate goal.