When it came time to design a data integration system for Pfizer’s biologists three years ago, senior bioinformatics research investigator Robert Peitzsch took care to abide by an important design principle that many informatics professionals forget: keep it simple, stupid.
“I think sometimes we bioinformaticians get hung up on the nifty technical stuff,” Peitzsch said, “but it always comes back to the biology, and it must — that’s why it’s called bioinformatics.”
With that guiding principle clearly in view, Peitzsch and his colleagues at Pfizer’s Global Research and Development Laboratories in Groton, Conn., set out to create a lightweight, “low overhead” system that would allow Pfizer researchers to access biological information from disparate resources. Rather than funnel a dozen public and private databases into a huge data warehouse built around a relational database system, the team chose to leave the resources in their native format and create a navigation layer to merge the data into overview reports for each gene. Since Pfizer was mirroring the public databases in house anyway, that approach made more sense, Peitzsch said.
A data warehouse simply wasn’t an option, Peitzsch said, because of the daunting challenges of keeping the information up to date. The decision to create a “distributed gene annotation warehouse” not only eliminated those obstacles, but also offered several key advantages, such as keeping biological data in its original format. “If you’re using DNA Star and you want to pull in a GenBank entry, it’s expecting a GenBank entry — it’s not expecting something that’s spewed out of a relational database,” Peitzsch said.
Accessing the data directly from its original source also keeps important contextual information intact, he added: “You don’t want to get into massaging that information because you might lose some of the nuance, some of the context that you would get if you just read it straight from the database.” Finally, the system required only “minimal infrastructure” to put in place, Peitzsch said.
The navigation layer that underlies the distributed warehouse is made up of separate agents for each data source, which extract information out of each database and put it into “a very simple” XML, Peitzsch said. In order to link the different data sources, Peitzsch and his colleagues built a “thesaurus” of identifiers for each database based on the system of cross-references that the public resources already use. If the term A in one database refers to identifier B in another database, and entry C in a third database also refers to B, then links can be made between A, B, and C, he explained.
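The linking rule Peitzsch describes (A refers to B, C refers to B, so A, B, and C belong together) can be sketched as a small union-find structure. This is only an illustration of the idea, not Pfizer’s implementation: the `Thesaurus` class, its method names, and the `(database, identifier)` tuples are all hypothetical.

```python
class Thesaurus:
    """Hypothetical sketch: groups identifiers linked by cross-references."""

    def __init__(self):
        # Maps each (database, identifier) pair to its parent in the group tree.
        self._parent = {}

    def _find(self, key):
        """Return the representative of key's group, registering key if new."""
        self._parent.setdefault(key, key)
        root = key
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the walk directly at the root.
        while self._parent[key] != root:
            next_key = self._parent[key]
            self._parent[key] = root
            key = next_key
        return root

    def add_cross_reference(self, source, target):
        """Record that entry `source` refers to entry `target`."""
        root_source = self._find(source)
        root_target = self._find(target)
        if root_source != root_target:
            self._parent[root_source] = root_target

    def synonyms(self, key):
        """Return every known identifier linked, directly or not, to key."""
        root = self._find(key)
        return {k for k in list(self._parent) if self._find(k) == root}


thesaurus = Thesaurus()
thesaurus.add_cross_reference(("db1", "A"), ("db2", "B"))  # A refers to B
thesaurus.add_cross_reference(("db3", "C"), ("db2", "B"))  # C also refers to B
linked = thesaurus.synonyms(("db1", "A"))
```

Here `linked` contains all three entries, even though A and C never reference each other directly. Only the cross-references need to be stored, which is consistent with the point that the underlying records stay in their native format.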
According to Peitzsch, a “nifty side effect” of the system is that “the data is always up to date; there’s no additional processing needed beyond the standard indexing that one has to do for data retrieval and keeping the thesaurus up to date.”
In addition, the database calls can be run in parallel, which speeds up the searching process considerably — it takes the same amount of time to perform 50 calls as it does to perform five, Peitzsch said.
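A rough sketch of how per-source agents might be run in parallel and merged into a simple XML report follows. Everything here is assumed for illustration: the agent functions, the source names, the `gene_report` and `entry` element names, and the use of a thread pool are not taken from Pfizer’s system. The sleep stands in for a database lookup.

```python
import time
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def make_agent(source_name, delay=0.05):
    """Build a hypothetical agent that 'queries' one data source."""
    def agent(gene_id):
        time.sleep(delay)  # stand-in for the I/O wait of a real lookup
        return {"source": source_name,
                "summary": f"{source_name} record for {gene_id}"}
    return agent


def build_report(gene_id, agents):
    """Call every agent concurrently and merge the results into one XML tree."""
    workers = max(1, len(agents))  # one thread per agent, at least one
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the agents list.
        results = list(pool.map(lambda agent: agent(gene_id), agents))
    report = ET.Element("gene_report", id=gene_id)
    for result in results:
        entry = ET.SubElement(report, "entry", source=result["source"])
        entry.text = result["summary"]
    return report


# Hypothetical source names, used only to exercise the sketch.
agents = [make_agent(name) for name in ("sourceA", "sourceB", "sourceC")]
report = build_report("GENE1", agents)
xml_text = ET.tostring(report, encoding="unicode")
```

With one worker per agent, the total wait is roughly that of the slowest single call rather than the sum of all calls. That only holds while the calls are dominated by waiting on I/O and the machine can sustain that many concurrent requests.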
So far, Peitzsch said he has tallied up to 1,800 hits in one day for the distributed warehouse from biologists across Pfizer’s global R&D operations. Researchers running microarray experiments find the system especially helpful, he said, because they end up with a “short” list of 50 or so genes that they need to prioritize based on the collective knowledge contained within public, proprietary, and third-party databases.
Prior to implementing the distributed warehouse, these researchers had to look up each gene, individually, across a dozen different resources. In many cases, Peitzsch said, biologists were “put off” by the wealth of genomic data available in the public domain: “They didn’t know which one to go to, they didn’t know where to start, so they didn’t even bother.”
The key to the system’s success across Pfizer, according to Peitzsch, is its simplicity: It does one thing — information merging — and it does it well. Downstream analysis and other bioinformatics processes are not part of the warehouse. “It’s an overview, so once you have that you can jump off and do more in-depth research,” he said. “If you’re going to do more computational analyses, then there are better ways for going about it than this, but for just straight merging, this works very, very well.”