The Johnson & Johnson Pharmaceutical Research and Development group in San Diego had the usual integration problem of linking several disparate public and private databases, with the additional burden of a cDNA chip database holding 20,000 microarray experiments. J&J scientists wanted not only to integrate this data so that they could easily find and retrieve new information, but also to add a new level of automation to the process.
With this goal in mind, J&J’s Heng Dai and a team of six other developers created an in-house system they call GeneView, which monitors public and private data sources nightly. When the system detects that a source has been updated, it automatically downloads and processes the updated data, so that researchers’ local copies stay synchronized with public sources, including LocusLink, UniGene, RefSeq, HomoloGene, OMIM, the Gene Ontology, SwissProt, and InterPro, as well as proprietary and third-party sources, such as Incyte’s LifeSeq.
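The article does not say how GeneView detects that a source has changed. As a rough illustration only, the following Python sketch shows one common way a nightly job could do it: fetch each source, compare a content digest with the one recorded on the previous run, and reprocess only what differs. The function name `sync_sources`, the `fetchers` mapping, and the `process` callback are hypothetical and are not part of GeneView; a production system might instead check a release number or timestamp before downloading anything.

```python
import hashlib
from typing import Callable, Dict, List


def sync_sources(
    fetchers: Dict[str, Callable[[], bytes]],
    seen_digests: Dict[str, str],
    process: Callable[[str, bytes], None],
) -> List[str]:
    """Reprocess every source whose content changed since the last run.

    fetchers     -- hypothetical mapping of source name to a download function
    seen_digests -- digests recorded on earlier runs (updated in place)
    process      -- hypothetical callback that loads the data into the local copy
    """
    refreshed = []
    for name, fetch in fetchers.items():
        payload = fetch()
        digest = hashlib.sha256(payload).hexdigest()
        # Assumption: a changed digest is treated as "the source was updated".
        if seen_digests.get(name) != digest:
            process(name, payload)
            seen_digests[name] = digest
            refreshed.append(name)
    return refreshed
```

Run once per night, a loop like this leaves unchanged sources alone and returns the names of the sources it refreshed.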
The team developed a gene mapping technique that Dai said overcomes a principal obstacle in data integration: discrepancies in gene identifiers between different systems. The approach cycles between three steps (a gene identifier match, a cluster-based match, and a BLAST match) to map genes from any source to a central database of genes of interest, each with a unique J&J identifier. Users can then track a single gene across a set of linked databases with that identifier via a web interface. GeneView “cards” provide a single page with relevant annotation information.
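The three matching steps can be pictured as a fallback chain. The Python sketch below is a simplified illustration, not J&J's patented method: it tries an identifier lookup first, then a cluster lookup, then a sequence comparison. All class, function, and field names are hypothetical, and `identity_fraction` is a toy stand-in for a real BLAST search. The article says the approach "cycles" between the steps but gives no details, so this sketch makes a single pass.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class SourceRecord:
    source_id: str                     # identifier used by the external source
    cluster_id: Optional[str] = None   # cluster membership, if the source has one
    sequence: Optional[str] = None     # nucleotide sequence, if available


@dataclass
class CentralIndex:
    by_identifier: Dict[str, str] = field(default_factory=dict)  # source id -> central id
    by_cluster: Dict[str, str] = field(default_factory=dict)     # cluster id -> central id
    sequences: Dict[str, str] = field(default_factory=dict)      # central id -> sequence


def identity_fraction(a: str, b: str) -> float:
    """Toy similarity score (NOT BLAST): share of positions that agree."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def map_record(
    record: SourceRecord,
    index: CentralIndex,
    similarity: Callable[[str, str], float] = identity_fraction,
    min_score: float = 0.9,  # assumed cutoff, chosen only for illustration
) -> Tuple[Optional[str], str]:
    """Return (central_id, method), or (None, "unmapped") if every step fails."""
    # Step 1: direct gene identifier match.
    if record.source_id in index.by_identifier:
        return index.by_identifier[record.source_id], "identifier"
    # Step 2: cluster-based match.
    if record.cluster_id is not None and record.cluster_id in index.by_cluster:
        return index.by_cluster[record.cluster_id], "cluster"
    # Step 3: sequence match, keeping the best-scoring central gene.
    if record.sequence:
        best_id, best_score = None, 0.0
        for central_id, seq in index.sequences.items():
            score = similarity(record.sequence, seq)
            if score > best_score:
                best_id, best_score = central_id, score
        if best_id is not None and best_score >= min_score:
            return best_id, "sequence"
    return None, "unmapped"
```

Ordering the steps from cheapest to most expensive means a sequence search runs only for records that the identifier and cluster lookups cannot resolve.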
A similar system is under development at the Genomics Institute of the Novartis Research Foundation, with several key differences, according to developer David Block. While Dai’s team has filed for a patent on its method, Block said his group is building its integration system with open source components such as BioSQL, BioPerl, GAME, and the Apollo genome browser. The complete system, called SymGene, will also be available under an open source license once it is completed, Block said. Novartis developers are permitted to contribute to open source projects, Block noted, adding that the company “understands that it’s developing drugs, not software.”
SymGene also preserves the structure of the original data source rather than “flattening” it in the integration process, Block said. However, the end result is the same: a non-redundant set of genes mapped to chromosomes, with annotations of interest combined in a single view.
The Novartis system is still in development, but the J&J system has already annotated over 50,000 unique clones in its proprietary microarray database, according to Dai, and has integrated data from Affymetrix and cDNA microarrays as well as several types of microarray analysis software. Future plans include adding integration with Lion’s SRS and an annotation alerting system.
Despite the companies’ different software development paths, their parallel solutions to the same problem suggest they have much more in common than their software distribution plans imply.
— BT