Bioinformatics data integration solutions can be as diverse as the types of information they are intended to bring together. But a recent meeting on the subject, Barnett International’s Bioinformatics and Data Integration conference held in Boston last week, highlighted an emerging trend in the effort to merge biology’s disparate data streams: Researchers are beginning to anchor integration projects in their growing knowledge of biological pathways.
A case in point is the approach taken by Paradigm Genetics.
According to Paradigm bioinformatics scientist Keith Allen, the traditional view of integration — gathering the data behind a unified query interface via federation or warehousing — is a “solved problem.” Paradigm designed its LIMS to readily handle this “level-zero” integration, he said. What the company wanted was to “move beyond the industry standard and actually do something with the integrated data set.”
The key to extracting knowledge from multiple data sets, according to Allen, is tying the data together via shared attributes, or “hooks.” Because pathway annotation is shared by various data sets — in particular the very different territories of gene expression data and metabolomic data — Paradigm hit upon the concept of “pathway linkage” to move seamlessly between the two.
As an illustration of the technique, which is still in the pilot project stage, Allen discussed a toxicity study that compared four human antifungal drugs applied to yeast. Gene expression and metabolic profiling data streams provided different, and incomplete, views of the experiment. However, using KEGG and the YPD database from Incyte Genomics subsidiary Proteome to map gene expression data and biochemical compound data onto pathways, Allen’s group was able to combine the data to detect a toxic side effect in one of the drugs that would have gone undetected using either data stream independently.
Allen noted, however, that the pathway linkage approach alone is not enough. The company also relies on “coherent data sets” to combine separate data streams in a statistically balanced manner. Common units (the number of standard deviations from a matched control) permit gene expression data to be viewed in the context of metabolomic data, and vice versa, he said.
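The two ideas Allen described, common units and pathway linkage, can be illustrated with a minimal Python sketch. It is not Paradigm's implementation. The gene, compound, and pathway names, the annotation tables, and the numbers are all hypothetical placeholders, and the function names z_score and link_by_pathway are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def z_score(treated, controls):
    """Express a treated measurement as the number of sample standard
    deviations it lies from the mean of its matched controls."""
    return (treated - mean(controls)) / stdev(controls)


def link_by_pathway(gene_z, compound_z, gene_pathways, compound_pathways):
    """Group gene and compound z-scores under the pathways they share.

    gene_pathways and compound_pathways map an identifier to a list of
    pathway names (hypothetical stand-ins for pathway annotation).
    """
    linked = defaultdict(lambda: {"genes": {}, "compounds": {}})
    for gene, z in gene_z.items():
        for pathway in gene_pathways.get(gene, []):
            linked[pathway]["genes"][gene] = z
    for compound, z in compound_z.items():
        for pathway in compound_pathways.get(compound, []):
            linked[pathway]["compounds"][compound] = z
    return dict(linked)


# Hypothetical annotation tables and measurements, for illustration only.
GENE_PATHWAYS = {"geneA": ["pathway1"], "geneB": ["pathway2"]}
COMPOUND_PATHWAYS = {"compoundX": ["pathway1"]}

gene_z = {"geneA": z_score(14.0, [9.0, 10.0, 11.0])}
compound_z = {"compoundX": z_score(2.0, [7.0, 8.0, 9.0])}

linked = link_by_pathway(gene_z, compound_z, GENE_PATHWAYS, COMPOUND_PATHWAYS)
```

Because both streams are expressed in the same unit, the gene and compound values grouped under one pathway can be compared directly, which is the point of the “coherent data sets” Allen described.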
Paradigm is developing the integration technology with an ATP grant, Allen said.
3rd Millennium Takes the Pathway, Too
Pathways are also the cornerstone for another ATP-funded integration project under development at bioinformatics consulting firm 3rd Millennium. The company’s PIMS (Pathway Information Management System) technology views pathways as “inherently integrated models,” according to Jack Pollard, principal investigator on the PIMS project.
The company has built a data model around the knowledge that pathways are nature’s mechanism for bringing together its own broad range of subcellular information streams. Objects — nucleic acids, proteins, compounds, atoms, and the like — are modeled separately from interactions between those objects — such as regulation or change — and from the context in which the interactions occur — including time, location, and phenotype. This information can then be plugged programmatically into pathway models to permit querying, visualization, and simulation of biological processes.
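A rough Python sketch shows what this separation of objects, interactions, and context might look like in code. It is an illustration only and does not reflect the actual PIMS schema. The class names, fields, and the interactions_in helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class BioObject:
    """A thing that takes part in a pathway."""
    name: str
    kind: str  # e.g. "nucleic acid", "protein", "compound"


@dataclass(frozen=True)
class Context:
    """Where and when an interaction occurs."""
    time: str
    location: str
    phenotype: str


@dataclass(frozen=True)
class Interaction:
    """A relationship between two objects, modeled apart from the objects."""
    source: BioObject
    target: BioObject
    kind: str  # e.g. "regulation"
    context: Context


@dataclass
class Pathway:
    """A named collection of interactions that can be queried."""
    name: str
    interactions: List[Interaction] = field(default_factory=list)

    def objects(self):
        """Return the distinct objects that appear in any interaction."""
        found = set()
        for interaction in self.interactions:
            found.add(interaction.source)
            found.add(interaction.target)
        return found

    def interactions_in(self, location):
        """Return the interactions whose context matches a location."""
        return [i for i in self.interactions if i.context.location == location]


# Hypothetical example data.
protein = BioObject("proteinP", "protein")
compound = BioObject("compoundC", "compound")
ctx = Context(time="t0", location="cytoplasm", phenotype="wild type")
pathway = Pathway("example pathway", [Interaction(protein, compound, "regulation", ctx)])
```

Keeping context on the interaction rather than on the object lets the same protein or compound appear in different pathways under different conditions without being duplicated.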
Like Allen, Pollard noted that the aim of bioinformatics integration isn’t simply “data exchange or navigating on gene identifiers,” but rather the ability to interpret information on the conceptual level in order to gain knowledge about the biological system under study.
An important aspect of the project, he noted, is “normalizing the semantics” of biological terminology with ontologies. The company is also working with PubGene to build a MedLine index of co-occurrences of gene names and symbols for mouse, rat, and human. In another partnership, 3rd Millennium worked with the Fred Hutchinson Cancer Research Center to develop the NeuMetrix repository of 15,000 neuroscience microarray experiments, which uses some components of the PIMS technology [BioInform 07-02-02]. The repository is now available online at: www.neumetrix.com/hdag-aims/exec/AIMSLogin.
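The idea of a co-occurrence index of gene symbols, mentioned above, can be sketched in a few lines of Python. This is a simplified illustration and not the method used by PubGene or 3rd Millennium. The function name cooccurrence_index and the sample abstracts are made up, and the case-insensitive token matching is a crude assumption that would need real synonym handling in practice.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_index(abstracts, symbols):
    """Count how many abstracts mention each pair of gene symbols.

    Matching is a plain case-insensitive token comparison; real gene
    name recognition is considerably harder.
    """
    wanted = {symbol.upper() for symbol in symbols}
    counts = Counter()
    for text in abstracts:
        tokens = {token.upper() for token in re.findall(r"[A-Za-z0-9]+", text)}
        found = sorted(tokens & wanted)
        # Each unordered pair is counted once per abstract.
        counts.update(combinations(found, 2))
    return counts


# Made-up abstracts, for illustration only.
abstracts = [
    "TP53 regulates MDM2 in mouse.",
    "MDM2 and TP53 interact.",
    "BRCA1 alone.",
]
index = cooccurrence_index(abstracts, ["TP53", "MDM2", "BRCA1"])
```

Sorting the matched symbols before pairing them means each pair is stored under a single key, whichever symbol appears first in the text.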
NCGR Extends PathDB into Integration Arena
Another approach to putting pathway knowledge to work came from Jeffrey Blanchard of the National Center for Genome Resources, who is expanding the capabilities of the NCGR’s PathDB pathway data repository into the integration realm.
PathDB takes a data warehouse approach, bringing together cellular network data on Arabidopsis thaliana and Saccharomyces cerevisiae from public, private, and specialized research databases. “Pathways are where it all comes together,” said Blanchard, noting that a recently upgraded data model developed to support PathDB could effectively support other, broader integration projects, because it already describes biological attributes, building blocks, biochemical entities, and the interactions between those entities.
In addition, Blanchard said the PathDB group is collaborating with the developers of the NCGR’s Isys application integration project to solve the “double-edged integration problem.” Isys allows users to pass genes, proteins, compounds, and other objects between PathDB and other applications and provides automatic loading of data to and from web pages. The latest version of PathDB (www.ncgr.org/pathdb) includes the Isys integration capability.
Blanchard said that future work for the PathDB project includes adding mammalian pathway data to the yeast and Arabidopsis pathway data. With knowledge of and interest in pathways growing rapidly, the NCGR will have its work cut out for it.
— BT