Data integration and pathway analysis were the hot topics at the Jackson Laboratory/TIGR Computational Genomics conference, held in Cambridge, Mass., Oct. 8-11.
Data integration is “sort of the apple pie” of bioinformatics — everybody agrees that it’s a good thing, said Mark Gerstein, principal investigator of the bioinformatics group at Yale University. But for many researchers, “integration” just means that they consulted several different data sources before arriving at a hypothesis, Gerstein said. “Bioinformatics can add a lot of value by putting together a lot of different types of data in non-trivial ways,” he said, but the challenge is unifying those data types in a “mathematical formalism” rather than merely as a nebulous collection of information.
Eric Lander, director of the Whitehead Institute/MIT Center for Genome Research, provided an illustration of the new biological discoveries that can emerge from the integration of various data types. In a recent project that studied type 2 diabetes, initial microarray experiments using muscle biopsy samples of 17 diabetics and 18 normal subjects were disappointing. “We got bupkis,” Lander said — no genes were significantly differentially expressed between the two groups. Opting to look at “gene sets” instead of single genes, the Whitehead team combined information from manually curated pathways and clusters in the public domain, as well as from textbooks, the scientific literature, LocusLink, and Affymetrix NetAffx annotations to create gene sets for several key pathways. When they reclustered the data from the microarray experiment by gene set instead of by single genes, one group stood out, Lander said: those genes associated with oxidative phosphorylation.
The experience, he said, taught the team that it’s often more important to look for modest changes across many genes, “rather than a large fold change in one gene.” The Whitehead researchers are also using data from cell-based models and human genetic studies as part of this research. The knowledge gained by integrating new data types into the process, Lander said, highlights the fact that up until now, “We haven’t been very sensitive listeners to what our data are telling us.”
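To make the idea of "modest changes across many genes" concrete, here is a minimal sketch of a gene-set-level permutation test. It is an illustration only, not the Whitehead team's actual procedure, which the talk did not spell out. The function names, the per-gene scores, and the toy data are all hypothetical.

```python
import random
import statistics


def gene_set_score(gene_scores, gene_set):
    """Mean per-gene score over the members of a gene set that were measured."""
    members = [gene_scores[g] for g in gene_set if g in gene_scores]
    return statistics.mean(members)


def permutation_p_value(gene_scores, gene_set, n_perm=2000, seed=0):
    """Share of random same-size gene sets scoring at least as high as the observed set."""
    rng = random.Random(seed)
    genes = list(gene_scores)
    size = sum(1 for g in gene_set if g in gene_scores)
    observed = gene_set_score(gene_scores, gene_set)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, size)
        if statistics.mean(gene_scores[g] for g in sample) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data (invented): 300 genes with noise-only scores,
# 30 of which receive a modest, consistent shift.
toy_rng = random.Random(42)
toy_scores = {f"gene{i}": toy_rng.gauss(0.0, 1.0) for i in range(300)}
toy_set = [f"gene{i}" for i in range(30)]
for gene in toy_set:
    toy_scores[gene] += 1.0

p_value = permutation_p_value(toy_scores, toy_set)
print(f"gene-set permutation p-value: {p_value:.4f}")
```

In this toy setup no individual gene needs to stand out; the set is flagged because its members shift together, which is the pattern Lander described.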
Data providers also took the stage to discuss their approaches to integration. Simon Twigger of the Medical College of Wisconsin described how the Rat Genome Database project is working to integrate phenotype data with genomic data. One resource, PhysGen (http://pga.mcw.edu/), stores around 9,000 physiological data points for 19 “consomic” rat strains, in which single chromosomes are swapped between two rat strains in order to study the effect of the chromosome on a standardized background. Genotypic information is available for each of the strains as well, Twigger said, and the team is currently combining data from rat microarray experiments with information on strains and phenotype.
Tatiana Tatusova from the National Center for Biotechnology Information explained how NCBI is relying on the recently upgraded Entrez interface (http://www.ncbi.nlm.nih.gov/Entrez/) to integrate its various data resources by providing “a single engine query to search across all the databases.” Tatusova said that a new database, called Entrez Gene, would be available in the next few months to replace the information that is currently in LocusLink. Because LocusLink doesn’t use the Entrez interface, she said, the new resource will help unify that data with NCBI’s other databases. In addition, she said, LocusLink is currently “biased toward eukaryotes,” while Entrez Gene will offer more coverage across species.
Pathways via Integration
Francesca Ciccarelli, from Peer Bork’s group at the European Molecular Biology Laboratory, described a method that relies on integrated data sets to infer functional links between metabolic pathways. STRING (Search Tool for the Retrieval of Interacting Genes/Proteins, http://www.bork.embl-heidelberg.de/STRING/) is a precomputed database that uses gene neighborhood information in combination with phylogenetic profiling and gene fusion information to predict functional associations among genes and proteins. Associations are provided along with a likelihood score so that users can set more or less stringent parameters. The database currently contains 356,775 genes in 110 prokaryotic genomes. In a recent study, Ciccarelli said, STRING identified 38 novel associations that were not in the KEGG database or the scientific literature.
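As a rough illustration of how a likelihood score lets users choose more or less stringent parameters, the sketch below filters a list of predicted associations by a minimum score. The record layout, gene names, and score values are invented for the example and do not reflect STRING's actual data format or scoring scheme.

```python
from typing import List, NamedTuple


class Association(NamedTuple):
    gene_a: str
    gene_b: str
    score: float  # hypothetical likelihood score between 0 and 1


def filter_associations(associations: List[Association], min_score: float) -> List[Association]:
    """Keep associations at or above min_score, strongest first."""
    kept = [a for a in associations if a.score >= min_score]
    return sorted(kept, key=lambda a: a.score, reverse=True)


# Invented example records.
predicted = [
    Association("geneA", "geneB", 0.92),
    Association("geneA", "geneC", 0.41),
    Association("geneB", "geneD", 0.67),
]

strict = filter_associations(predicted, min_score=0.9)
lenient = filter_associations(predicted, min_score=0.4)
print(len(strict), "association(s) at the strict cutoff;", len(lenient), "at the lenient cutoff")
```

Raising the cutoff trades coverage for confidence, so the right threshold depends on whether the user is generating hypotheses or looking for well-supported links.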
Another pathway analysis tool, from Ming Yi at the University of Texas Southwestern Medical Center, combines cluster-based gene expression analysis with defined biological pathways. The system, called WholePathwayScope (http://wps.swmed.edu/), stores known pathways in files that users can search for specific experimental conditions. After clustering the list of genes from an experiment, the user can map them onto the pathway, Yi said.
IBM’s Life Sciences group is also moving into pathway data analysis. Barbara Eckman, senior consulting IT architect for IBM Life Sciences, provided a glimpse of a prototype system the company is developing for systems biology data management, which uses connection graphs to represent biological pathways. The system is built on IBM’s DB2 relational database, which is “not sufficient alone” for managing systems biology data, Eckman said, but has proved useful for storing large data sets to represent connection graphs so that you can “retrieve subgraphs and do operations outside the database.” The system is currently being tested at a “large West Coast” research institute to predict networks in microbial genomes, Eckman said.
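The pattern Eckman described, storing a connection graph in relational tables and then retrieving subgraphs for processing outside the database, can be sketched roughly as follows. This is not IBM's prototype: SQLite stands in for DB2 so that the example is self-contained, and the table layout, node names, and helper functions are all assumptions made for illustration.

```python
import sqlite3
from collections import deque


def build_demo_db():
    """Create an in-memory edge table holding a tiny invented pathway graph."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (source TEXT, target TEXT, kind TEXT)")
    conn.executemany(
        "INSERT INTO edge VALUES (?, ?, ?)",
        [
            ("A", "B", "activates"),
            ("B", "C", "activates"),
            ("C", "D", "inhibits"),
            ("X", "Y", "activates"),
        ],
    )
    return conn


def fetch_subgraph(conn, nodes):
    """Retrieve only the edges whose two endpoints are both in `nodes`."""
    nodes = list(nodes)
    placeholders = ",".join("?" for _ in nodes)
    sql = (
        "SELECT source, target FROM edge "
        f"WHERE source IN ({placeholders}) AND target IN ({placeholders})"
    )
    return conn.execute(sql, nodes + nodes).fetchall()


def reachable(edges, start):
    """Breadth-first search done in application code, outside the database."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


conn = build_demo_db()
edges = fetch_subgraph(conn, ["A", "B", "C"])
print(sorted(reachable(edges, "A")))
```

The database handles storage and selection of the relevant edges, while the traversal runs in ordinary application code, which matches the division of labor suggested by the remark that the relational store is "not sufficient alone."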