A pair of concurrent Cambridge Healthtech Institute meetings held in Baltimore, Md., last week highlighted the fuzzy line that divides data integration and data analysis at biopharmaceutical companies today: As speakers in the Data Visualization and Interpretation meeting were pointing out that effective data analysis isn’t possible without ready access to heterogeneous information, their counterparts down the hall in Data Integration for the Pharmaceutical Industry were explaining how the latest integrated architecture isn’t of any use unless bench scientists can manipulate the data to get results.
The two perspectives provided some consensus on the requirements for a fully outfitted research informatics architecture, even if they differed a bit on the details of how to put it together. Researchers everywhere want the same thing: instant access to untold amounts of information — from sequence data and chemical structures to text in the scientific literature and patent databases — along with easy-to-use tools to mine it all from a single interface in as little time as possible.
John Weinstein, head of the genomics and bioinformatics research group at the National Cancer Institute and avid curator of the growing “omics” lexicon, even trotted out a new term to describe the work his lab is doing to simultaneously integrate and analyze heterogeneous types of biological data — integromics.
To Federate or to Assimilate?
For most large research IT groups, the first step in building this integromics infrastructure is planning the underlying architecture to integrate internal and external databases. There are essentially two choices: a data warehouse, or “assimilated,” approach, in which external data is installed behind company firewalls along with internal data in large repositories; or a distributed, or “federated,” approach, which uses middleware to fetch only the data of interest from external resources that stay put.
Judging from the speakers at the CHI meeting, federation is winning out, but generally as part of a larger architecture that still relies on some degree of warehousing. For example, Mark Jury, associate director of research information services at Amgen, said that as his company began relying more and more on external data, it became “impractical to host hundreds of terabytes of public data internally.” The company mapped out a system to continue warehousing its internal data by domain type, while turning to a third party to provide the middleware to federate external sources. After extensive evaluation of a number of middleware options, including IBM’s DiscoveryLink, Lion’s DiscoveryCenter/SRS, and WebLogic, Jury said the company opted for GeneticXchange’s DiscoveryHub. One of the key reasons for this choice, he said, was that his team could easily create and maintain its own wrappers for the product, unlike most of the other options, which required more vendor involvement to write new wrappers. Interestingly, Jury said, cost was not that big of a factor in Amgen’s final criteria. “It was such a small part of the overall equation that the cost of software was close to negligible,” he said.
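The hybrid pattern described here, with internal data warehoused and external sources reached through wrappers, can be illustrated with a minimal Python sketch. It is not how DiscoveryHub or any other product named above works. The accession, field names, and wrapper functions are hypothetical, and the "remote" sources are stubbed with in-memory dictionaries so the example runs on its own.

```python
from typing import Callable, Dict, List

# Hypothetical internal warehouse: records already copied behind the firewall.
WAREHOUSE: Dict[str, dict] = {
    "ACC_0001": {"assay_hits": 3},
}

# Hypothetical wrappers. Each one hides the details of a single external
# source and returns only the fields requested for one accession.
def sequence_wrapper(accession: str) -> dict:
    remote = {"ACC_0001": {"symbol": "GENE_A"}}  # stand-in for a remote query
    return remote.get(accession, {})

def literature_wrapper(accession: str) -> dict:
    remote = {"ACC_0001": {"citations": 2}}  # stand-in for a remote query
    return remote.get(accession, {})

WRAPPERS: List[Callable[[str], dict]] = [sequence_wrapper, literature_wrapper]

def federated_lookup(accession: str) -> dict:
    """Merge the warehoused record with fields fetched on demand."""
    record = dict(WAREHOUSE.get(accession, {}))  # copy, so the warehouse is untouched
    for wrapper in WRAPPERS:
        record.update(wrapper(accession))
    return record
```

In this toy setup, adding a new external source means writing one more wrapper function and appending it to WRAPPERS, which is the kind of in-house maintenance Jury cited as a deciding factor.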
Bristol-Myers Squibb also mulled the federation vs. assimilation question for a bit before deciding on a similar hybrid approach. Donald Jackson, senior research investigator in applied genomics at BMS, said the company opted to assimilate its data sources into three main data warehouses, and then federate them by data type. Jackson noted, however, that bringing data sources together isn’t the company’s primary integration challenge: Due to the “plethora” of names for the same sequence records in the multitude of public and proprietary sources, “figuring out corresponding records is very difficult,” Jackson said. The company has partially addressed this problem by standardizing all its sequence data on RefSeq, he said, because it’s a stable, curated, and open standard. “We can easily tell our alliance partners to use the RefSeq accession number when they provide us with data,” he said. Another overlooked issue in the data integration discourse, Jackson noted, is that “once the data is integrated, you need to get it in the users’ hands,” a step that isn’t as easy as it might sound. The BMS bioinformatics team recently began creating what it calls “genomic dossiers” to consolidate all available genomic information on program targets in individual reports that it delivers in person to program heads, in order to ensure the bioinformatics group is meeting the program’s needs. “Integrating data is a means to an end,” he said. “You have to get the information to the people who need it, by hand if need be.”
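The record-matching problem Jackson described can be sketched as a lookup from each source’s local identifier to one canonical accession. The Python below is an illustration only, not BMS’s system. The alias table, source names, and accessions are invented placeholders, not real RefSeq records.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical alias table: (source, local identifier) -> canonical accession.
ALIAS_TO_CANONICAL: Dict[Tuple[str, str], str] = {
    ("vendor_a", "clone_17"): "NM_TOY0001",
    ("in_house", "SEQ-0042"): "NM_TOY0001",
    ("vendor_b", "probe_9"): "NM_TOY0002",
}

def canonical_accession(source: str, local_id: str) -> Optional[str]:
    """Return the canonical accession for a source-specific name, if known."""
    return ALIAS_TO_CANONICAL.get((source.lower(), local_id))

def group_by_canonical(
    records: List[Tuple[str, str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """Group source records that refer to the same canonical sequence."""
    grouped: Dict[str, List[Tuple[str, str]]] = {}
    for source, local_id in records:
        accession = canonical_accession(source, local_id)
        # Records with no known mapping are kept visible rather than dropped.
        key = accession if accession is not None else "UNMAPPED"
        grouped.setdefault(key, []).append((source, local_id))
    return grouped
```

Asking partners to supply the canonical accession up front, as Jackson said BMS does, amounts to skipping the alias table for new data.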
David Silberberg, senior computer scientist at Johns Hopkins University’s Applied Physics Lab, presented a new approach to data federation that doesn’t rely on wrappers to access heterogeneous data sources. Instead, the system, called ADINA (Architecture for Distributed Information Access), uses a simplified version of SQL called RBQL (Role-Based Query Language) for what Silberberg called “automatic query formulation” — a heuristics-based approach that can automatically detect and navigate various database schemas and data structures. Essentially, wrappers and metadata aren’t necessary because the data sources can “describe themselves” to the system, Silberberg said. The technology was originally developed for the National Imagery and Mapping Agency and the Department of Defense, but Silberberg said the project recently received a grant from the state of Maryland to transition it to bioinformatics applications. In addition, an early-stage company called BioSequent has licensed the technology and plans to commercialize it early next year.
A Few New Twists on Not-So-Old Tools
When it comes to visualizing and analyzing data, a number of researchers are stretching the boundaries of currently available technologies into novel application areas. For example, Michael Liebman, director of computational biology and biomedical informatics at the Abramson Cancer Center of the University of Pennsylvania, is using LexiQuest Mine text-mining software from SPSS to mine the biomedical literature and build an ontology of breast development to support breast cancer research. “Disease is a process,” said Liebman, but medical ontologies like UMLS only have terms for physiological function or pathological function — not the process itself. In an effort to quantify information such as aging and environmental factors that also play a role in disease, Liebman and his team mined a 140-gigabyte database of full-length journal articles to generate a “concept map” of physiological development. The result, which Liebman dubbed a “tempology” because it has a temporal component, takes the traditional “is a” and “part of” terms of the Gene Ontology and turns them into “when is a” and “when part of,” he said. His team is using the new ontology in combination with GO and information on cancer progression to inform its search for diagnostic biomarkers that are “mechanistic, not correlative,” Liebman said.
In another twist on text-mining technology, Damien Chaussabel, a research investigator at the National Institute of Allergy and Infectious Diseases, described how he is using WordStat, a text-mining program from Provalis Research, to extract co-occurring terms from Medline abstracts, filter them by relevance, and assign frequency values to them. Then, Chaussabel said, he feeds the Excel table of genes and term occurrences into the TreeView clustering software package to create a heat map for mining functional relationships. “You can find new groups of genes associated by keywords by visually browsing the heat map,” Chaussabel said. By placing the literature profiling map in the same view as the gene expression profiling map for the same set of genes, researchers can predict likely functions for genes that do not appear in the literature at all just by seeing where they land after clustering, he said.
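The gene-by-term table at the center of this workflow can be approximated in a few lines of Python. This is a rough sketch, not WordStat’s method: it assumes a term’s frequency for a gene is the fraction of that gene’s abstracts mentioning the term, and it uses a simple threshold (the hypothetical min_freq parameter) as a stand-in for relevance filtering.

```python
import re
from typing import Dict, List, Tuple

def term_frequencies(abstracts: List[str], terms: List[str]) -> Dict[str, float]:
    """Fraction of abstracts mentioning each term (whole word, case-insensitive)."""
    if not abstracts:
        return {term: 0.0 for term in terms}
    freqs: Dict[str, float] = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        hits = sum(1 for text in abstracts if pattern.search(text))
        freqs[term] = hits / len(abstracts)
    return freqs

def literature_profile(
    gene_abstracts: Dict[str, List[str]],
    terms: List[str],
    min_freq: float = 0.5,
) -> Tuple[List[str], Dict[str, List[float]]]:
    """Build a gene-by-term frequency table, keeping only informative terms."""
    rows = {gene: term_frequencies(texts, terms)
            for gene, texts in gene_abstracts.items()}
    # Keep a term only if it reaches min_freq for at least one gene.
    kept = [t for t in terms if any(rows[g][t] >= min_freq for g in rows)]
    table = {gene: [rows[gene][t] for t in kept] for gene in rows}
    return kept, table
```

The resulting table has one row per gene and one column per retained term, which is the shape a clustering or heat-map tool expects as input.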
Another novel use for clustering software came from Andre Nantel, research officer at the Biotechnology Research Institute at the National Research Council of Canada, who is using GeneSpring to visualize thousands of cross-species BlastP searches. By replacing the gene expression ratios with the E-values from the BlastP searches, which indicate how significant each match is, Nantel could easily compare the homology of the yeast proteome against eight other proteomes using self-organizing maps, heat maps, and other tools in the GeneSpring package, he said.
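One way to prepare E-values for an expression-style heat map is to convert them to a negative log scale so that stronger matches get larger values. The article does not say which transformation Nantel used, so the Python below is an assumption. The cap of 200 for E-values reported as zero, the floor of 0 for weak or missing hits, and the input layout are all illustrative choices.

```python
import math
from typing import Dict, List

def evalue_to_score(evalue: float, cap: float = 200.0) -> float:
    """Map a BLAST E-value to -log10(E), clamped to the range [0, cap]."""
    if evalue <= 0.0:
        # Extremely small E-values are often reported as 0; treat as the cap.
        return cap
    return max(0.0, min(cap, -math.log10(evalue)))

def homology_matrix(
    best_evalues: Dict[str, Dict[str, float]],
    proteomes: List[str],
) -> Dict[str, List[float]]:
    """One row per query protein, one column per target proteome.

    best_evalues[protein][proteome] holds the best (smallest) E-value found;
    a missing entry means no hit and is scored 0.0.
    """
    matrix: Dict[str, List[float]] = {}
    for protein, by_proteome in best_evalues.items():
        matrix[protein] = [
            evalue_to_score(by_proteome[p]) if p in by_proteome else 0.0
            for p in proteomes
        ]
    return matrix
```

Each row can then be treated like an expression profile, with proteomes in place of experimental conditions.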
For those looking for the next level of visualization tools, Georges Grinstein from the University of Massachusetts, Lowell, said that his RadViz high-dimensional radial visualization software — the underlying technology that launched the now-defunct AnVil — would be publicly available in February. The package will contain over 70 visualization tools for multidimensional analysis, Grinstein said, but the novelty of the technology is not without a price. When asked if the package would be user-friendly, Grinstein answered bluntly, “No.”