A recently funded collaborative project in biomolecular imaging information systems could usher in a powerful new application area for bioinformatics — as long as the fledgling initiative overcomes some formidable technical challenges.
The project, titled “Next-Generation Biomolecular Imaging and Information Discovery” and headed by the University of California, Santa Barbara, and Carnegie Mellon University, received a $9.4 million award from the National Science Foundation’s Information Technology Research program in late September. The goal of the project, which also involves research teams from the Massachusetts Institute of Technology and the University of California, Berkeley, is to develop new informatics technologies to extract, store, and retrieve information from biomolecular images captured with high-resolution microscopic techniques. The effort promises to automate the analysis of subcellular image data, which is currently done solely by eye, a painstaking process that the initiative’s leaders hope to render obsolete by driving biomolecular imaging into the age of high-throughput biology.
“It’s a little bit like Genbank before some of the sequence analysis tools existed,” said Robert Murphy, a professor of biological sciences and biomedical engineering who is leading Carnegie Mellon’s work on the project, for which the university was awarded $2.5 million. In the early days of sequence analysis, Murphy said, “people would have to try and classify genes by comparing their sequences by eye.” Genbank didn’t take off until informatics tools like Fasta and Blast were developed to let researchers effectively mine its data, Murphy explained, adding that biomolecular imaging is currently in a similar catch-22: “We’re developing the tools that can be used, but we still need to have the significant data collection efforts be completed,” he said.
Image analysis isn’t new to bioinformatics: it’s a crucial step in converting the fluorescence data from microarray experiments into computable form, and it is also a key element in 2D gel analysis. Even the cellular screening systems that pharmaceutical companies use in lead validation rely on image analysis techniques. What sets the new generation of biomolecular imaging informatics apart is the resolution of the images involved: sophisticated fluorescence microscopy can detect very subtle, molecular-scale changes at the subcellular level, directly in the cells themselves.
Location, Location, Location
According to Murphy, biomolecular imaging informatics tools will play an important role in advancing location proteomics, an often-overlooked branch of proteomics that is attempting to catalogue and characterize the location of every protein within each cell. While databases like Swiss-Prot contain “the names of organelles and maybe some descriptive text associated with each protein … the basic problem is that terms don’t exist to describe the complexity of the patterns that proteins show within the cells,” Murphy said. His team turned to fluorescence microscopy several years ago in an attempt to systematically map the location patterns. In this process, proteins are labeled with either antibodies or green fluorescent protein fusions, and then the cells are run through the microscope to determine where the proteins are located.
Murphy’s team has already made progress toward automating the analysis of these images. They have trained a classifier so that it can assign a new protein to one of the major organelle classes. Murphy said the classifier system is accurate more than 90 percent of the time — even for proteins that people are unable to distinguish visually. By collecting the fluorescence patterns from many proteins, Murphy has created a set of what he calls subcellular location features (SLFs), which are numerical descriptors that can be used to index databases in order to search for images that are similar to a query image — “the same way that we do a Blast search,” Murphy said.
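The retrieval step Murphy describes can be pictured as a nearest-neighbor search over feature vectors. The sketch below is a hedged illustration only: the feature values, image names, and Euclidean distance measure are invented for the example and are not Murphy’s actual SLF definitions.

```python
# Illustrative sketch of SLF-style image retrieval: each stored image has been
# reduced to a fixed-length numeric feature vector, and a query image is ranked
# against the collection by distance in feature space (loosely analogous to a
# Blast hit list). All vectors and names below are hypothetical toy data.
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_similarity(query, database):
    """Return (image_id, distance) pairs, most similar first."""
    return sorted(((img_id, euclidean(query, vec))
                   for img_id, vec in database.items()),
                  key=lambda pair: pair[1])

# Toy database of 3-dimensional feature vectors (real SLFs are far larger).
db = {
    "nucleolar_A": [0.90, 0.10, 0.20],
    "mito_B":      [0.10, 0.80, 0.70],
    "golgi_C":     [0.85, 0.15, 0.25],
}
hits = rank_by_similarity([0.88, 0.12, 0.22], db)
```

Here the query vector sits closest to "nucleolar_A", so that image tops the ranked list; in a real system the vectors would be computed from the fluorescence images themselves.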
Also in the pipeline is a system analogous to the clustering and classification algorithms used in microarray analysis to identify sets of co-regulated genes. “We’re working to develop methods for grouping all the proteins so that groups of proteins that share a subcellular pattern would be distinguishable,” Murphy said.
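The grouping Murphy describes is, in spirit, unsupervised clustering of the same feature vectors. The following is a minimal two-cluster sketch under that assumption; the protein feature values are invented, and the seeding and iteration scheme are simplifications, not the project’s actual algorithm.

```python
# Minimal sketch of grouping proteins by subcellular pattern: proteins whose
# feature vectors lie close together end up in the same cluster. The data and
# the simple iterative (k-means-style) procedure here are illustrative only.
def assign(points, centroids):
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            for p in points]

def two_cluster(points, iters=10):
    # Seed with the first and last points, assumed to lie in different groups.
    centroids = [list(points[0]), list(points[-1])]
    for _ in range(iters):
        labels = assign(points, centroids)
        for i in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # recompute centroid as the mean of its members
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return assign(points, centroids)

proteins = [[0.90, 0.10], [0.92, 0.08],   # e.g. two nuclear-pattern proteins
            [0.10, 0.85], [0.12, 0.90]]   # e.g. two mitochondrial-pattern proteins
labels = two_cluster(proteins)
```

The two nuclear-pattern proteins land in one group and the two mitochondrial-pattern proteins in the other, which is the kind of pattern-sharing grouping the quote describes.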
Before there can be the equivalent of Blast for biomolecular image data, however, there has to be an equivalent of Genbank, and that type of resource might not be available any time soon. Murphy said that there are a “handful” of scattered projects to store collections of images, but there is not yet any coordinated effort to create a centralized resource. In addition, those groups that are storing their image data generally rely on file names, captions, or other text-based systems to retrieve the information — not the images themselves.
According to Bangalore Manjunath, an electrical and computer engineering professor who is leading UCSB’s portion of the project with $6.9 million in NSF funding, the technical challenges of creating a repository for image data are considerable. While sequence data is a simple, linear string of nucleotides or amino acids, and gene expression can be stored as tables, a typical biomolecular image has around 256 quantitative descriptors associated with it, he said, and “existing databases don’t support efficient searching in multidimensional space.” Creating an indexing system for such high-dimensional data is a challenging process, he said, so one of the short-term goals of the project is to create a prototype “digital library” to store biomolecular imaging data in a way that will permit ready access. His team is also looking into adopting existing image data standards, such as MPEG-7, for biomolecular image data.
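One way to see the indexing problem Manjunath describes: a conventional database index orders records along a single key, but a query against roughly 256 descriptors must compare every dimension at once, so the only exact fallback is a linear scan of the whole collection. The sketch below (with invented image names and random descriptor values) shows that brute-force baseline, whose cost grows with collection size; avoiding it is what specialized high-dimensional index structures are for.

```python
# Brute-force nearest-neighbor search over 256-dimensional image descriptors.
# A B-tree-style index can't accelerate this, because no single sort order
# captures closeness in all 256 dimensions at once. Repository contents are
# randomly generated stand-ins for real image descriptors.
import random

DIMS = 256
random.seed(0)
repository = {f"img_{i}": [random.random() for _ in range(DIMS)]
              for i in range(500)}

def nearest(query, repo):
    """Scan every stored vector and return the id of the closest one."""
    def dist2(vec):
        return sum((q - x) ** 2 for q, x in zip(query, vec))
    return min(repo, key=lambda name: dist2(repo[name]))

query = list(repository["img_42"])  # a stored image should match itself
best = nearest(query, repository)
```

The scan is exact but touches all 500 records per query; at realistic collection sizes, that per-query cost is precisely the “efficient searching in multidimensional space” problem the project aims to solve.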
The UCSB researchers are focused on three main areas: the development of databases to store image data; pattern recognition and data mining techniques for analyzing image data; and designing new imaging instrumentation to speed the scanning process and generate higher-quality images. Manjunath said that one project goal is to apply atomic force microscopy to sub-angstrom resolution biomolecular image analysis.
Murphy, meanwhile, is focusing on improving the search and retrieval tools he has already developed, with the hope of accelerating large-scale collections of biomolecular image data. “These tools can be used just like Fasta and Blast were used before the genome sequences were finished,” he said. “They can be used to do comparisons, to find similar proteins, to test whether or not a particular drug has affected something, and they can be used to get answers to specific questions while we’re in the process of collecting data that can characterize all the proteins.”
Murphy’s immediate aims also include a dash of technological evangelism about the promise of biomolecular image informatics. “One of the goals is to expand awareness of these tools among the broader biological research community,” he said. Despite the relatively early stage of the project, he may not be a voice in the wilderness: GE’s recent acquisition of Amersham Biosciences was built on two shared technology areas, informatics and imaging [BioInform 10-20-03]. If that deal is a sign of things to come in commercial biomolecular imaging informatics, the NSF-funded project may be the first of many more such efforts in the public sector.