Predating IBM’s Life Sciences business unit by a good four years, the Computational Biology Center within IBM’s research division plays a vital role in the company’s life sciences strategy. Walking the line between pure exploratory research and early-stage product development, the group works quietly behind the scenes as a computational biology think tank with a watchful eye on market demands.
Organizationally, the 35 members of the globally distributed CBC make up an entirely separate entity from the life sciences business unit, but the ties between the two groups are strong. Carol Kovac, general manager of life science solutions, headed up the CBC before moving over to the business side, and Sharon Nunes, director of solution development for life sciences, started out at IBM Research as well.
Nunes now acts as the liaison between the business and research groups. Joe Jasinski, senior manager of the CBC, is her counterpart on the other side of the equation. The two meet regularly to make sure the goals of the research group are in line with those of the life sciences business and, conversely, to ensure the business side is up to speed on the newest challenges and technologies in computational biology research.
Unlike other corporate research labs, which have floundered financially in spite of their productivity, IBM Research is structured to ensure that a certain portion of the technology it is developing will be commercially relevant, Nunes said. While basic research is encouraged, other projects are carried out with the expectation that they will be on the market in one form or another within a three-to-five-year time frame. Nunes and Jasinski oversee a joint program between the CBC and the life sciences business group that splits the funding for several such projects.
The joint program works much like any funding program: researchers submit proposals to Nunes and Jasinski for consideration. The proposals are reviewed for technical merit and commercial relevance, and those that are funded are re-evaluated on a quarterly basis. The center’s researchers are currently at work on seven jointly funded programs covering the areas of data management, privacy and security of data, knowledge management, and text mining. Two of these projects support automatic schema mapping and new wrapper development for IBM’s DiscoveryLink middleware product.
Jasinski said that most of the algorithmic work that comes out of the CBC is not targeted for commercial development, since the company has pledged not to compete with its bioinformatics software partners. The broader scope of the CBC’s algorithmic work encompasses five key areas, Jasinski said: bioinformatics and pattern discovery for large-scale genomics research, protein structure prediction, software development for the Blue Gene supercomputer, data management, and gene expression analysis. The last of these is branching out into an emerging project in computational systems biology, he said.
BioInform recently visited the headquarters of IBM Research in Yorktown Heights, NY, where the bulk of the CBC researchers are housed, to catch up on some of the lab’s latest work. Key project updates are outlined below.
Teiresias Suite Continues to Grow
Isadore Rigoutsos, manager of the bioinformatics and pattern discovery group, heads one of the longest-running computational biology projects at IBM Research. Teiresias, a two-phase combinatorial algorithm for general-purpose pattern discovery that Rigoutsos first developed in 1996, has grown into a complete package of twelve genomics-based pattern recognition tools that are available online (http://cbcsrv.watson.ibm.com/Tspd.html).
In addition, the group has derived a set of amino acid patterns that it calls the Bio-Dictionary, and has used these patterns to annotate 76 complete genomes. The Bio-Dictionary and genome annotations are also available through the group’s website. Rigoutsos and his colleagues recently submitted a paper on their work using the Bio-Dictionary collection of patterns as a gene discovery method. The team ran the genefinder program on 17 genomes and found that 5 percent to 10 percent of the genes it predicted were strongly supported by databases but hadn’t been previously detected by other automated methods. While the results are promising, Rigoutsos expressed caution. “We need to remember that this is an automated tool, so we have to verify our results experimentally,” he said.
Rigoutsos said he doesn’t keep careful tabs on usage patterns, but estimated that around three to four users download various tools from the website per day. The group recently initiated a mirror site program for university research groups. Indiana University became the first to install its own Teiresias engine just over a month ago, and Rigoutsos said several more universities are set to follow.
Rigoutsos’ primary goal for the remainder of the year is to migrate the group’s genome annotations, which are currently in flat files, to IBM’s DB2 relational database “to permit complicated searches within groups of genomes, a task that is very cumbersome now.” In the meantime, he said, the group intends to continue adding new tools and annotated genomes to its website.
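To illustrate why a relational store makes cross-genome searches easier than flat files, here is a minimal sketch. It is not the group's schema or code: the tab-separated record layout, the table and column names, and the pattern identifiers are all hypothetical, and Python's built-in sqlite3 module stands in for DB2.

```python
import sqlite3

# Hypothetical flat-file records: genome, gene, pattern identifier (tab-separated).
# The layout and all identifiers are invented for illustration.
FLAT_FILE = """\
genomeA\tgene1\tP001
genomeA\tgene2\tP002
genomeB\tgene7\tP001
genomeB\tgene8\tP003
genomeC\tgene4\tP001
genomeC\tgene5\tP002
"""


def load_annotations(text):
    """Parse the flat-file text and load it into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")  # sqlite3 is a stand-in for DB2 here
    conn.execute("CREATE TABLE annotation (genome TEXT, gene TEXT, pattern TEXT)")
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", rows)
    return conn


def patterns_shared_by(conn, genomes):
    """Return the patterns that annotate at least one gene in every listed genome."""
    placeholders = ",".join("?" for _ in genomes)
    query = f"""
        SELECT pattern FROM annotation
        WHERE genome IN ({placeholders})
        GROUP BY pattern
        HAVING COUNT(DISTINCT genome) = ?
        ORDER BY pattern
    """
    return [row[0] for row in conn.execute(query, (*genomes, len(genomes)))]


conn = load_annotations(FLAT_FILE)
print(patterns_shared_by(conn, ["genomeA", "genomeB", "genomeC"]))  # ['P001']
print(patterns_shared_by(conn, ["genomeA", "genomeC"]))  # ['P001', 'P002']
```

With flat files, the same question would require scanning and cross-referencing every genome's file by hand; in a relational table it is a single grouped query.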
The Spotlight-Shy CASP Guy
One of the CBC’s most high-profile projects this year may be Ajay Royyuru’s work in protein structure prediction. Royyuru, manager of the structural biology group at IBM Research, leads a team of four researchers preparing for the upcoming Fifth Community-Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP5).
But while Royyuru is keen on improving his team’s methods for CASP5, he’s a reluctant player in an event that has drawn increasing attention from the scientific community as well as the media. The “fiercely competitive” nature of the experiment tends to detract from what he sees as its intended purpose. “I see CASP as being just one of several checkpoints to establish progress in the field,” he said. “What I’m most focused on is improving over what I did last time.”
Since CASP4, held in December 2000, the structural biology team has made an important discovery that it is counting on to greatly improve this year’s predictions. In work led by David Silverman, the IBM researchers found a pattern in the transition from hydrophobic amino acids on the inside of a protein to hydrophilic amino acids on its outside. The pattern reduces to a ratio that was found to be constant across all soluble, globular proteins in the Protein Data Bank, making the technique a viable scoring function for assessing the accuracy of predicted structures.
But while it is a useful complement to other methods, the technique does have its limitations. After testing the method on the Holm and Sander, Park and Levitt, and Baker protein decoy sets, Royyuru’s group determined that it is most effective on soluble, globular proteins of 70 amino acids or more. The smaller the protein, the worse the technique performs, Royyuru said, but since other current methods tend to work better on smaller proteins and degrade as protein size increases, he is confident that the technique will be useful when used in combination with other methods.
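The article does not specify the ratio the IBM team derived, so the following is only a toy sketch of the general idea of scoring a candidate structure by how its hydrophobic residues are distributed relative to its hydrophilic ones. The residue set, the function name burial_ratio, and the ratio itself (mean distance of hydrophobic residues from the centroid divided by that of the remaining residues) are assumptions made for illustration and should not be read as the group's method.

```python
import math

# Assumed hydrophobic residue set (one-letter codes), chosen for illustration only.
HYDROPHOBIC = set("AVLIMFWC")


def burial_ratio(sequence, coords):
    """Mean centroid distance of hydrophobic residues divided by that of the rest.

    A value below 1 means hydrophobic residues sit, on average, closer to the
    center of the structure than hydrophilic ones do.
    """
    if len(sequence) != len(coords):
        raise ValueError("one coordinate per residue is required")
    n = len(coords)
    centroid = tuple(sum(c[i] for c in coords) / n for i in range(3))
    inner, outer = [], []
    for residue, xyz in zip(sequence, coords):
        distance = math.dist(xyz, centroid)
        (inner if residue in HYDROPHOBIC else outer).append(distance)
    if not inner or not outer:
        raise ValueError("need both hydrophobic and hydrophilic residues")
    return (sum(inner) / len(inner)) / (sum(outer) / len(outer))


# Toy four-residue "structures": the same sequence with two coordinate sets.
sequence = "LLKK"
buried_core = [(1, 0, 0), (-1, 0, 0), (0, 3, 0), (0, -3, 0)]
exposed_core = [(3, 0, 0), (-3, 0, 0), (0, 1, 0), (0, -1, 0)]

# The candidate with the buried hydrophobic core gets the lower ratio.
print(burial_ratio(sequence, buried_core) < burial_ratio(sequence, exposed_core))
```

In a decoy-discrimination setting, a score of this kind would be computed for each candidate structure and used alongside other scoring functions, which matches the complementary role Royyuru describes.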
The team is already hard at work on the two target proteins recently released for the experiment. “Short-term, we are very heads-down,” he said. “We have plenty of work between now and the end of summer.”
Adding to the Microarray Analysis Toolkit
One of the CBC’s more recent activity areas is gene expression analysis. The group’s heavy background in pattern discovery made work in this field inevitable, according to Gustavo Stolovitzky, manager of the functional genomics group.
Stolovitzky’s team has developed a complete gene expression analysis package called Genes@Work that is scheduled to be available online via a Java applet within the next few weeks (www.research.ibm.com/FunGen/index.html).
The core of the package is a supervised learning algorithm that Stolovitzky said differs from standard clustering techniques. He compared the field of gene expression analysis to the parable of the blind men and the elephant, in which each man assumes that the part he is touching represents the characteristics of the whole, and said that the IBM tool is simply “touching where the other blind people are not.”
The IBM researchers are currently applying their tool in cancer research collaborations with groups from the Mayo Clinic and Columbia University.
Stolovitzky said the underlying technology for Genes@Work would also be used for data mining behind the CBC’s newest projects in systems biology, although further details of that work remain under wraps.