Since IBM’s announcement last year that it would spend $100 million to build a supercomputer called Blue Gene for protein folding research, it has begun collaborating with scientists at Indiana University, Columbia University, and the University of Pennsylvania on some of the mathematical techniques and software needed for the system.
The company has also decided to use a cellular architecture for the machine, in which simple building blocks are replicated on a large scale. Protein folding research requires advances in both computational power and molecular dynamics techniques – the mathematical methods for calculating the movement of atoms as proteins form, said Joe Jasinski, IBM’s newly appointed senior manager of Blue Gene and the computational biology center.
“The first problem that we are attacking with Blue Gene is to understand at the atomic level the detailed dynamics, the motions involved in protein folding,” Jasinski said. “That’s a very computationally intensive problem which requires at least a petaflop computer and probably something bigger.”
Most of the system software as well as the routines that will drive the applications are being developed by IBM’s computational biology group, which was formed in 1992 and now numbers about 40 scientists and engineers.
Jasinski said Blue Gene’s link to bioinformatics is that researchers, to make full use of genomic data, need to understand the structure and function of the proteins that genes code for. More knowledge about protein folding should benefit rational drug design and the understanding of protein-protein and protein-small molecule interactions, because investigations in those areas typically involve molecular dynamics calculations.
“Blue Gene will enable those calculations to be done on much larger systems with many more atoms or for much longer periods of time,” said Jasinski.
IBM plans to start testing the processor for Blue Gene in the first quarter of 2001. “The finishing touches are being put on the designs for the chip right now,” said Jasinski.
Blue Gene’s architecture is based on specially designed gigaflop processors that contain minimal instruction sets. Thirty-two processors along with 16 megabytes of memory are mounted on a chip, and 64 of the chips are placed on a board about 20 inches square, with each board containing 2 teraflops of computing power.
Eight of the boards will be placed in a tower, and 64 such towers will be connected to create Blue Gene. With eight calculation threads running on each of more than a million processors, switching circuitry is obviously critical.
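The packaging figures quoted above can be multiplied out to check the headline numbers. The short Python sketch below does only that arithmetic; it assumes exactly one gigaflop per processor (the article says only "gigaflop processors"), and the constant and function names are illustrative, not IBM's.

```python
# Packaging figures as quoted in the article.
PROCESSORS_PER_CHIP = 32
MEMORY_MB_PER_CHIP = 16
CHIPS_PER_BOARD = 64
BOARDS_PER_TOWER = 8
TOWERS = 64

# Assumption: exactly 1 gigaflop per processor ("gigaflop processors").
GFLOPS_PER_PROCESSOR = 1


def blue_gene_totals():
    """Multiply out the hierarchy: chip -> board -> tower -> machine."""
    chips = CHIPS_PER_BOARD * BOARDS_PER_TOWER * TOWERS
    processors = chips * PROCESSORS_PER_CHIP
    board_gflops = PROCESSORS_PER_CHIP * CHIPS_PER_BOARD * GFLOPS_PER_PROCESSOR
    return {
        "chips": chips,
        "processors": processors,
        "board_gflops": board_gflops,
        "total_gflops": processors * GFLOPS_PER_PROCESSOR,
        "total_memory_mb": chips * MEMORY_MB_PER_CHIP,
    }


if __name__ == "__main__":
    for name, value in blue_gene_totals().items():
        print(f"{name}: {value:,}")
```

Under that assumption each board comes to 2,048 gigaflops, or roughly the 2 teraflops cited, and the full machine to 1,048,576 processors and about one petaflop.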
Also, because of the inevitability of processor failure, the machine has been designed to be “self-healing.” If an error is detected in a processor, the last phase of computation is retried; if the error persists, the failed component is isolated and the system software routes future calculations around it.
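The retry-then-isolate policy described above can be sketched in a few lines of Python. This is an illustration of the general idea only, not IBM's system software: processors are modeled as plain callables, and the function name, the `is_valid` check, and the `max_retries` parameter are all hypothetical.

```python
def run_with_self_healing(task, processors, is_valid, max_retries=1):
    """Run task on the first healthy processor, retrying then isolating.

    Hypothetical model: each processor is a callable that takes the task
    and returns a result; is_valid(result) stands in for error detection.
    Returns (result, isolated), where isolated lists the processors that
    were routed around.
    """
    healthy = list(processors)
    isolated = []
    while healthy:
        processor = healthy[0]
        # One initial attempt plus max_retries retries of the same phase.
        for _attempt in range(1 + max_retries):
            result = processor(task)
            if is_valid(result):
                return result, isolated
        # The error persisted: isolate this component and move on.
        isolated.append(healthy.pop(0))
    raise RuntimeError("no healthy processors left")


if __name__ == "__main__":
    def faulty(task):
        return None  # always fails the validity check

    def working(task):
        return task * 2

    result, isolated = run_with_self_healing(
        21, [faulty, working], lambda r: r is not None
    )
    print(result, len(isolated))
```

A transient fault is absorbed by the retry, while a persistent one costs a single processor rather than the whole calculation.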
The task of simulating protein folding is formidable because only subtle differences of energy and entropy separate the folded from the unfolded state, and a huge number of potential interactions, both within the protein and between the protein and the solvent, are involved in pushing and pulling the chain into its final shape.
With thousands of atoms in the protein and the surrounding solvent, there are tens of millions of forces to calculate and add up at each time step, with perhaps 200 billion time steps necessary to follow the chain from its denatured to final folded state.
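The scale of those numbers follows from counting pairs: if every atom interacts with every other, N atoms give N(N-1)/2 pairwise forces per time step. The Python sketch below works through that count. The 10,000-atom figure is an assumption chosen to sit at the top of "thousands of atoms," and `flops_per_pair` is a hypothetical cost per force evaluation that the article does not give. Real molecular dynamics codes often use distance cutoffs and so evaluate fewer pairs.

```python
SECONDS_PER_DAY = 86_400


def pair_count(n_atoms):
    """Number of distinct atom pairs, N(N-1)/2, assuming all pairs interact."""
    return n_atoms * (n_atoms - 1) // 2


def total_pair_evaluations(n_atoms, time_steps):
    """Pairwise force evaluations over the whole trajectory."""
    return pair_count(n_atoms) * time_steps


def estimated_days(n_atoms, time_steps, flops_per_pair, machine_flops):
    """Rough wall-clock days, given a hypothetical cost per pair evaluation."""
    total_flops = total_pair_evaluations(n_atoms, time_steps) * flops_per_pair
    return total_flops / machine_flops / SECONDS_PER_DAY


if __name__ == "__main__":
    N_ATOMS = 10_000              # assumption: protein plus solvent
    TIME_STEPS = 200_000_000_000  # "perhaps 200 billion" from the article
    print(f"pairs per step: {pair_count(N_ATOMS):,}")
    print(f"pair evaluations: {total_pair_evaluations(N_ATOMS, TIME_STEPS):.3e}")
```

With 10,000 atoms the pair count is 49,995,000 per step, consistent with "tens of millions." Supplying a measured per-pair cost and a machine speed to `estimated_days` turns that count into a run-time estimate.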
Given the twin challenges of architecture and computation, will Blue Gene work?
Ambitious projects are inevitably risky. But Andrej Sali, who does computational biology at Rockefeller University, said he expects the machine will open new possibilities for structural simulations, even though IBM was initially perceived in the folding community as somewhat cavalier in its treatment of the energy function, the core of the folding problem.
“With this speed and power you don’t have to just pick one routine that you happen to like and run it for a year, you can instead try a number of approaches and find out what works best and spend your time on that. People haven’t given much thought to what they would do with so much CPU time because it has been totally unrealistic. But when you get a computer so much faster than anything else, you open a new universe. I’m sure that surprising and appealing new proposals will emerge from it.”
In addition to providing a challenge to test the next generation of supercomputers as they are being developed, IBM hopes the five-year Blue Gene project will give it credibility as a major player in life sciences, which it views as a profitable growth area. Six weeks ago IBM announced a separate allocation of $100 million to establish a life sciences group to serve IT needs in the pharma, biotech, and bioinformatics communities.
Anne-Marie Derouault, IBM’s director of life sciences business development and marketing, said that because data loads during the sequencing phase were low compared with what’s coming, there was no business incentive to compete in what until quite recently has been an ad hoc collection of cottage industries.
“But going from genomics to proteomics you gain one or two orders of magnitude in terms of processing and storage needs. We can do teraflop and petaflop computing, and we want to associate IBM with that kind of power in the life sciences community,” she said.
—Potter Wickware and Matthew Dougherty