A bioinformatics group at Japan’s National Institute of Advanced Industrial Science and Technology (AIST) has purchased a BlueGene/L supercomputer from IBM in a deal that marks the first sale of the system for the computational task it was designed for — protein folding.
A number of other research groups — at Lawrence Livermore National Laboratory, Argonne National Laboratory, and the Astron astronomy project in the Netherlands — have all placed orders for the system, but the Computational Biology Research Center at AIST will be the first team outside of IBM Research to run protein-folding simulations on the machine.
Bill Pulleyblank, director of exploratory server systems at IBM Research, called the agreement “enormously gratifying” for the research team that began designing BlueGene in late 1999. “We’re getting a second group showing that this platform really does provide the kind of capabilities that we hoped it would,” Pulleyblank said. “Getting that kind of external validation is wonderful.”
IBM did not provide financial terms of the agreement, and Pulleyblank stressed that the deal is essentially a research collaboration. BlueGene is still in a “pre-product state,” he said. “It’s not being offered as a standard IBM offering, and we’re not going out and trying to sell it.” However, he said, the company has been working with “selected partners” like the DOE labs, Astron, and AIST “who are willing to take on some of the challenges of dealing with early hardware and early software on it — not somebody who is trying to run a commercial business on it.”
Pulleyblank said that the company does not yet have any concrete commercialization plans for BlueGene/L, “but it’s something that is definitely being evaluated and pursued at this point.”
BlueGene/L is the first deliverable in the broader BlueGene research project that IBM Research launched nearly five years ago with the goal of building a system that could ultimately throw a petaflop of processing power at the computational challenges of protein-folding simulation. While future versions of the system — BlueGene/C and BlueGene/P — are slated to eventually reach that milestone, BlueGene/L’s performance levels are a bit more modest, but still impressive by today’s supercomputing standards.
Two BlueGene/L prototypes hit the No. 4 and No. 8 spots in the most recent Top500 list of the world’s most powerful supercomputers, weighing in at 11.7 Tflops and 8.7 Tflops, respectively.
The AIST system, which will comprise four racks of 2,000 processors each when it is installed in February 2005, should reach 22.8 Tflops. IBM is also building a 64-rack system for Lawrence Livermore that should reach 300 Tflops, as well as a six-rack machine for Astron, a one-rack system for Argonne, and a 20-rack machine for its own Thomas J. Watson Research Center in Yorktown Heights, NY.
All these systems are scheduled for installation early next year.
BlueGene at Work
IBM said that the BlueGene/L system will be 24 times more powerful than the current computer systems at AIST. The BlueGene project lead at AIST could not be reached before BioInform’s deadline to discuss the research group’s current computational infrastructure or its plans for using BlueGene. But Pulleyblank said that AIST’s computational biologists will be using the system for protein-folding simulations as part of a collaboration with IBM researchers working on the same problem. “It gives them and us both this chance to watch the movie of the protein folding occurring,” he said.
IBM has relied on a three-pronged design strategy for the BlueGene project, in which the system’s hardware, system software, and application software have all been developed in parallel to ensure that they work smoothly together as part of “an application-driven design,” Pulleyblank said. As a result, the team already has a fair amount of protein-research software ready to run on the machine.
Nevertheless, as part of the group’s broader mission to create a “framework that would allow the integration of various types of methods,” Pulleyblank said that IBM welcomes input — and software — from other protein-folding research groups such as the one at AIST. This framework, he said, was designed to be flexible, and was “intended to allow different groups to extend the software and contribute their own parts to it and take advantage of other parts.”
Pulleyblank said that the IBM research team has already witnessed a significant improvement in its simulation capability. Currently, most protein-folding simulations are on the order of a nanosecond, but Pulleyblank said that IBM has already approached the microsecond level on a “small” two-rack BlueGene/L system.
The BlueGene/L architecture is based on “extraordinarily efficient communication” between densely packed processors, a design that scales well and is well-suited to the demands of protein folding and other types of simulation, Pulleyblank said. He explained that most bioinformatics tasks can be classified into one of two categories: data analysis or simulation. Data analysis tasks, such as pattern matching, BLAST searches, and text mining, can run on any SMP architecture, he said. Simulation, however, “is different,” because it requires efficient communication between processors and therefore runs better on a supercomputer that was “designed as a totally integrated system” rather than as hundreds or thousands of processors cobbled together.
This distinction can make a big difference in protein-folding simulation, Pulleyblank said, because of the extremely small time steps required for atomic-scale representation. A typical time step for one of IBM’s simulations is a femtosecond (10⁻¹⁵ seconds), which means it requires around a billion steps just to reproduce one microsecond of the protein-folding process. Pulleyblank acknowledged that other research groups are using Linux clusters and other smaller systems to simulate the protein-folding process, but noted that most of these groups are not able to model the water molecules that surround the proteins. These admittedly “uninteresting” molecules actually dictate the folding pattern of a protein because the final structure of the molecule ultimately depends on the hydrophobicity or hydrophilicity of each amino acid in the protein. Pulleyblank said that one goal of the BlueGene project is to account for these water molecules in the simulation.
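The step count quoted above follows from simple unit arithmetic, sketched here in Python. This is only an illustration of the numbers in the article, assuming a fixed one-femtosecond step; the names `STEP_FS` and `steps_needed` are hypothetical and are not taken from IBM's simulation software.

```python
# Illustrative arithmetic only; these names are hypothetical, not IBM's code.
# All times are expressed in whole femtoseconds to keep the math exact.
STEP_FS = 1                   # assumed step size: one femtosecond (1e-15 s)
FS_PER_NANOSECOND = 10**6     # 1e-9 s / 1e-15 s
FS_PER_MICROSECOND = 10**9    # 1e-6 s / 1e-15 s


def steps_needed(simulated_fs, step_fs=STEP_FS):
    """Return how many fixed-size time steps cover the simulated interval."""
    return simulated_fs // step_fs


if __name__ == "__main__":
    print("steps per nanosecond: ", steps_needed(FS_PER_NANOSECOND))
    print("steps per microsecond:", steps_needed(FS_PER_MICROSECOND))
```

Under this assumption, a nanosecond-scale run takes about a million steps and a microsecond-scale run about a billion, so moving from one to the other multiplies the work roughly a thousandfold.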
Pulleyblank added that he sees a role for commodity-based systems alongside application-specific systems like BlueGene within life science research. The challenge for HPC designers, he said, is “how do you combine performance with keeping the cost level where it makes sense for people to do it?”
Any company, he said, “can design a computer that would be astonishingly powerful, but would be so expensive that nobody could ever afford it. That’s not hard.” On the other end of the scale, he said, commodity-based systems are lower in cost, but may not offer the best performance.
With BlueGene, Pulleyblank said, “We said we have to take something in between.” The system uses ready-made PowerPC processors, but combines them in a new way to include twin processors, the network controllers, and four megabytes of memory on a single piece of silicon. “Because of the way that we were able to integrate existing pieces, it became doable and we could also do it on the schedule that we wanted to do,” he said.