If you belong to that shrewd set of technology-savvy folks who wisely balked at buying Apple’s iPhone upon its initial release, opting instead to hold off until the kinks could be worked out, then you might have the same wait-and-see attitude toward the Cell Broadband Engine, a processor developed in a joint venture by Sony, Toshiba, and IBM and launched last year. If so, you could be waiting a while. In fact, you might want to roll up your sleeves and start helping: even though all signs in the world of high-performance computing point toward a future richly populated with heterogeneous multi-core processors, software developers will have to play a lot of catch-up to fully exploit the new architecture.
For the uninitiated, Cell is a heterogeneous multi-core processor consisting of one traditional 64-bit Power Processing Element as the main processor and eight co-processors, called Synergistic Processing Elements, that act as vector processors. It began attracting the attention of high-performance computing researchers with the release of the Sony PlayStation 3 in November 2006, and has since made its way into blade servers for clusters. (Check out GT’s original coverage in the Nov. ’06 issue.)
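The SPEs earn the “vector processor” label through 128-bit SIMD execution: a single instruction operates on, for example, four packed single-precision floats at once. A minimal portable-C stand-in for that idea (a real SPE would use the `spu_add` intrinsic on 128-bit vector types; this plain loop is only an illustration):

```c
/* Portable stand-in for one SPU SIMD operation: on a real SPE, the
 * spu_add() intrinsic performs all four lanes in a single instruction. */
void vec_add4(const float *a, const float *b, float *out) {
    for (int i = 0; i < 4; i++)   /* four 32-bit floats per 128-bit register */
        out[i] = a[i] + b[i];
}
```

An optimizing compiler may auto-vectorize a loop like this, but on Cell the programmer was generally expected to write the SIMD form explicitly.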
Cell first gave gaming a boost: the PS3 draws its graphics rendering power from the chip, although one of the eight SPEs is disabled in production to improve manufacturing yields. Some people have attempted to use the gaming console in a cluster-like setting, but results have been mixed. In early March, Frank Mueller, associate professor of computer science at North Carolina State University, announced that he had built the first fully operational academic PS3 cluster for less than $5,000. But Mueller cited the console’s 256 MB of RAM as a constraint in a cluster setting. Short of modifying the motherboard, Mueller says, it is possible to add memory with USB sticks or flash cards, but the memory latencies make such Band-Aid solutions hardly worthwhile.
“To me it looks like the PS3 cluster is more of a hobbyist cluster, and let’s face it, students would sooner have a PS3 on their desk than a machine room down the hallway,” says David Bader, executive director of high-performance computing at Georgia Tech. “It’s attractive and it’s certainly low-priced, but if you really want to do high-performance computation, and need a very robust high-performance system, that’s what the Cell blades will provide for you.”
This past July, Georgia Tech became one of the first universities to get a Cell-equipped IBM BladeCenter QS20 Server cluster up and running. The Georgia Tech “CellBuzz” Cluster includes 28 processors and runs on Linux for Cell. Bader says it was surprisingly easy to get CellBuzz into production use. “We expected some difficulties being a new technology and a new platform, but we had it up and running in about a day,” says Bader. “There were really no complications other than figuring out how to create a good node and duplicate it.” For all intents and purposes, CellBuzz looks and operates just like any other Linux cluster, he says. And although CellBuzz has yet to launch a Web portal, the cluster is open for students and for independent software vendors and developers to test their code on Cell.
Because of its unique design, developing parallel code that takes full advantage of the Cell is hardly a walk in the park. The processor has been the subject of numerous programming workshops over the last year, as well as countless research papers presenting methods to grapple with implementing parallel code for the multi-core design. Many top computer science departments at research universities, from the Massachusetts Institute of Technology to Budapest Polytechnic Institute, have incorporated Cell programming into their courses. IBM has, of course, encouraged this through initiatives such as the Cell University Challenge, which promises cash prizes for the biggest breakthroughs in Cell programming.
Building a Tool Kit
But when it comes to actual tools that the average bioinformaticist, burdened with typical large datasets, can use with the hardware — well, they’re tough to find. Some tools are on the horizon, but they’re not ready for prime time just yet.
IBM researcher Vipin Sachdeva has been hard at work tuning the performance of meat-and-potatoes bioinformatics applications, such as ClustalW and Smith-Waterman, on the first-generation Cell blade. “One of the things about the life sciences applications that’s really useful is that most of the applications have this small kernel, which just takes 99 percent of the running time,” Sachdeva says. “All you have to do is take that kernel, put it on the Synergistic Processing Unit, and just use data back and forth from the SPUs to run the application.”
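Sachdeva’s recipe, isolating the hot kernel and streaming data through it, can be sketched in portable C. Here `memcpy` stands in for the DMA transfers (`mfc_get` and `mfc_put` in the Cell SDK) that shuttle data between main memory and an SPE’s small local store; the buffer size and the kernel itself are illustrative, not actual Cell SDK code:

```c
#include <string.h>

#define LOCAL_BUF 4096   /* stand-in for a slice of the SPE local store */

/* The "99 percent" kernel: touches only data already in the local buffer. */
static void kernel(float *local, int n) {
    for (int i = 0; i < n; i++)
        local[i] = local[i] * 2.0f + 1.0f;
}

/* Stream a large array through the small local buffer, chunk by chunk,
 * the way an SPE stages DMA transfers in and out of its local store. */
void run_offload(float *data, int n) {
    float local[LOCAL_BUF / sizeof(float)];
    int chunk = LOCAL_BUF / sizeof(float);
    for (int off = 0; off < n; off += chunk) {
        int len = (n - off < chunk) ? (n - off) : chunk;
        memcpy(local, data + off, len * sizeof(float));  /* like mfc_get */
        kernel(local, len);
        memcpy(data + off, local, len * sizeof(float));  /* like mfc_put */
    }
}
```

On real hardware the transfers would be double-buffered so the SPU computes on one chunk while the next is in flight, which is a large part of the hand-tuning these researchers describe.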
Sachdeva was able to port only a portion of the ClustalW code to the SPEs, but those bits of code achieved speedups of 10 to 20 times what you’d get with high-end processors of standard architectures. In the case of Smith-Waterman, Sachdeva could process only limited sequence sizes because each SPE’s local store holds just 256 KB. Still, speedups of five to 10 times were achieved. Sachdeva points out that this is not an architecture issue, but a coding challenge that will require more development to work around. At this point, none of these codes is user-ready, but he hopes that will soon change.
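The local-store pressure is easy to see in code: even a linear-space Smith-Waterman scorer, which keeps only two dynamic-programming rows rather than the full matrix, needs memory proportional to the sequence length, and that footprint, plus the code and DMA buffers, must fit in the SPE’s 256 KB. A portable C sketch with illustrative scoring parameters (match +2, mismatch -1, linear gap -1; the fixed row size is demo-only):

```c
#include <string.h>

/* Local alignment score with two rolling DP rows (linear space):
 * H[i][j] = max(0, H[i-1][j-1]+s, H[i-1][j]+GAP, H[i][j-1]+GAP) */
int sw_score(const char *a, const char *b) {
    enum { MATCH = 2, MISMATCH = -1, GAP = -1 };
    int n = (int)strlen(b);
    int prev[256] = {0}, curr[256] = {0};  /* enough for short demo sequences */
    int best = 0;
    for (int i = 1; a[i - 1] != '\0'; i++) {
        curr[0] = 0;
        for (int j = 1; j <= n; j++) {
            int s = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
            int h = prev[j - 1] + s;                 /* diagonal (substitute) */
            if (prev[j] + GAP > h) h = prev[j] + GAP;      /* gap in b */
            if (curr[j - 1] + GAP > h) h = curr[j - 1] + GAP; /* gap in a */
            if (h < 0) h = 0;                        /* local alignment floor */
            curr[j] = h;
            if (h > best) best = h;
        }
        memcpy(prev, curr, sizeof(prev));
    }
    return best;
}
```

With two rows of 32-bit scores, the working set is roughly 8 bytes per column, so a 256 KB local store caps the shorter sequence at a few tens of thousands of residues at best once code and buffers are subtracted, which matches the sequence-size limits Sachdeva describes.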
Regardless of the application, Cell-specific parallel programming requires a paradigm shift for developers. “I think multi-core is a difficult problem. You’re trying to think of things in a parallel fashion, so it becomes more difficult,” Sachdeva says. “If you do any kind of [Web] search, the hardware is going multi-core, but the software is still not catching up.”
Volodymyr Kindratenko, a senior research scientist at the National Center for Supercomputing Applications, and his colleague Guochun Shi will be presenting a poster at this year’s ACM/IEEE Supercomputing Conference to detail their efforts porting the molecular dynamics simulation program NAMD to the processor. They expect that the part of their presentation that will be most interesting to attendees is the implementation steps involved in taking serial code and converting it into the form required for the SPU processors. “The Cell processor has good potential for bioinformatics. There’s some work involved to take advantage of it, and if you want to get performance out of the Cell processor, you have to hand write [your code],” Kindratenko says. “But in general, it provides performance improvements that are comparable to larger, multiple-CPU systems.”
In order to port NAMD to the Cell, Kindratenko and Shi had to make some simplifications to the code, but it still produces correct results, Kindratenko says. The researchers found that each patch of simulated atoms in NAMD is small enough to fit into a single SPE’s local store, so one patch can be assigned per SPE task. They used IBM’s Cell software development kit, which is based on the C programming language, combined with their own in-house libraries to accomplish this. While the effort was hardly trivial, they say it was worth it: their version of NAMD on Cell generally demonstrated speedups of up to five times what you’d get with current top-of-the-line processors. The NAMD Cell code is still in development and is not yet available to the general public. The big challenge, Kindratenko says, will be mapping the full-blown production version of NAMD onto the multi-core chip.
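The patch-per-SPE observation comes down to arithmetic. As a back-of-the-envelope check (the per-atom byte count and code reserve below are illustrative assumptions, not NAMD’s actual data layout), storing position, velocity, and force as three doubles each costs 72 bytes per atom, so even a conservatively budgeted local store holds a couple of thousand atoms, comfortably more than a typical patch of a few hundred:

```c
/* Back-of-envelope capacity of one SPE local store for atom data.
 * All sizes here are illustrative assumptions, not NAMD internals. */
enum {
    LOCAL_STORE    = 256 * 1024,  /* bytes of SPE local store */
    CODE_RESERVE   =  64 * 1024,  /* assumed space for code + DMA buffers */
    BYTES_PER_ATOM = 9 * 8        /* pos + vel + force, 3 doubles each */
};

int atoms_per_spe(void) {
    return (LOCAL_STORE - CODE_RESERVE) / BYTES_PER_ATOM;
}
```

Under these assumptions an SPE has room for about 2,730 atoms’ worth of state, which is why a NAMD patch can live entirely in local store while its forces are computed.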
Most of these researchers agree that if you want to see real performance results in the new world of heterogeneous multi-core processing, there are no magic shortcuts. “Looking at the Cell, it’s actually very early in a processor life cycle compared to competing multi-processors,” says Bader. “The future is heterogeneous multi-core such as Cell, but you have to understand that Cell has really been put out there in such an innovative manner: the hardware’s been ready long before the software’s matured.”