Much ado was made at the Supercomputing '08 conference held late last year in Austin, Texas, about a new computing solution unveiled by a company called Convey Computer. The nascent firm touts a server unit known as the HC-1 that combines an Intel Xeon processor, commodity FPGAs, and a patent-pending architecture as a breakthrough in "hybrid-core computing." But is this unit just like any other acceleration hardware, or is there something to the hype? In this case, it seems that the proof is in the proteomics pudding.
Researchers at the University of California, San Diego's Center for Computational Mass Spectrometry have discovered that a single HC-1 packs enough processing power to make a strong case for retiring their Beowulf cluster and bevy of servers. At the center, where one of the goals is to mine proteomic samples for unexpected data, the scientists have developed software to identify significant hits in their data, but it is extremely taxing to run.
"The software is convenient for the biologists, but computationally, of course, it is a much heavier task and it requires a lot more computational time and muscle to be able to ask these sorts of questions," says Nuno Bandeira, the center's executive director. "In the past we've shown that these tasks can be parallelized over multiple machines, but it's also come to our attention that it can be done in a much more compact way, and even more cost-effective way, if one uses approaches like FPGAs, which is exactly what Convey is proposing."
Initially, the center's software was developed for use on a single computer, but as run times grew, the team moved it to compute clusters and multiple server units. "But as you'd expect, as soon as we started having the ability to run more, we also throw a lot more data at it. So in some of these projects, even though we were running analysis on 200 computers, it was still taking about two months to complete the data analysis," Bandeira says. "So this new development from Convey speeds it up to about 100 times what we can do on a single machine. In principle, if one would have just a few of these HC-1 units, we could finish up the whole process in much, much less time." More specifically, Bandeira and his colleagues have found that a single HC-1 can do the job of eight of their current servers, using a comparable amount of energy.
Coherence and custom
As with any shiny new HPC toy, it's important to cut through the hype to understand the technology's value. After all, explains Rajesh Gupta, a professor of computer science at the University of California, San Diego, the "hybrid-core" label itself is easy to claim. Considering Convey's HC-1, Gupta notes, "There's no hybrid device. They are not a chip maker, they are using commodity parts, so the innovation is in the system, and that's the architecture and the software that goes on it." He adds, "Pretty much any machine builder that uses a coprocessor accelerator can claim to be 'hybrid,' so the term does not have much meaning."
For quite some time, people have used FPGAs as accelerators or coprocessors in the life sciences with varying degrees of success, including speeding up tools like BLAST and proteomics software. But according to Gupta, the difference Convey brings with the HC-1 is conceptual and architectural. "What they've been able to do is take the coprocessor and put it very close to the CPU, and the technical term I use for that is called 'coherent coprocessing,'" he says. "The idea here is that the coprocessor, which is built into the FPGA, is in lock step with the memory system model that the CPU is using."
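Gupta's point can be illustrated with a rough sketch. The C fragment below is purely conceptual and uses no Convey API; coproc_score is a hypothetical stand-in for work handed to the coprocessor. The takeaway is that a coherent coprocessor can traverse the host's data structures through the very pointers the host built, whereas a copy-based accelerator would first need explicit transfers into its own device memory.

    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for a routine dispatched to a coherent coprocessor.  Because
     * host and coprocessor share one memory model, the offloaded code could
     * walk these host-built arrays directly through the same pointers; a
     * copy-based accelerator would need its own device buffers first.
     * (Plain CPU code here -- purely illustrative.)                        */
    static double coproc_score(const double *obs, const double *theo, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += obs[i] * theo[i];
        return s;
    }

    int main(void)
    {
        size_t n = 1u << 20;
        double *obs  = malloc(n * sizeof *obs);   /* built by the host,      */
        double *theo = malloc(n * sizeof *theo);  /* e.g. from spectrum data */
        for (size_t i = 0; i < n; i++) {
            obs[i]  = (double)(i % 100) / 100.0;
            theo[i] = (double)((i * 7) % 100) / 100.0;
        }

        /* No explicit device-allocation or copy step: with coherent
         * coprocessing, passing the pointers themselves is enough.         */
        printf("score = %f\n", coproc_score(obs, theo, n));

        free(obs);
        free(theo);
        return 0;
    }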
Without getting too lost in the technical nitty-gritty, what makes the HC-1 stand out is its compiler, which generates optimized code for both the Intel processor and the FPGA coprocessor and takes much of the headache out of deciding which portions of the code run on which parts of the hardware. In other words, when code is compiled for acceleration, there is no conceptual or programming divide between the portions siphoned off to the FPGA and those funneled through the standard CPU; the code optimizer and generator that ships with the HC-1 automatically identifies code that can be dispatched to the coprocessor and run in parallel for speedups. It also helps that the HC-1's native operating system is run-of-the-mill Linux, so applications that run under Linux on an Intel 64 processor run unchanged on Convey's system.
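To make the compiler's job a little more concrete, here is a minimal sketch in plain C, with no Convey-specific directives (any such syntax would be an assumption and is deliberately omitted): the first loop's iterations are independent of one another, so an offloading compiler can legitimately mark it for parallel execution on a coprocessor, while the second loop carries a value from each iteration to the next and is the kind of code that stays on the host CPU.

    #include <stddef.h>

    /* Independent iterations: score[i] depends only on the inputs, so an
     * offloading compiler can dispatch this loop to parallel hardware.     */
    void score_all(double *score, const double *obs,
                   const double *theo, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            score[i] = obs[i] * theo[i];
    }

    /* Loop-carried dependence: each iteration feeds the next, so this
     * recurrence is better left on the conventional CPU.                   */
    double smoothed_total(const double *score, size_t n)
    {
        double acc = 0.0;
        for (size_t i = 0; i < n; i++)
            acc = 0.5 * acc + score[i];
        return acc;
    }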
While it is too early to tell what kind of impact the HC-1 will have on bioinformatics, Gupta says one clear advantage is that the new system may actually deliver where other hardware accelerator makers have not. "There has been a big gap between claimed performance of a machine and delivered performance of a machine, and the reason for that is that all the machines that have been built have been made for some broad class of applications," Gupta says. "This so-called customization has been on the minds of the hardware community for a very, very long time, but hand-crafting a machine is tough. However, Convey makes customization systematic, and this is a plus."
This ability to easily customize code is offered in the form of a software development kit containing multiple "personalities," the company's term for adaptable instruction sets that can be tailored to support particular algorithms while letting customers program in standard languages such as C++ and Fortran. This software development kit enabled Bandeira's team to accelerate tools such as the InsPecT/MS-Alignment program, with speedups of up to 100-fold. "For InsPecT, UCSD ... implemented a specific 'personality' that represents the core of the application," says Kirby Collins, product manager at Convey Computer. "This required specific FPGA programming enabled by our personality development kit, which handles the interfaces to the host and the high-performance memory system, [allowing] the logic design to focus on the search algorithm, with benefits to both productivity and performance."
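For a sense of what such a personality encapsulates, the sketch below is illustrative only and is not the InsPecT implementation: a small, compute-bound kernel that counts observed spectrum peaks matching theoretical fragment masses within a tolerance. A kernel of this shape is the kind of core one might carve out for FPGA logic, while the host keeps responsibility for parsing spectra, generating candidate peptides, and ranking results.

    #include <stddef.h>

    /* Illustrative only -- not the InsPecT code.  Count observed peaks
     * matching theoretical fragment masses within a tolerance; both peak
     * lists are assumed sorted by mass, so a simple merge-walk suffices.   */
    int shared_peaks(const double *obs, size_t n_obs,
                     const double *theo, size_t n_theo, double tol)
    {
        int hits = 0;
        size_t i = 0, j = 0;

        while (i < n_obs && j < n_theo) {
            double d = obs[i] - theo[j];
            if (d < -tol)
                i++;            /* observed peak too light: advance it   */
            else if (d > tol)
                j++;            /* theoretical peak too light: advance it */
            else {
                hits++;         /* match within tolerance                 */
                i++;
                j++;
            }
        }
        return hits;
    }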
With the computational needs of the proteomics community growing ever more demanding, Bandeira believes that researchers should not have to justify budget requests for a massive room of commodity CPUs. "When people think of buying an instrument, and all the complexity associated with running that instrument, they often forget about the downstream analysis. And the reality of it is that no one has time to go through millions of spectra manually, so if the last step in the chain fails, you're really not capitalizing on your investment of instruments," he says. "The set of skills that is required to maintain all of these things is so daunting, so if one could simplify the computational analysis step and not require labs to maintain compute clusters — which [they] are often not equipped to do — that would be a huge advantage."
Right now, Convey is still in the pre-beta stage of deployment with initial shipment planned for April. Collins says that life sciences, with its ever-growing datasets and tightening budgets, will be an important application segment for the company going forward.