Much to the chagrin of mothers everywhere, video games might actually be good for something after all. Well, at least the graphics processing units inside them. And it’s the video game players and their insatiable demand for eye-popping graphics that have video game board vendors competing to deliver more and more advanced GPUs.
But in addition to sheer processing power, commodity GPUs offer increasing programmability for applications beyond rendering 3-D and 2-D images. For quite some time, researchers working in an area called GPGPU, or general-purpose computing on graphics processing units, have sought ways to exploit the highly parallel structure of the GPU architecture for scientific computation. Newer generations of GPUs have demonstrated impressive speedups over CPUs when running popular database search algorithms; Smith-Waterman, for instance, clocked in with a 10-fold speed boost when run on commodity video game boards instead of on CPUs.
But in order to port algorithms to a GPU, developers must be expert in languages like the OpenGL Shading Language, a high-level programming language, or in other graphics-related libraries that provide access to the GPU. Aside from the required knowledge of specialized graphics programming languages and application programming interfaces unfamiliar to most developers working in bioinformatics, the other problem is compatibility with the particular video drivers required to run the programs. This creates a challenge because GPU manufacturers often change the architecture of their boards to suit consumer demand, for example by taking computational shortcuts to increase speed. The resulting numerical inaccuracy can cause serious trouble for applications that depend on precise computation on the GPU, such as high-performance computing (HPC) applications.
But this summer, the gap between GPGPU researchers and GPU chip vendors shrank with Nvidia’s release of a new processor line built specifically for HPC. The company’s new Tesla GPU processing board contains 128 parallel processors and 1.5 GB of dedicated memory, and is capable of delivering 518 gigaflops of parallel computation. In addition to providing customers with a powerful GPU board, the company has also rolled out a unique deskside GPU “supercomputing” unit that connects to a PC workstation through a regular PCI-Express connection. And in an effort to reach the data center, the company has developed the Tesla compute server, a 1U server housing with eight GPUs that together contain more than 1,000 parallel processors.
A Friendly Environment
To address the programmability issue, the GPU vendor unveiled a software development kit and application programming interface called CUDA, or Compute Unified Device Architecture, at the start of the year. CUDA is the first development kit to use the familiar and friendly C programming language to port algorithms to the GPU architecture, and it is free to download from Nvidia’s developer site. GPU programming experts such as Dinesh Manocha of the University of North Carolina at Chapel Hill recognize CUDA’s uniqueness in the GPGPU world. Manocha’s students have used CUDA for various projects including signal processing and linear algebra computations. “Overall, I am very impressed with CUDA and feel that the development platform is excellent for utilizing the computational power of GPUs for many non-graphics applications, including bioinformatics,” Manocha says. “I would say that it is one of the best platforms available today for users to port their applications to GPUs.”
But what makes Nvidia’s CUDA technology notable is not just its ease of use but also its compatibility with the company’s GeForce and Quadro GPU lines. So even without the new Tesla GPUs, CUDA technology is yielding great benefits for folks like John Stone, a senior research programmer with the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign. Stone and his colleagues are using CUDA to accelerate their tools on Nvidia’s older, consumer-oriented GeForce 8800 GTX boards used by gamers. Stone is the main developer of Visual Molecular Dynamics (VMD), a popular tool that lets researchers analyze and visualize the results of molecular simulations from programs like NAMD or CHARMM. “The type of user community I have for VMD is very different; they aren’t normally using hundreds of machines at a time and they may not be acquainted with the procedure for doing so,” Stone says. “The average user is looking for something that works with a typical desktop PC, so those type of people are more apt to use CUDA and GPU acceleration techniques if those will work with the stuff they have in the desktop.”
Some of the algorithms used by VMD are so computationally expensive that they were not possible to run until GPU accelerator technology came along, because they would have taken a hundred times longer on a CPU, Stone says. Before CUDA, if researchers wanted to run an electrostatic potential simulation, the job had to be run in batch mode and could take anywhere from several hours to weeks to complete for a very large system with millions of atoms. “We’ve got some test cases where the GPU is 110x faster, and with three of them, you’re at the equivalent of running a 300-processor cluster. That’s pretty amazing,” Stone says. “That was not possible before we had this CUDA environment and with the new GPUs that are coming out, it’s like having an additional computer inside your computer.”
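To see why a calculation like this suits a GPU, consider a minimal sketch of an electrostatic potential map computed by direct Coulomb summation. This is an illustration only, not VMD's code: the function names, the toy atom list, and the omission of physical constants and units are all assumptions made for brevity. It is written in Python rather than CUDA so that it runs anywhere.

```python
import math

def potential_at(point, atoms):
    """Sum charge / distance over all atoms for one grid point (constants omitted)."""
    px, py, pz = point
    total = 0.0
    for (ax, ay, az, charge) in atoms:
        r = math.sqrt((px - ax) ** 2 + (py - ay) ** 2 + (pz - az) ** 2)
        if r > 0.0:  # skip an atom sitting exactly on the grid point
            total += charge / r
    return total

def potential_slice(nx, ny, spacing, z, atoms):
    """Potential on an nx-by-ny slice at height z, returned as a list of rows."""
    return [
        [potential_at((i * spacing, j * spacing, z), atoms) for i in range(nx)]
        for j in range(ny)
    ]

# Hypothetical toy input: (x, y, z, charge) tuples, not real molecular data.
atoms = [(0.0, 0.0, 0.0, 2.0), (3.0, 4.0, 0.0, 5.0)]
grid = potential_slice(nx=4, ny=1, spacing=1.0, z=0.0, atoms=atoms)
```

Each grid point sums a contribution from every atom and depends on no other grid point, so the total work grows as grid points times atoms, and every point could in principle be handed to a separate processor. With millions of atoms and a fine grid, the inner loop runs an enormous number of times, which is consistent with the long batch-mode run times described above.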
The Cost Argument
Paul Rhodes, CEO and founder of Evolved Machines, says that Nvidia’s new GPUs have greatly accelerated his company’s neural circuit and neuron simulation research. Evolved Machines is focused on emulating the tree-like structures of neural circuitry in order to develop self-wiring arrays that simulate 3-D neural circuit growth, as well as visual object recognition and olfactory sensory processing systems. This type of simulation is highly parallel and very floating-point intensive, which makes it well suited to the architecture of a GPU.
According to the company, it is the first group to use programmable GPUs to simulate neural computation. These chips have gigabytes of RAM close at hand in multiple layers, but unlike a CPU, they cannot access this memory on every clock step. For this type of simulation, however, a GPU is a perfect fit because the data assigned to each neuron does not need to be fetched from RAM on every clock step. Simulation of a single neuron requires 200 million differential calculations per second, and real-time simulations of neural circuits can often require more than 10 teraflops of computing power.
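The article does not describe Evolved Machines' models, so the following is only a rough sketch of what repeated differential calculations on per-neuron data can look like. It assumes a textbook leaky-integrator equation and forward-Euler time stepping as stand-ins, and all names and parameter values are hypothetical.

```python
def simulate_neuron(v0, input_current, tau, v_rest, dt, steps):
    """Forward-Euler integration of dV/dt = (-(V - v_rest) + input_current) / tau."""
    v = v0  # per-neuron state, kept in a local variable for the whole run
    for _ in range(steps):
        dv_dt = (-(v - v_rest) + input_current) / tau
        v += dt * dv_dt
    return v

def simulate_population(initial_voltages, input_current, tau, v_rest, dt, steps):
    """Integrate each neuron on its own; in this toy model neurons do not interact."""
    return [
        simulate_neuron(v0, input_current, tau, v_rest, dt, steps)
        for v0 in initial_voltages
    ]

# Hypothetical parameters chosen only for illustration.
final = simulate_population(
    [0.0, 0.5, 1.0], input_current=1.0, tau=10.0, v_rest=0.0, dt=0.1, steps=5000
)
```

The point of the sketch is the access pattern: each neuron's state is loaded once, updated for many time steps using only local values, and written back at the end. That is the kind of workload that tolerates memory that cannot be reached on every clock step. Real neural circuits are coupled, so an actual simulator would also need to exchange data between neurons.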
Researchers at Evolved Machines initially looked at FPGA arrays and standard CPU clusters as possible options for their simulation needs, but ultimately went with GPUs. “The CPU clusters are just prohibitively expensive compared to GPUs, both in terms of power and in money,” says Rhodes. “We’re getting the performance on one [Tesla] board, which is several thousand dollars of parts, that would have taken us $150,000 per cluster or a whole rack or more. It’s probably a 50:1 advantage in terms of footprint, wattage, and dollars.” Rhodes says that the company plans to build a larger system that will ultimately be running 24 GPUs in a 12-teraflop rack. Instead of spending roughly $10 million for a supercomputer, Rhodes says, he can get the same computational power for just $100,000 with Nvidia’s GPU solution.
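The figures quoted here can be checked with some quick arithmetic. The sketch below assumes that each of the 24 GPUs delivers the 518-gigaflop peak cited earlier for a single Tesla board, which the article does not state outright; the variable names are illustrative.

```python
GFLOPS_PER_GPU = 518   # peak figure quoted earlier for one Tesla board (assumed per GPU)
GPUS_IN_RACK = 24      # planned system size, as quoted

rack_teraflops = GPUS_IN_RACK * GFLOPS_PER_GPU / 1000

SUPERCOMPUTER_COST = 10_000_000  # dollars, as quoted
GPU_SYSTEM_COST = 100_000        # dollars, as quoted
cost_ratio = SUPERCOMPUTER_COST / GPU_SYSTEM_COST

CLUSTER_COST = 150_000   # dollars per cluster, as quoted
QUOTED_ADVANTAGE = 50    # the "50:1" figure
implied_board_cost = CLUSTER_COST / QUOTED_ADVANTAGE

print(f"Rack peak: {rack_teraflops:.1f} teraflops")
print(f"Supercomputer vs. GPU system cost: {cost_ratio:.0f}:1")
print(f"Board cost implied by 50:1: ${implied_board_cost:,.0f}")
```

Under these assumptions the rack peaks at roughly 12.4 teraflops, in line with the 12-teraflop figure. A 50:1 advantage over a $150,000 cluster implies a board cost of about $3,000, which fits "several thousand dollars of parts." The supercomputer comparison works out to 100:1.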
So far, the Evolved Machines team has achieved 130-fold speedups on simulations that were previously conducted with current-generation, dual-core x86 CPUs. But what really closed the deal was that CUDA enabled them to seamlessly port their simulation software to these powerful chips. “It all comes down to the fact that they can compile C code and run the execution on these cards,” Rhodes says. “Of course, they’ve also designed a beautiful, distributed compute architecture and a distributed memory architecture that’s many-layered and very cool. But without the [CUDA] software drivers, compilers, [and] adaptors, we wouldn’t have been using it.”