At the SC10 conference in New Orleans in November, researchers, IT managers, and vendors from around the world met to discuss GPUs, cloud computing, and multicore processing. Proponents loudly championed these technologies, all with the same basic pitch: they promised researchers dealing with mountains of data easier access to more compute power than their current HPC facilities can offer. But beneath the flashy hardware hubbub, a slew of freely available tools was presented, all aimed at helping scientists get more performance out of the hardware they already own. In the scientific computing community, which traditionally rates a system by its theoretical peak performance rather than by how well it handles the processing and analysis of actual data, it is widely accepted that most clusters and data centers deliver only a fraction of their potential and use far more energy than necessary.
Hardware in the data center usually disappoints, not for lack of power, but because the code has not been properly tuned. "In this age of the many cores, the complexity of computing systems grows constantly, so to get [the] most out of your hardware takes more and more effort because you have to optimize your code for all of the additional new, not to mention existing, hardware in your system," says Martin Burtscher, associate professor of computer science at Texas State University, San Marcos. In the past, he says, bioinformaticians could just write an MPI code, but now it has to be a hybrid MPI/OpenMP code. "If you potentially want speedups it's worthwhile to try and really look at your code and go through it with a fine-toothed comb," Burtscher adds. "Even if you only make it one and a half times as fast, if you think about having to buy more hardware to get those results, that would be really expensive. But spending a day or two on your code would be worth it to get those speedups at a much cheaper price."
To help the average bench biologist address these challenges, Burtscher and his colleagues developed PerfExpert, a tool with a simple user interface that shows scientists exactly where their code is slowing down and suggests how to fix the bottlenecks. "We looked at what kind of code typically runs on high-end computers and academic clusters and we found that, interestingly enough, the code is typically not very well optimized at all. And the reason is because most of the people who use the fastest machines are not computer scientists; they are chemists and biologists," Burtscher says. "They know what kind of algorithms they need to code up, but they don't know how to code it well for the machine. The people using the most colorful hardware that we have don't run very optimized code on them, which is kind of strange because you would think to get the most of it, you also want the code to be of really high quality."
Campus clusters are often pushed to capacity, not just by the explosion of data coming out of labs with increasingly high-throughput DNA sequencers, but also by inefficient data analysis workflows designed by researchers who are not intimately acquainted with what works best on their hardware. To help researchers tune their experiments and analysis workflows to get the best mileage out of their clusters, a team from Ohio State University has developed a tool called pSciMapper. "The idea behind this is that when you have many scientists using the same resources, you want to maintain high throughput and power efficiency. So what our software does is map different resources in an intelligent way and allows folks to share resources in a way that prevents the entire system [from slowing] down in any noticeable way," says Gagan Agrawal, a professor at Ohio State and pSciMapper's lead developer. "We plan on having a version of this tool that can be used by the average Beowulf cluster users or academic researcher users for their local shared compute resource in about four months." On real-world and synthetic scientific workflows, Agrawal and his colleagues have shown that pSciMapper can reduce power consumption by up to 56 percent with less than a 15 percent reduction in performance.
Perhaps the granddaddy of data center tools that should be in every IT manager's arsenal is the aptly named HPCToolkit, developed by John Mellor-Crummey at Rice University. HPCToolkit's user list reads like a who's who of top supercomputer sites, all of which regularly contend with large amounts of genomics data. They include the Juelich Supercomputing Centre at the Institute for Advanced Simulation, the Swiss National Supercomputing Centre at the Swiss Federal Institute of Technology in Zurich, Argonne National Laboratory, Oak Ridge National Laboratory, and Lawrence Berkeley National Laboratory, to name a few. The tool provides insight into the performance of scientific code running on everything from large-scale supercomputers to the average wet lab workstation running Linux, but Mellor-Crummey says you don't need to be an HPC expert to use it.
"At the highest level, it can tell you what fraction of your time you spend at any point in your code. Then you know how to focus your efforts to try and improve the performance of your code — knowing just how much time you're going to spend in a particular place is not going to be enough in order to optimize your code effectively," he says. "This doesn't require a huge change to the way you work with your code, so you don't have to manually include a lot of instrumentation in your program or add a lot of compilers. We do everything with fully optimized binaries that you compile up the way you normally would, and then you just launch them with our tool ... so it's relatively easy to fit into a scientist's workflow."
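The capability Mellor-Crummey describes first, reporting what fraction of run time is spent at each point in the code without adding instrumentation by hand, is the general idea behind sampling profilers: interrupt the running program at regular intervals and tally where it was each time. The toy Python sketch below illustrates only that idea. It is not HPCToolkit's implementation or interface (HPCToolkit works on compiled, fully optimized binaries), the names `profile`, `heavy`, `light`, and `workload` are hypothetical, and it assumes a Unix-like system because it relies on `signal.setitimer` and `SIGPROF`.

```python
import collections
import signal


def profile(func, interval=0.001):
    """Toy sampling profiler (illustrative only, not HPCToolkit).

    Runs func() while a CPU-time timer fires every `interval` seconds;
    each tick records the name of the Python function that was executing.
    Returns {function name: fraction of samples}, largest first.
    """
    samples = collections.Counter()

    def on_sample(signum, frame):
        # `frame` is the Python frame that was running when the timer fired.
        if frame is not None:
            samples[frame.f_code.co_name] += 1

    old_handler = signal.signal(signal.SIGPROF, on_sample)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        func()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)  # stop the timer first
        signal.signal(signal.SIGPROF, old_handler)  # then restore the handler

    total = sum(samples.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in samples.most_common()}


# Hypothetical workload: heavy() does ten times the work of light().
def heavy():
    total = 0
    for i in range(3_000_000):
        total += i * i
    return total


def light():
    total = 0
    for i in range(300_000):
        total += i * i
    return total


def workload():
    heavy()
    light()


if __name__ == "__main__":
    for name, frac in profile(workload).items():
        print(f"{name:10s} {frac:6.1%}")
```

Because samples are taken at a fixed CPU-time interval, each function's share of samples estimates its share of run time. In this example `heavy` should account for the large majority of samples, which tells you where tuning effort would pay off.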
Although Mellor-Crummey says his team does not advertise HPCToolkit much to the community for fear of being overrun with support requests, the tool is freely available and comes with a 100-page user guide that walks non-experts through using it.
There's a Tool for That
Here's a list of applications worth considering for an HPC toolbox:
Open Speed Shop: This is an open-source, multi-platform Linux tool that supports performance analysis of applications running on both single-node and large-scale systems, including Cray XT and IBM Blue Gene platforms.
iHarmonizer and iOrchestrator: Both of these tools, developed by a team at Wayne State University, are geared toward improving I/O efficiency in a data center or cluster and the performance of parallel programs crunching large data sets.
Scalable Checkpoint/Restart library: Developed by researchers at Lawrence Livermore National Laboratory, this tool is freely available through sourceforge.net. It enables checkpointing, a method of building fault tolerance into computing systems by periodically saving a program's state so that a failed run can resume from the last save, for applications that use a parallel file system.
DC Pro Tool Suite: This suite, supported by the US Department of Energy, lets IT managers evaluate energy efficiency with a profiling tool and a set of system assessment tools that pinpoint areas of the data center ripe for improvement.
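The Scalable Checkpoint/Restart entry rests on a simple idea: save a program's state periodically so that a crashed run resumes from the last save instead of starting over. The Python sketch below shows only that application-level pattern, under simple assumptions: the state fits in a small JSON file, and `save_checkpoint`, `load_checkpoint`, and `run` are hypothetical names. It does not use or imitate the actual API of the Lawrence Livermore library, which is aimed at large parallel (MPI) applications.

```python
import json
import os
import tempfile


# Hypothetical helper names; this is NOT the Scalable Checkpoint/Restart API.
def save_checkpoint(path, state):
    """Write `state` atomically: write a temp file, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp, path)  # readers see the old or new checkpoint, never half of one


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return None


def run(path, n_steps, checkpoint_every=100, fail_at=None):
    """Toy computation (sum of step indices) that checkpoints as it goes.

    `fail_at` simulates a crash just before that step is processed.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError(f"simulated crash at step {state['step']}")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
    try:
        run(ckpt, 1000, fail_at=250)
    except RuntimeError as err:
        print("crashed:", err)
    print(run(ckpt, 1000))  # resumes from the step-200 checkpoint
    # → {'step': 1000, 'total': 499500}
```

After the simulated crash at step 250, the second call reloads the checkpoint taken at step 200 and redoes only steps 200 through 249 rather than all 1,000. Writing to a temporary file and renaming it means a crash in the middle of a save cannot corrupt the previous checkpoint.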
In addition, http://www.hpctools.org/ is a comprehensive source for downloading and learning about new applications and tools to improve data center performance. This community resource offers quality-controlled open source code, manuals, and documentation to help users get started.