Every month at universities and research institutes, hundreds of hours go by while CPUs at workstations or in clusters sit idle. The lull in activity can occur during the wee hours of the morning, or when users have momentarily stepped away from their desktops for a lunch break. It’s the combination of all this wasted computing time, plus the cost of routinely purchasing a slew of CPUs every two or three years, that has Gerry McCartney, Purdue University’s chief information officer, convinced that it’s high time desktops started earning their keep. “Rather than going to buy another big machine, why don’t you just harness the cycles you already own?” McCartney says. “We already have thousands of these little machines, but they’re not optimized for high-performance computing.”
The idea of cycle harvesting, or cycle stealing, is nothing new. It first came to prominence with the SETI@home project, which launched in 1999. With SETI@home, anyone with an Internet connection can download software and make their desktop available to help search for evidence of radio transmissions from space. But cycle stealing is also becoming an attractive option for researchers and corporations simply looking to get the most out of their CPUs.
Last year, Purdue implemented a workload management system called Condor to keep nearly all of its desktops working around the clock. Condor is an open source application developed by researchers at the University of Wisconsin, Madison, that scavenges for unused CPUs in a cluster, grid, or network and quickly puts them to work. Day or night, if any of the roughly 4,400 CPUs in Purdue’s Condor pool fall idle, they immediately start crunching data on whatever jobs users have submitted to the Condor queue from their individual workstations. Condor runs on Linux or Windows, and can reach out to computers beyond its users’ immediate geography, regardless of the local network system. McCartney says Purdue is now one of the largest Condor installations in the world.
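Work enters a pool like Purdue’s through a short submit description file handed to Condor’s scheduler. A minimal sketch follows; the program and file names here are hypothetical, but the submit commands are standard:

```
# Minimal Condor submit description (hypothetical program and file names)
universe   = vanilla
executable = analyze_sequences
arguments  = input_0001.dat
output     = job_0001.out
error      = job_0001.err
log        = job_0001.log
queue
```

Running `condor_submit` on this file places the job in the queue, and `condor_q` reports its progress as Condor matches it to an idle machine.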
But it’s not just a change in desktop work ethics; this workload management system gives researchers a way to get high-performance computing without having to spend high-performance dollars. At the Purdue Genomics Facility, investigators used Condor to run BLAST jobs and saw significant speedups. On the facility’s own 12-CPU cluster, 10,000 jobs comparing 469,115 short sequences against each other took more than 16 days. On the Purdue Condor pool, the same 10,000 jobs were completed in 16 hours, using only a subset of the pool’s available CPUs.
“If you’re running your own cluster, and you’ve got 200 machines, you’ve got 200 machines, that’s it,” says Condor user David Schwartz, director of the Laboratory for Molecular and Computational Genomics at the University of Wisconsin. “Today, if I needed 2,000 machines, I could get 2,000 machines, but if I had to write a grant [to purchase that many computers], I’d have to wait a year, if I’m lucky.”
Schwartz, whose lab is working on developing an optical mapping system that barcodes hundreds of thousands of individual DNA molecules at a time, says Condor has catalyzed his research. “We solve problems now that we normally couldn’t even consider doing previously,” he says. “The beauty of Condor is that you get as many machines as you like, and it works with a community of researchers.”
Condor’s lead developer, Miron Livny of UW Madison, says the trick to successful cycle harvesting is giving the CPU owners priority over their machines while still allowing those machines to be available in the pool. “Every [computer] in the system is capable of defining its own policies on when and how it’s used,” says Livny. “The whole system from the bottom up is based on the assumption that every resource can disengage and change its behavior.” Desktops in a Condor pool can be configured to become available only when the mouse and keyboard have been inactive for a set period of time. This fine-grained control gives Condor an edge over commercial equivalents that offer less of it, says Livny, such as Platform Computing’s LSF job scheduling software, or IBM’s LoadLeveler, which is actually based on Condor’s architecture. (A Platform spokesperson says LSF is an option for researchers seeking scalability and reliability when working on large-scale projects such as genome mapping. IBM did not return a request for comment.)
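In Condor’s configuration language, the owner-friendly policy Livny describes is written as expressions over machine attributes such as KeyboardIdle and LoadAvg. The fragment below is a sketch, assuming the stock MINUTE macro from the default configuration; the thresholds are illustrative, not defaults:

```
# Fragment of a machine's condor_config (illustrative thresholds)
# Accept jobs only after 15 minutes with no keyboard or mouse input,
# and only if the machine's own load is already low.
START    = (KeyboardIdle > 15 * $(MINUTE)) && (LoadAvg < 0.3)
# Suspend the job the moment the owner comes back
SUSPEND  = (KeyboardIdle < $(MINUTE))
# Resume once the machine has been idle again for five minutes
CONTINUE = (KeyboardIdle > 5 * $(MINUTE))
```

Because each machine evaluates its own expressions, a department can run a pool in which every desktop advertises a different policy, which is the “disengage and change its behavior” property Livny refers to.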
Condor can work on a shared file system, or simply transfer files to an open CPU for processing and return the results to the user once the job is complete. The system is also designed to save researchers time as a launch-and-forget solution: after submitting a job, users can walk away while Condor manages it for them, and it notifies them when the job is done. For Livny, the goal is to enable a single investigator to do 10,000 or 20,000 hours of computing a day without an army of postdocs and students feeding in data.
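When no shared file system is available, a few additional lines in the submit description ask Condor to ship files both ways and to send mail on completion. These are standard submit commands; the input file name is hypothetical:

```
# Submit-file additions for file transfer and notification
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = sequences.tar
notification            = Complete
```

With these set, Condor copies the inputs to whichever machine it matches, runs the job there, and transfers the outputs back, so the submitting workstation needs no further involvement until the notification arrives.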
A Condor pool is, however, slower than a single high-performance machine on a job-by-job basis. And parallelizing code, while possible, is hardly trivial, says Purdue’s McCartney. Condor’s checkpointing feature is also imperfect, shortcomings he attributes to the project not having a large technical community of developers behind it.
But ultimately, says Schwartz, the cost of constant upgrades to increase computing power is simply not always justified. And what you get in the end, not what you use to get it, is what matters. “As a genomicist, what you want is the answer,” says Schwartz. “You don’t care so much about the hardware, you don’t care about the efficiency of the software — except if it gets you the answer.”