If you think throwing more CPUs at a Blast job will reduce completion time, think again. Increasingly, both IT technicians and researchers are finding that their cluster's performance fails to scale despite the addition of nodes packed with warp-speed processors.
Last year at CeBiTec, a bioinformatics research center at Bielefeld University in Germany, IT experts used a cluster of 128 Opteron nodes to compare the data transfer speeds of Sun's Network File System Version 3 (NFS V3) and Replicator, the grid and cluster data provisioning software from Exludus Technologies. Coming in at 35 seconds, Replicator turned out to be roughly 25 times faster at updating the database across the entire cluster than NFS V3, which took 14 minutes and 50 seconds.
For this evaluation, 935 MB of sequence data was provisioned simultaneously to all the nodes in the cluster. The research center also tested how Replicator fared at broadcasting data to different numbers of nodes: replication to 10 nodes completed in 45 seconds, while replication to 120 nodes took about 35 seconds, suggesting that broadcast time is largely independent of node count. Ralf Nolte, the IT systems administrator at CeBiTec's Bioinformatics Resource Facility who ran the benchmark testing, says he was truly impressed. "I don't know any other software better suited to our [cluster] that tackles the task of distributing over a large number of nodes," he says.
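A quick back-of-envelope calculation, using only the figures reported above, shows how wide the gap is once you account for the fact that every node receives a full copy of the file:

```python
# Back-of-envelope math from the CeBiTec benchmark reported above:
# 935 MB provisioned to 128 nodes; NFS V3 took 14 min 50 s, Replicator 35 s.
file_mb = 935
nodes = 128

nfs_seconds = 14 * 60 + 50      # 890 s
replicator_seconds = 35

print(f"Speedup: {nfs_seconds / replicator_seconds:.1f}x")  # ~25x

# Effective aggregate delivery rate, since every node gets the whole file:
total_mb = file_mb * nodes
print(f"NFS V3:     {total_mb / nfs_seconds:,.0f} MB/s aggregate")        # ~134
print(f"Replicator: {total_mb / replicator_seconds:,.0f} MB/s aggregate") # ~3,419
```

An aggregate rate of about 134 MB/s is roughly what one or two gigabit Ethernet links can sustain, which points directly at the file server's network connections as the choke point.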
That's because Replicator addresses the major obstacle to truly high-throughput cluster computing performance -- the I/O bottleneck. The tool is a software layer that sits between the file system and the workload manager and brings order to the somewhat chaotic way most clusters handle data. When a researcher runs an array of Blast jobs across a number of nodes, the gaggle of CPUs will independently attempt to access the same data from a single file server. That file server may have only two or four gigabit Ethernet connections, so the result is a virtual traffic jam as each CPU queues up to get its data.
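A toy model makes the congestion concrete. This is a sketch of my own, not Exludus code, and the link speeds are illustrative assumptions: a server uplink of about 250 MB/s (two gigabit Ethernet links) and about 100 MB/s (one gigabit link) per node.

```python
# Toy model of the single-file-server bottleneck (illustrative sketch only).
# Unicast: every node pulls its own copy through the server's shared uplink,
# so total time grows linearly with the node count.
# Broadcast: one transmission reaches all nodes, so time stays flat.

def unicast_seconds(file_mb, nodes, server_uplink_mb_s=250):
    # ~250 MB/s is roughly two gigabit Ethernet links (assumed).
    return file_mb * nodes / server_uplink_mb_s

def broadcast_seconds(file_mb, node_link_mb_s=100):
    # Limited by the slowest per-node link, ~one gigabit link (assumed).
    return file_mb / node_link_mb_s

for n in (10, 64, 128):
    print(f"{n:3d} nodes: unicast ~{unicast_seconds(935, n):5.0f} s, "
          f"broadcast ~{broadcast_seconds(935):3.0f} s")
```

The unicast column grows linearly with cluster size while the broadcast column doesn't move, which is consistent with CeBiTec provisioning 120 nodes about as quickly as 10.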
Replicator aims to avoid this congestion by simultaneously broadcasting the same data to all the nodes in a cluster. "We eliminate that bottleneck by pre-staging data, actually even ahead of computation," says Stephen Perrenod, sales and marketing VP at Exludus. "While you're doing other compute work, we can load up data for subsequent runs, [resulting] in a more efficient processing pipeline." On clusters of roughly 128 nodes, Replicator typically achieves 20 times higher performance than NFS when provisioning a file the size of the human genome, he says.
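The pre-staging pipeline Perrenod describes can be sketched in a few lines of Python; the stage() and run_blast() helpers below are hypothetical placeholders, not Replicator's actual interface.

```python
# Minimal sketch of pre-staging: push the next run's data to the nodes in
# the background while the current run computes. stage() and run_blast()
# are hypothetical placeholders for the broadcast and compute steps.
import threading

def stage(dataset):
    print(f"staging {dataset} to all nodes ...")

def run_blast(dataset):
    print(f"running Blast jobs against {dataset} ...")

datasets = ["run1.fasta", "run2.fasta", "run3.fasta"]

stage(datasets[0])                          # prime the pipeline
for i, current in enumerate(datasets):
    stager = None
    if i + 1 < len(datasets):
        # Overlap: broadcast the next dataset while this one computes.
        stager = threading.Thread(target=stage, args=(datasets[i + 1],))
        stager.start()
    run_blast(current)                      # CPUs compute instead of waiting
    if stager:
        stager.join()                       # next dataset is local by now
```

Because staging and computation overlap, the nodes spend their time running jobs rather than queuing for the file server.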
And while Replicator's innovative design clearly delivers impressive benchmark results, it's not a substitute for NFS. "I think that [Replicator] is a software that has a lot of potential in grid environments," says Ulrich Meier, who served as a technical advisor for the CeBiTec tests while he was marketing manager of life sciences at Sun. "It solves important problems for grids, but it doesn't replace a general-purpose product like NFS." Meier, who recently left Sun to join Global Life Sciences as a vice president and managing director, says that Replicator is truly unique. "It's really a new technology for grids and something that you can't classify because it's so different from any of the other [grid] products out there."
More Testing
Replicator demonstrated similar results when put to the test at the White Rose Grid in the UK. The White Rose Grid is a three-site regional computing grid connecting Leeds, Sheffield, and York universities, drawing on shared-memory machines and clusters at each university for a total of roughly 1,000 processors. Aaron Turner, the grid's technical manager, ran a comparison test on the Beowulf cluster at York, which has 10 MB-per-second networking and 1 GHz processors. The tests ran Blast jobs against a subset of the HomoSap dataset on 10 nodes, and Replicator delivered data to the nodes roughly 4.4 times faster than the grid's NFS file system. For Turner, the results spoke for themselves. "Massive, massive improvement in throughput," he says. "We're talking about 400 percent -- and that was just 10 nodes of Beowulf, a fairly small system. To see that level of performance improvement is really pretty amazing stuff."
But nobody's arguing that Replicator will solve every cluster problem. CeBiTec's Nolte notes that "it's not really a general-purpose solution for every cluster in the world." It's not particularly suited for clusters handling frequently changing data, or for applications that require different slices of data. But for situations where nodes must be updated with new data on a regular basis, Nolte says, "I think this is the route which could solve lots of the network problems that are there now."
"It's not limited to just computational genomics," says Perrenod at Exludus. "It could be [used for] the molecular modeling [or] medical imaging side. … It could be potentially dealing with large database environments, clinical databases, or statistics -- a substantial amount of data that needs to be moved around."
The I/O Meltdown
"As the node counts in clusters grow larger, and as the chips go to dual-core, quad-core, and octo-core -- which can't be that far away -- the data pressure on the networks and the file servers just continues to grow and grow," Perrenod says.
One of the big reasons for disappointing scalability has to do with the design of traditional cluster software. Workload management systems, such as Sun's Grid Engine, came out of an era of batch scheduling on monolithic supercomputers, when I/O was treated as an afterthought, Perrenod says. Developers have attempted to address the cluster scalability issue with parallel file systems, but these are often difficult to implement and expensive to run. "We typically hear that it takes a full-time administrator in a lot of environments to manage a cluster file system," he says. "We even hear of cases where things don't synchronize and people lose files."
Part of Replicator's appeal is its simplicity. The software doesn't require any major changes to a cluster or grid's infrastructure, but instead works with the existing file system and workload manager. It's Linux- and Unix-friendly, and works with popular workload management systems such as Sun's N1 Grid Engine.
For the end user, Replicator's benchmark figures translate into shorter project turnaround time. "It means that the jobs you're running are far less I/O bound because the data's basically provided to the nodes in the background. Then, when your job is running, it's actually doing useful work rather than waiting for data," says Turner from the White Rose Grid. "That increases their research productivity so they're not there twiddling their thumbs."
Replicator's technology also lets researchers checkpoint data while a job is computing and continue the job even in the event of a power outage or disk failure. "You can actually pick up from the last set of results that was written out rather than losing weeks' worth of work," Turner says.
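The checkpoint-and-resume pattern Turner describes looks roughly like the sketch below; the checkpoint file name and the process_batch() helper are hypothetical, and Replicator's actual mechanism is not shown here.

```python
# Illustrative checkpoint/resume loop (hypothetical sketch, not Replicator's
# API): progress is recorded after every batch, so a power outage or disk
# failure costs at most one batch of work.
import json
import os

CHECKPOINT = "job.checkpoint.json"  # hypothetical checkpoint file

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["next_batch"]
    return 0

def save_checkpoint(next_batch):
    with open(CHECKPOINT, "w") as f:
        json.dump({"next_batch": next_batch}, f)

def process_batch(i):
    print(f"processing batch {i}")  # placeholder for real compute work

start = load_checkpoint()           # pick up from the last batch written out
for batch in range(start, 100):
    process_batch(batch)
    save_checkpoint(batch + 1)
```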
He adds that affordability is another appealing factor: At roughly $200 per node with two single-core CPUs, Replicator's price makes it an attractive option for the right scenario. "It's a case of doing your cost-benefit analysis," Turner says. "But I think that if you have a cluster that's any decent size, then you're unlikely to be able to get a cost-effective parallel file system to keep pace with that at the moment."