For big bioinformatics tasks, Nat Goodman says bridled teams of PCs could plow over supercomputers.
Compute farms are flourishing in the fertile fields of high-throughput bioinformatics. Nourished by a steady rain of cheap hardware, clusters of computers lashed together to behave like single, large computers are feeding the growing computational appetite of modern biology.
A new generation of management software plays the role of tractor, easing the burden of software developers and system administrators by making compute farms behave more like supercomputers. Legions of consultants serve as high-paid migrant workers, helping to plow, plant, and configure these systems. Compute farms are not quite off-the-shelf, but they’re getting close. They are becoming a major source of compute power across the industry.
A cornucopia of success stories testifies to the value of the approach. Incyte operates a farm containing 3,000 Intel (PC) processors. Celera has a farm of some 800 Compaq Alpha processors. Inpharmatica, a new bioinformatics company, uses a farm of 900 PC processors for its Biopendium product. The big sequencing centers at Sanger and Whitehead have large Alpha farms, and Compaq deserves kudos for providing gratis a farm with 100 Alpha processors to scientists who are annotating the public human genome sequence. The University of California at Santa Cruz is using a 100-PC processor farm to assemble the public human genome sequence. Biogen is working with Blackstone Technology Group, a consulting firm that specializes in compute farms, to acquire a large PC-based farm. Many more companies hoeing this row have kept their plans private.
Ripe for a Farm?
A compute farm won’t do you much good, of course, unless your workload can be divided into tasks that can run in parallel. But farms work well in high-throughput bioinformatics because the workloads are generally laden with easily recognized parallel tasks.
Here’s a simple example. Consider a facility that generates 10,000 sequences a day and wants to BLAST each new sequence against a database of known sequences. The ensuing 10,000 BLAST searches can be run in parallel, since the results of each search are independent of the others. This type of parallelism, job parallelism, is easy to exploit, because all you have to do is run each input job as a parallel task. Several companies, including Blackstone and TurboGenomics, offer fully packaged parallel versions of BLAST.
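To make the idea concrete, here is a minimal sketch of job parallelism in Python. The names here are illustrative, not a real BLAST interface: a real blast_search would launch the BLAST binary as a subprocess and parse its report, which is why a pool of threads is enough to keep many searches in flight at once.

```python
# Job parallelism: each of the day's sequences becomes an independent task.
from concurrent.futures import ThreadPoolExecutor

def blast_search(sequence):
    # Stand-in for running BLAST on one sequence and parsing its report.
    return (sequence, "hits-for-" + sequence)

def run_all(sequences, workers=4):
    # Searches are independent of one another, so they can run in any
    # order; map() still returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(blast_search, sequences))
```

Because no search depends on another, adding workers (up to the number of processors) shortens the day’s run almost proportionally.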
Reality is more complicated. Usually, before you run BLAST, you have to run each sequence through a multistep analysis pipeline, for instance, to strip vector sequences, check sequence quality, and mask repeats. For a given sequence, these steps must run serially in the given order.
For multiple sequences, you have several choices: you can run the entire pipeline in parallel; or you can run all sequences through one step in parallel, then move on to the next step; or you can feed all the work into the system at once, and let the system choose the order of tasks (provided that for each sequence, the steps proceed in the correct order).
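The first of those strategies can be sketched as follows. The step names follow the example in the text; their bodies are placeholders that simply tag the sequence so you can see the order enforced.

```python
# Run the whole multistep pipeline per sequence, with different
# sequences in parallel.
from concurrent.futures import ThreadPoolExecutor

def strip_vector(seq):   return seq + ":stripped"
def check_quality(seq):  return seq + ":checked"
def mask_repeats(seq):   return seq + ":masked"
def blast(seq):          return seq + ":blasted"

PIPELINE = [strip_vector, check_quality, mask_repeats, blast]

def process(seq):
    # The steps must run serially, in order, for each sequence.
    for step in PIPELINE:
        seq = step(seq)
    return seq

def process_all(seqs, workers=4):
    # Different sequences are independent, so they run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, seqs))
```

The other two strategies move the parallel loop inside or around individual steps; which wins depends on the data-locality issues discussed below.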
This complexity is common. Either you or the farm management software has to select a strategy to run your application efficiently. It’s also worth noting that for the repeat masking and BLAST steps to run efficiently, the databases searched by these steps must be stored locally on the computers doing the work; otherwise the farm wastes too much time copying data across the network.
Rows to Hoe
We also see a lot of data parallelism, in which you divide a database into pieces and run a task in parallel on the sub-databases. Continuing with our example, suppose the sequencing facility has been given the job of sequencing bacterial genomes at the rate of one per day and that you want to compare each new genome with hundreds of existing genomes. As it takes hours to run a whole genome comparison, a good method is to divide the database into pieces containing one or more whole genomes, store those pieces on separate computers, and compare in parallel the new genome against the pieces.
Note that with data parallelism, each computer runs the same input against different data, while with job parallelism, each computer typically runs different input against the same data.
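A hedged sketch of the data-parallel case, with invented function names: the genome database is split into pieces (on a real farm, each piece would live on its own compute node), and the same query runs against every piece in parallel before the partial results are merged.

```python
# Data parallelism: same input, different pieces of the database.
from concurrent.futures import ThreadPoolExecutor

def compare(args):
    query, piece = args
    # Stand-in for a whole-genome comparison against one database piece.
    return [(query, genome) for genome in piece if genome != query]

def compare_all(query, database, n_pieces=4, workers=4):
    # Split the database into n_pieces sub-databases.
    pieces = [database[i::n_pieces] for i in range(n_pieces)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(compare, [(query, p) for p in pieces])
    # Merge the per-piece results into one report.
    return [hit for part in partials for hit in part]
```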
A final, and much trickier, type of parallelism involves modifying the internals of an existing algorithm. In our example, before you can consider doing whole genome comparisons, you have to assemble your daily ration of 10,000 sequences into a whole genome. One way to speed up the assembly process is to jump into the guts of the assembly program and modify it to exploit parallelism. Most bioinformaticists are not trained for such highly specialized work. Fortunately, parallel versions of many popular algorithms are available from the original developers or other sources. There are also several consulting firms that specialize in this kind of work, including Southwest Parallel and Kuck & Associates.
If your application has enough parallelism, you expect it to run faster as you add more processors. The term speed-up refers to the amount of improvement you get as you add processors; the best you can get (except in certain contrived cases) is linear speed-up, meaning that your system goes n times faster when you have n processors. Rarely do you achieve this ideal, because of overhead consumed in coordinating the tasks, such as moving data from task to task.
To achieve good speed-up, the parallel tasks in your application must be large enough that you don’t waste too much time coordinating their execution. The term grain size refers to the ratio of compute time to overhead; coarse grained means the ratio is big, and fine grained is the opposite. You’ll also hear the technical term embarrassingly parallel used as a synonym for “very coarse grained.” Compute farms work best on coarse-grained computations, and that’s what we see in most high-throughput bioinformatics applications.
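A rough illustration of why grain size matters (the model and numbers here are invented for the example, not drawn from any benchmark): suppose the total compute work is fixed and each processor pays a fixed coordination cost. Coarse grain means the per-processor work dwarfs that cost, so speed-up stays near linear.

```python
# Toy model: total work split across n processors, plus a fixed
# coordination cost paid per processor.
def speedup(work, n_procs, overhead_per_proc):
    serial_time = work
    parallel_time = work / n_procs + overhead_per_proc
    return serial_time / parallel_time

# Coarse grained: 1000 s of work, 1 s of coordination per processor.
coarse = speedup(1000.0, 10, 1.0)    # close to the ideal of 10
# Fine grained: same work, but 50 s of coordination per processor.
fine = speedup(1000.0, 10, 50.0)     # well short of 10
```

In this toy model the coarse-grained run achieves a speed-up of about 9.9 on 10 processors, while the fine-grained run manages only about 6.7.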
Data movement can cause substantial overhead unless you or the farm management software is clever enough to store the data needed by a task on the computer that will run that particular task. Excessive data movement is a double whammy: not only does it add overhead to the task at hand, but it also tends to slow down other tasks by adding load to the file server and network. For simple farms that run a single application (for example, a farm that just runs BLAST against a single database), it’s easy enough to copy the database to each computer at the beginning of the day. For general-purpose farms, data placement is a bigger challenge and a key function of modern farm software.
Most of the computers in a farm are the compute nodes that do the bulk of workload execution. A farm also needs a master controller to coordinate the compute nodes, and a file server to store data that is shared by the compute nodes. In a small farm, some computers might serve multiple functions, even to the point where a compute node might also run the controller and file server functions.
The computers are often multiprocessor machines themselves, and people typically characterize a farm by the number of processors rather than the number of computers. You can build farms out of inexpensive PCs (at, say, $1,000 per processor), pricey proprietary machines like Alphas and Suns, or a mixture.
The hardware that connects the computers — the interconnect — is critical; if it’s too slow, the compute nodes will waste a lot of time waiting for work or data. In a small farm, you might get by with cheap 100BaseT Ethernet (the stuff used in small office networks), but for larger farms, people typically use gigabit Ethernet or switch-based products like Giganet or Myrinet.
A key advantage of Giganet and Myrinet over Ethernet is that they scale better as the load increases. In addition, they support a new communication standard, the Virtual Interface Architecture, that incurs much less operating system overhead than Ethernet. Of course, these benefits come at a cost: cheap Ethernet costs about $50 per machine, gigabit Ethernet is up to $500 or so, and Giganet or Myrinet are in the $1,000-$2,000 range.
The software that binds the farm together — the farm management system — has three main functions. Load balancing automatically assigns tasks to compute nodes, data balancing automatically moves data to the node where it is needed, and remote administration allows compute nodes to be monitored and controlled over the interconnect. Platform Computing’s Load Sharing Facility is widely used in the genomics field, and Sun’s GridEngine is a major product in the world at large.
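The load-balancing function can be illustrated with a minimal sketch (this is the general idea only, not how Load Sharing Facility or GridEngine actually work): the master keeps a shared task queue, and each node’s worker pulls the next task as soon as it finishes the last one, so faster or less-loaded nodes naturally take on more of the work.

```python
# A toy master/worker load balancer using a shared task queue.
import queue
import threading

def worker(node_name, tasks, results):
    # Each compute node pulls tasks until the queue runs dry.
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return
        results.append((node_name, task * task))  # stand-in computation

def run_farm(task_list, node_names):
    tasks = queue.Queue()
    for t in task_list:
        tasks.put(t)
    results = []
    threads = [threading.Thread(target=worker, args=(n, tasks, results))
               for n in node_names]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

Real farm management systems layer much more on top of this: priorities, node monitoring, failure recovery, and the data balancing discussed above.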
Ranches vs. Farms
A lot of people build farms out of expensive, proprietary computers, but this misses the point. The great excitement of this technology comes from the ability to construct powerful yet inexpensive computers by exploiting the incredible price-to-performance ratios of commodity PCs.
Do the math: You can get a reasonable compute node with dual 800 MHz Pentium III processors for less than $1,000 per processor. The closest equivalent Alpha, a DS20 with dual 667 MHz processors, has a list price of more than $13,000 per processor. And a uni-processor Alpha, the DS10, costs about $6,800 per processor. For the Alpha to match the PC on a price-to-performance basis, its engineering would have to deliver enough extra speed to make up this considerable cost difference.
To get a handle on price-to-performance ratios, I went hunting for published benchmarks that compare PCs to Alphas or Suns. Amazingly, I couldn’t find any recent benchmark comparisons of these machines on bioinformatics applications. The best I could find was the SPEC CPU2000 benchmark from the authoritative Standard Performance Evaluation Corporation. This benchmark measures performance on floating point computations, such as computational chemistry, and integer computations, like executing a Perl program.
The results indicate that a 667 MHz Alpha DS20 is almost 2.5 times faster than an 800 MHz PC for floating point computations, and about 1.25 times faster for integer. This is an impressive difference from an engineering standpoint, but falls far short of the price difference between these products.
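Putting the list prices and SPEC ratios above together gives a back-of-the-envelope value comparison, normalized so the PC scores 1.0:

```python
# Price per processor and speed ratios from the figures cited above.
pc_price, alpha_price = 1000, 13000
alpha_fp_ratio, alpha_int_ratio = 2.5, 1.25   # Alpha speed vs. the PC

# Performance per dollar, with the PC as the 1.0 baseline.
price_ratio = alpha_price / pc_price          # Alpha costs 13x as much
fp_value = alpha_fp_ratio / price_ratio
int_value = alpha_int_ratio / price_ratio
```

On these numbers the Alpha delivers only about 19 percent of the PC’s floating-point performance per dollar, and about 10 percent for integer work.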
Compute farms are reaching maturity and belong in just about every high-throughput bioinformatics laboratory. Most groups have the expertise to configure and operate a small farm, but it’s probably wise to engage a consulting firm before doing your first large one. On the hardware side, the price-to-performance value of commodity PCs is so compelling that it seems best to use PCs as the workhorses of the farm, adding a smattering of Alphas or Suns for special purposes.
The toughest challenge, as usual, is software. Though farm management software has gotten a lot better in recent years, it still takes a lot of work to make an application glow. The experts are toiling away on this problem — when they finally get it licked, farms themselves could be the next bumper crop.
New Biology Farmers’ Tool Shed
Beowulf Underground http://beowulf-underground.org
Kuck and Associates http://www.kai.com
Platform Computing http://www.platform.com
Standard Performance Evaluation Corporation http://www.spec.org
Sun GridEngine http://www.sun.com/gridware
Virtual Interface Architecture http://www.viarch.org