Beowulf guru Doug Eadline says Linux clusters for genomics computing “are going to be everywhere.” Fox Chase Cancer Center is Exhibit A.
by Adrienne Burke

Doug Eadline stacks a dozen of his business cards into a deck to illustrate a supercomputer. Each card represents a component of the system — storage, computing hardware, debugger, scheduling tool, and so on. “Back in the old days, you went to Cray and got this whole thing in a neat package,” Eadline says. “They came in, plugged it in, and turned it on for you.”

But times have changed. Eadline scatters his cards across the table. “The Beowulf Bomb basically blew this thing up,” he says.

Eadline, president of Paralogic, a Bethlehem, Pa., firm that designs scientific computing clusters with commodity parts and Linux software, dates the drop of the “bomb” to about five years ago. That was when open-source operating systems such as Linux had become reliable enough, PC processors cheap enough, and network connections fast enough that an experienced IT team at a place like a national laboratory could build a parallel machine on its own. Since then, the savviest in-house computing shops have been saving millions of dollars by doing the dirty work of lashing together sets of processors with separately sold components instead of buying supercomputers.

Nowadays, in a so-called “Beowulf cluster” (see sidebar), the power of a ready-made, million-dollar supercomputer can be had for a five-digit figure, Eadline says. At that cost, supercomputer-purchasing decisions can be made at the department level, he notes. Indeed, no longer such high-ticket items that they require executive signatures, supercomputers — in the form of racks of linked-up computer boxes — are creeping into even the smallest labs and research departments. And with their burning desire to download, store, and find their own way through the human genome, molecular biology labs are among those saving their pennies to buy a Beowulf.

“There are all those people who could never afford a supercomputer that now can think about solving their problems with that type of processing power,” he says. “You can let your mind wander and say, ‘Gee, if I had this many [processors] I could do this.’” Suddenly, storing GenBank data, running BLAST searches, and modeling protein structures are jobs that a bioinformatics department with a modest budget can afford to tackle.

Take Paralogic’s newest client, the Fox Chase Cancer Center in Philadelphia. The place doesn’t yet have an Internet connection fast enough to download GenBank data more than once a week. But for pending proteomics and gene expression analysis projects, Michael Ochs, Fox Chase’s bioinformatics manager, wants supercomputing-class computational ability. He doesn’t have a budget for a high-end Cray. Nor does he want to bog down his three-person staff with the task of building a computer cluster. So Ochs logged onto the Web a few months ago to find somebody who could build him a 128-node cluster. Within a few weeks he had signed a $30,000 contract with Paralogic to start him out with an eight-node, 16-processor Linux system and a long-term service contract. If the NIH shared-equipment grant that he’s hoping for comes through, Ochs will likely retain Paralogic to add another 120 nodes. Not surprisingly, bioinformatics groups and genome research labs top Paralogic’s list of prospective customers.
Eadline, whose first brush with genomics was a 1994 SBIR grant from the US Department of Energy to develop parallel computing methods for grammar-based genome database searches, sees a huge business opportunity in building Beowulf clusters for smaller-scale genomics research. Eadline doesn’t expect that customers such as Fox Chase will ever expand their clusters to the scale of a national lab or Celera. And, to be sure, his $30,000, 16-processor machine is priced “aggressively,” Eadline says. But he and his colleagues are quite confident that by sniffing out more folks like Ochs, they’ll hit paydirt. “Sure, there will be 1,000-node clusters,” says Eadline. “But how many? Maybe hundreds. But individual departmental clusters? We’re talking hundreds of thousands of those. They’re going to be everywhere.”

Cluster Competition

To be sure, Paralogic wasn’t first to figure out that genomics is a compute-cluster gold mine. Compaq is well known for the massive clusters of Alpha processors it installed at Celera and the Sanger Centre. And Blackstone Technology, the compute-farm consultant that assisted Compaq on the Celera project, is steadily collecting clients in the sector. Its newest genomics customer is Biogen, for whom Blackstone will construct a farm of 150 Intel CPUs running on Linux. Supercomputer-maker SGI now sells the Linux cluster of Intel processors that Incyte’s bioinformatics staff built for its own genomics tasks. And IBM is building a computing cluster for Structural Bioinformatics. In addition, numerous other Beowulf builders have begun winning jobs in the genomics market: Linux NetworX at Rosetta and Lawrence Berkeley National Lab, Microway at Millennium and Pfizer, and RackSaver at Scripps, Novartis, and MolSoft.

Nevertheless, CEO George Palmer says Paralogic, which has arguably already built more Beowulf clusters for scientific computing than anyone in the world, aims to “become the genomics high-performance computing company.” Does the cluster competition from the big boys — IBM, SGI, and Compaq — worry Eadline? Not really. “Look at the doomsday scenario. OK, Bethlehem gets nuked and there’s no more Paralogic. What do they have? Commodity hardware, open source software, software that they’ve purchased that nobody has a lock on. So they can go out and hire somebody or do something,” Eadline argues. Palmer admits it’s a double-edged sword. “There’s nothing saying they have to stick with us. We know we’re competing against every other solution that’s available. We’re OK with that. We’ve done well with that.”

With genome research clusters for Novartis and Fox Chase under its belt, and two other undisclosed university-affiliated projects in collaboration with Dell underway, Palmer says the “couple million dollar” firm is in conversations now with venture capitalists. He plans to scale up business and recruit bioinformatics and genomics experts to the 16-person staff. Within six months, Palmer says, he hopes to have tripled his staff size. Paralogic’s current client list glimmers with big scientific computing names including Amerada Hess, Lucent, MIT, NASA, Phillips Petroleum, Procter & Gamble, and the US Air Force. But when Paralogic CFO William Hanlon cites an Oscar Gruss forecast for the genomics market, it’s little wonder the 11-year-old company is rewriting its business plan to cater to it: genomics computing is on track to grow within three years from its current $300 million to a $2 billion industry.
Fox Chase Bakeoff

Instead of a formal RFP, Michael Ochs sent a 10-question email to four prospective Beowulf builders in mid-September last year, noting Fox Chase’s need for a 128-processor cluster to handle “genomic searches, gene expression analyses, and proteomics analyses.” Ochs told vendors that two main types of programs would be run on the cluster: “Highly parallelizable search algorithms in which databases can be divided and searched independently” and “computational programs that must make frequent comparisons between multiple versions of code, then delete some versions while duplicating others in an iterative scheme.” Ochs says public genome database searches had begun taking several hours, so his group was anxious to at least get an eight-node proof-of-concept cluster in place.

Just over 100 researchers are involved in genomics activities at Fox Chase, a 100-bed, full-service cancer facility with outpatient treatment and a clinical research lab. In genome research circles, the place is known for the early SNP-finding work of Ken Buetow, who has since moved to the National Cancer Institute. In addition to downloading various public databanks, the center maintains several of its own repositories, including large disease population databases and functional genomic data, all of which Ochs plans to integrate. “What I’d like to do with bioinformatics is link the outcome research — the knowledge of how the patient did in the long term — to the genomic data for that patient, to the functional data about what genes were expressed, to proteomic data about the structure of proteins in the tumors,” he explains.

Ultimately, Ochs, an astrophysicist with a master’s degree in Celtic languages who has made a career of following “the most interesting question,” also plans to parallelize his own matrix factorization algorithm, Bayesian Decomposition, to perform functional genomic analyses on the cluster. Because the program runs floating-point calculations, which are known not to perform well on Pentium processors, Ochs knew he would need a combination of machines flexible enough that some or all CPUs could eventually be upgraded from Pentiums to Alphas or G4s.

Considering those requirements, Ochs’ questionnaire asked vendors: Which version of Linux do you use? If we initially choose Ethernet, can we later upgrade to a faster backplane? Does your cluster support a Fibre Channel storage area network instead of maintaining local storage on each machine? Do you have a disaster recovery method for the cluster? Are the CPUs in your cluster upgradeable? What is the maximum RAM that can be used per node in your cluster? What is your typical time from order placement to delivery?

Ochs also knew he wanted a machine “near the top in speed.” He explains, “I’ve always avoided buying the fastest processor, because there’s always a large cost involved in buying the fastest.” While he believed he needed at least 800 megahertz, Ochs says he was really looking for a vendor who could help him make the right decision. “I got one answer that I thought showed that the people didn’t understand systems very well,” Ochs says. And of the three remaining vendors, two hardly had a chance: Paralogic’s Palmer and Eadline cut to the chase.
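The first workload Ochs describes, a database carved into pieces that are searched independently, is the classic embarrassingly parallel pattern that makes Beowulf clusters attractive for sequence searching. The sketch below illustrates the idea on a single machine using Python’s standard multiprocessing module; the toy database, the naive exact-match scoring, and the function names are invented for illustration and are not Fox Chase’s or Paralogic’s code, which would farm real BLAST jobs out across cluster nodes.

```python
# Minimal, single-machine sketch of the divide-the-database-and-search pattern.
# Everything here (toy database, naive scoring, function names) is hypothetical
# and for illustration only; a real cluster would run BLAST against real
# GenBank slices on separate nodes and merge the hit lists afterward.
from multiprocessing import Pool

def search_chunk(args):
    """Search one independent slice of the database for a query string."""
    query, chunk = args
    hits = []
    for name, seq in chunk:
        # Count overlapping exact occurrences of the query in the sequence.
        score = sum(1 for i in range(len(seq) - len(query) + 1)
                    if seq[i:i + len(query)] == query)
        if score > 0:
            hits.append((name, score))
    return hits

def parallel_search(query, database, workers=8):
    """Split the database into one chunk per worker, search the chunks in
    parallel, then merge the per-chunk results into one ranked hit list."""
    chunks = [database[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partial_results = pool.map(search_chunk, [(query, c) for c in chunks])
    return sorted((hit for part in partial_results for hit in part),
                  key=lambda h: h[1], reverse=True)

if __name__ == "__main__":
    # Stand-in for a sequence database; each entry is (name, sequence).
    toy_db = [("seq%d" % i, "ACGTTGCA" * 25) for i in range(1000)]
    print(parallel_search("ACGTTGCA", toy_db, workers=4)[:5])
```

On an actual cluster the same divide-and-merge structure holds; the chunks simply live on different nodes and the searches run under a scheduler rather than a local process pool, which is why this kind of job scales so readily with node count.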
Says Eadline, “They asked us for a quote, and we said, ‘Before we quote anything we want to come down and talk and see what you guys want to do.’” Ochs, Frank Manion, director of research computing, and their colleagues put Eadline and Palmer through the wringer and were impressed by what they heard. “They were able to talk not only about the system but the specifics of the scientific problem,” Ochs recalls. “We would say something like, ‘Our first process is going to be to move BLAST onto the system so scientists can get their work done,’ and Doug had already thought about the problem. None of the other people I talked to came across that way.” Says Palmer, “[We were] extremely interested in this process they’re going through and this datamining challenge. These are the areas in which Paralogic sees itself separating itself from the competition.”

Fox Chase signed off on the paperwork within two months and was expecting its rack of eight dual-processor machines running Linux with Fast Ethernet networking to be delivered by New Year’s Eve. For his $30,000, Ochs gets not just the cluster but access to Eadline’s expertise and what Palmer calls “handholding” through other decisions about whether and how to upgrade the system or how to optimize software to run on the cluster. Ochs will also send three of his staff to Paralogic’s Beowulf administration training in Bethlehem in January. Says Eadline, who hosts the course twice a year, “There’s nowhere else in the world you can go and learn how to configure, run, and upgrade the software for your Beowulf. Everyone gets a terminal and they get to pick my brain for a week.”

Cooking up Clusters

Eadline, 44, started his company in 1989 in a state-funded incubator at Lehigh University, where he earned a PhD in physical analytical chemistry. Paralogic now operates out of an office park on Bethlehem Steel land. An abandoned steel mill and its rusted-out blast furnace loom a few hundred yards down the railroad tracks from the building where Eadline and his team are constructing the BLAST furnaces of a new industrial revolution. Hanlon and Palmer, both computing industry vets, came aboard in the past year to help Eadline grow Paralogic from a software and hardware provider into a full-service Beowulf cluster consultancy. The three put in regular 12-hour days, and Hanlon and Palmer make hour-and-a-half commutes home to their families only on weekends. Says Eadline, “We’re seeing Beowulf maturing into a real commercial entity, and there are vertical areas, the primary one being the genome area, that we can sell into.”

On Thanksgiving Eve, a blustery 35-degree day in Bethlehem, the staff has headed home for the holiday, while Eadline, Hanlon, Palmer, and Chief Operating Officer Susan Rennig sit around a conference table discussing how they plan to corner the genomics computing cluster market. Catering to each customer’s unique needs will set them apart, they say. “Here’s the answer to every question in parallel computing,” says Eadline. “It all depends on your application. Someone comes in and says, ‘I’ve got $100,000. Should I get 32 processors and Fast Ethernet, or should I get 16 processors and Myrinet?’ That’s a big question.” Unlike a traditional hardware company, Eadline contends, Paralogic is equipped to give each customer no more and no less than it needs in a computer. “What they’re saying is, ‘I don’t need a nine-gig SCSI drive in every node. I don’t want to pay for that.
And I don’t need a highfalutin video card in every box.’ We can say, ‘OK, since we build them, we’ll just put in exactly what you need.’ That lets us fine-tune the price-performance much better than some of the bigger guys.”

With his deck of business cards put away and the holiday feast looming, Eadline evokes another metaphor for computer components: “If I go to a store and buy a frozen turkey, a can of beans, and some stuffing mix, and put it all on the dining room table, is that a Thanksgiving dinner? Not until you put it together the right way and make sure everything tastes good.” A government lab, he says, has the “geek potential” to cook up all the ingredients into an effectively configured system. But a Fox Chase Cancer Center needs someone like Paralogic to do the kitchen work. “So,” Eadline deadpans, “we’re the Martha Stewart of clusters.”

A Brief History of Beowulf

The term Beowulf has a certain cachet to it, but to purists, not just any cluster of computers qualifies for the label. Doug Eadline, whose Beowulf How To guide (www.plogic.com/bw-howto.html) has made him a luminary in the cluster-computing subculture, says the term was defined narrowly by the creators of the first Beowulf cluster, Don Becker and Tom Sterling. “It’s PC hardware with widely available networking and one of several possible open-source operating systems,” says Eadline. That means a cluster of PCs running Microsoft Windows NT doesn’t qualify. Nor do a bunch of Alpha processors running Compaq’s Tru64 operating system. Nor does a Sun SPARC, unless it’s running Linux.

Becker and Sterling were at NASA’s Goddard Space Flight Center when they conceived Beowulf in 1994. “It happened to be a time when vendor-neutral machines were becoming powerful enough to act as compute nodes,” Becker recalls. He and his colleague, who first met as MIT computer science grad students, decided to try to change the way high-performance computing was done. “It used to be dominated by the supercomputing crowd,” Becker says. “Today that’s an insignificant part of the industry.”

Becker credits Sterling with christening their first 16-processor cluster after the hero of the Old English epic. “There’s a line in the story that goes, ‘Because my heart is pure I have the strength of 1,000 men,’” Becker says. “Our goal was to gather contributors from around the community to build a cluster computing system. We started around an open-source operating system so everyone could build their own, royalty free.” Their movement has drawn many more than 1,000 men. Becker says about 1,800 Beowulf builders are on the community’s mailing list.

Today Becker’s software firm, Scyld Computing, named for the Danish royal line in the poem, works with Eadline’s Paralogic. Says Becker, “Beowulf clusters are wonderful for the kind of computation I understand genomics to be — fairly easily parallelized.” Plus, he adds, “They enable people to do better science at lower cost.” — AB