If Linux clusters are the new thing in computing for genomics, then harnessing the unused cycles of thousands, or even millions, of desktop PCs may be the new, new thing. And perhaps even a viable alternative to supercomputers.
Take Incyte Genomics. Just a couple of years ago, while Celera was setting up banks of Compaq Alphas for its sequencing efforts, Incyte was perhaps at the leading edge of high-performance computing when it opted instead for a 3,000-processor Linux cluster. Today, says Stuart Jackson, the company's technical computing architect, Incyte might have done things differently. "Let's say Incyte has 2,000 desktops in the company. And let's say that I can get use of those 2,000 desktops 50 percent of the time. That's 1,000 CPU days per day," says Jackson. "That's a fair bit of compute power without having to buy any more hardware."
Incyte has tested the products of two companies, Parabon Computing and United Devices, which both promise to build a virtual supercomputer out of a pharmaceutical or biotech company's thousands of idle desktop machines, on which some 90 percent of processor cycles simply go to waste.
"The price-performance benefit looks just like going to Linux clusters from minicomputers," says Jackson. "You're talking literally an order-of-magnitude savings."
In addition to economics, the performance capabilities are astounding. "In an afternoon, you put the software on 5,000 boxes at a pharmaceutical company and they just got a 5,000-node cluster," says Parabon CEO Steven Armentrout. "You can count the 5,000-node clusters on the planet on one hand."
Encouraged by the success of his SETI@home project, which harnessed more than 1.5 million teraflops of processing power from some 2.5 million PCs to analyze radio signals from outer space in search of intelligent life, David Anderson began United Devices in 1999 to take the idea commercial.
Although SETI@home has not found evidence that we're in the company of other intelligent life forms, United Devices is certainly not alone in the universe of distributed computing. And neither is Incyte in its willingness to look to the power of PCs for bioinformatics data crunching. Startups such as Parabon and Entropia, along with biotech and pharmaceutical mainstay Platform Computing, have all been looking to bring the large-scale power of unused PCs down to earth and into the world of bioinformatics.
As this story went to press, Celera was about to announce that it would install Parabon's software on its 800-plus desktops company-wide to help handle abundant proteomics data. Celera and Parabon also plan to co-develop applications to run on the platform.
"We went in and we demonstrated a fairly large job that we estimated would take overnight," says John Grefenstette, director of Parabon Labs, the company's research division. "The job was started about 2 pm and was done by 5 pm. The desktop machines have much more capacity than most people appreciate."
Even traditionally conservative big pharmas are evaluating the use of their corporate desktops as a viable alternative to supercomputers and dedicated clusters.
Bristol-Myers Squibb and Novartis are both running pilot programs of Entropia's offering. "We're not just looking at it in genomics and bioinformatics, but we're also looking at it in other parts of the R&D pipeline," says Richard Vissa, executive director of global core technologies, informatics, at BMS. Other pharmas declined to talk about whether they had similar plans, but, says Vissa, "Most [of] big pharma is in some kind of evaluation. Everybody is kicking tires in some shape or form."
Well, not everybody. Jason Swift, head of discovery information systems for AstraZeneca R&D Boston, is skeptical about whether distributed computing is sophisticated enough for his operation. "What happens if they turn their computers off?" he asks, repeating one of the more basic questions that gets raised when anyone suggests that his department rely on computers on the desks of the company's marketing, accounting, and secretarial staff. "One of the reasons we perform large-scale computing within data centers is because they're in a controlled environment and we can predict pretty accurately the end point of a calculation. If the calculation is distributed on 10,000 desktops, it depends on what work others are doing on those desktops and whether they turn them off at night or not," he says.
Other issues, too, make desktop distributed computing more difficult to implement than it appears at first glance. While much of genomic analysis is well suited to parallel processing, taking advantage of that still means redrafting the code of dozens of algorithms and breaking them into manageable chunks, and some programs may never be broken up successfully. Parabon's Armentrout says that porting the more than one million lines of Blast code, perhaps the number-one consumer of compute cycles at pharmaceutical and genomics companies, was a lot of work but well worth the effort. "It's not a naïve port. We've found and repaired 15 major bugs in Blast in the process," he says.
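The pattern behind this kind of parallelization can be sketched in a few lines. What follows is my own illustration, not Parabon's actual port: a hypothetical `search_chunk` stands in for a real similarity search such as Blast, and the queries are split into independent chunks that could, in principle, run on any machine and be merged afterward.

```python
# A minimal sketch of the "embarrassingly parallel" pattern behind
# distributing sequence searches: split the queries into independent
# chunks, process each anywhere, then merge the results.
# (Illustrative only; search_chunk is a stand-in, not real Blast code.)
from concurrent.futures import ProcessPoolExecutor

def search_chunk(queries):
    # Stand-in for a real similarity search: count queries
    # containing the motif "GATTACA".
    return sum(1 for q in queries if "GATTACA" in q)

def split(items, n):
    # Break the work into n roughly equal, independent chunks.
    k = max(1, len(items) // n)
    return [items[i:i + k] for i in range(0, len(items), k)]

if __name__ == "__main__":
    queries = ["GATTACA" * 3, "CCCT", "AAGATTACAA", "TTTT"] * 100
    chunks = split(queries, 4)
    # Each chunk is processed independently; a distributed platform
    # would ship chunks to idle desktops instead of local processes.
    with ProcessPoolExecutor() as pool:
        hits = sum(pool.map(search_chunk, chunks))
    print(hits)
```

Because no chunk depends on any other, a failed or switched-off desktop costs only that chunk, which is precisely what makes this class of problem a fit for harvested idle cycles.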
But potential customers are still concerned about gigabytes of data clogging up the company's networks as the server constantly sends jobs out to various desktops, and whether employees, say, in payroll will complain that their computers are slower than normal. These are all issues to take seriously, says BMS's Vissa. But those who are not looking at the technology as a possible part of their compute architecture will be left in the dust.
Slicing Up the Work
One thing BMS is evaluating in a three-month pilot begun in late August is exactly the right size into which to slice calculations. If data chunks are at the upper limit of what the PCs can handle, a single machine going down means starting its whole chunk over. At the other extreme, if the units of work are too small, the server will be flooded with potentially thousands of PCs saying, 'I'm through, send me more work.'
"It's a matter of trial and error," says Vissa, "because each application has its own compute resource requirement."
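The tradeoff Vissa describes can be made concrete with a toy model (my own illustration, not anything from the BMS pilot): for a fixed amount of total work, bigger chunks mean fewer "send me more work" messages hitting the server, but more work thrown away when a PC dies or is switched off mid-chunk.

```python
# Toy model of the chunk-size tradeoff: server round-trips vs.
# worst-case work redone when a single PC fails mid-chunk.
# (Hypothetical numbers, for illustration only.)

def tradeoff(total_units, chunk_size):
    round_trips = -(-total_units // chunk_size)   # ceiling division
    lost_on_failure = chunk_size                  # worst case: redo the chunk
    return round_trips, lost_on_failure

for size in (1, 100, 10_000):
    trips, lost = tradeoff(1_000_000, size)
    print(f"chunk={size:>6}: {trips:>9} server round-trips, "
          f"up to {lost} units redone per failed PC")
```

With a million work units, chunks of one unit generate a million round-trips but lose almost nothing to a failure, while chunks of 10,000 cut the server traffic to 100 messages at the cost of redoing 10,000 units per lost machine, which is why, as Vissa says, the sweet spot has to be found per application.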
BMS is also looking at how the distributed system scales. Several hundred PCs are currently involved in the pilot study and Vissa wants to know how the system will hold up when more are added.
There are also economic issues to factor in. For one, how do you extrapolate from a small study the number of staff needed to manage the architecture when installed throughout the company? "Do I need a dedicated system administrator for every number of PCs that I put on the grid? I don't have the metrics on that yet," says Vissa.
As for intruding on the work of the desktop users, he says, "I've had it on my desktop for over a month. I forget that it's even there."
Pros and Cons
Incyte did not, in the end, sign up for distributed desktop computing. It was so close, though, that United Devices CEO Ed Hubbard promised last year that the deal was imminent.
For now, Incyte decided that it already has all the compute power it needs in its cluster. "We're in the rather unusual position of having a very large computing resource available for bioinformatics," says Jackson. "If we didn't have such a large amount of computing resources available and the infrastructure and the network to support those machines, it would definitely be a viable option."
Distributed computing vendors may face the same issue at other biotechs and big pharmas, many of whom have recently ramped up their hardware to meet the challenge of proteomics. And with the cost of computers falling drastically, many see no need for an alternative power source. "The price of computing technology isn't that prohibitively high," says AstraZeneca's Swift. "I could spend $2 million on a Linux compute farm, place it in a managed environment in one of our server rooms, and know when the jobs I push to it are going to be returned to me."
Proponents of desktop distributed computing say that other costs must be considered, as well. "For instance, one of the costs of having equipment in the data center is square footage," says Jackson. "Then you've got power, air conditioning, and sys admins."
Besides, argues United Devices' Hubbard, the price-performance slope in desktops is much steeper than in high-performance computers. "The economics of buying a cluster are horrible," he says. "If you bought a cluster three years ago, you've lost in a big way, because you're fighting Moore's law and now it's worthless." On the other hand, "every three years you've completely turned over your desktops and you have four times the compute power."
Hubbard predicts that as PCs become more powerful and as network connections become fast enough to foster constant communication among computers across the globe, harnessing idle desktops may ultimately replace high-performance computing for many applications.
"I don't think you're going to eliminate supercomputers in the short term," he says. "But in the long term you're going to have these nodes on desktops that are capable of running almost every piece of software that you used to run on a high-performance machine. I'll tell you this: I'm not going to run out and buy any Sun stock anytime soon."

Hogwash, says Sun's life sciences market segment manager, Loralyn Mears. The sheer volume of data expected to flood the industry in the near future will make dedicated clusters a necessity. "GenBank is growing at about six to eight times the rate of Moore's law," says Mears. "As much as it's going to be wonderful to tap the resource of all of those idle computers, we don't really see that it's going to diminish a lot of the cluster sales," she says.

Sun views desktops as just part of a solution consisting of a grid of supercomputers and clusters. "Idle desktops can help you with those unexpected bursts of computational needs," says Mears.
Compaq's Ty Rabe agrees: "It's not really a threat to supercomputing." Even if one could imagine network connections approaching the speed of the communication between tightly coupled processors, networks have an inherent latency in starting contact with other computers.

Just how big a chunk of the high-performance-computing market desktop distributed computing will snatch is still unclear. Riding high with confidence after grabbing Celera as a customer, Armentrout of the 25-employee Parabon boldly exclaims, "I think, very comfortably, 50 percent of it."
All vendors of distributed computing technologies see idle corporate computers as just the beginning. It'll take time, but the Internet will be the final frontier, even for pharma, they believe. The first adopters of the SETI@home Internet model will likely be smaller pharma and biotech companies looking to gain a competitive edge. "Because they are competing with the big guys, they are more prone to taking risks," says Martin Stuart, vice president of life sciences at Entropia.
In fact, exploiting Grandma's computer over the Internet is not out of the question for Incyte. "You really can't get as much of a savings as you would on the Internet," says Jackson. "We actually thought of some ways that were sufficiently secure that our internal data security folks were, at first blush, satisfied with." But, he adds, "That, of course, is a long way from any sort of adoption."
Incyte would also have to convince executives at pharmas, which make up the bulk of its customers, of the wisdom of this practice. Even then, the Internet is uncharted territory for IP; some wonder whether pushing data out on the Internet would be considered publication. Regardless, as at least some big pharmaceutical companies are considering installing distributed desktop platforms behind the firewall, it's only a matter of time before others will be clamoring to play catch-up.
"The question is not will it, but when will it really get traction. There's just too much unused desktop compute power sitting in companies," says Vissa. "When you have two-gigahertz desktops on people's desks that are basically using e-mail and Microsoft Word, people are going to start asking, 'Why do I have all that compute power there? And why can't I utilize it?'"