Bioclusters are big. And most of them are getting bigger, according to a recent user snapshot compiled by BioInform.
A relative rarity just three years ago, Linux clusters have quickly gained popularity in the bioinformatics community as an effective, low-cost, high-performance computing option. No longer limited to small, underfunded academic groups seeking compute power on the cheap, clusters have also taken root within biotechs and pharmaceutical firms looking for a scalable complement to other supercomputing resources. An entire sub-industry has sprouted as a result, with everyone from IBM to small, independent consulting firms making their services available to the biocluster community.
BioInform recently polled 20 members of this growing population to get a better sense of how well Linux clusters are delivering on their promise. Users from 11 academic groups and nine biopharmas responded to an informal survey on how well the technology has lived up to their expectations so far and where it fits into their future infrastructure plans.
More a pulse-taking exercise than a statistically valid portrait of the user landscape, our efforts did reveal some interesting trends. Most significantly, respondents indicated that they are taking full advantage of one of the primary selling points of the approach — its scalability — by regularly adding new CPUs to their existing clusters.
Of the 20 groups surveyed, 17 have had a cluster in place for three years or less, but 14 have already added new CPUs. The six groups that have not yet expanded their clusters said they plan to do so in the next year, as do seven of the groups that have already expanded (see full results on p. 12). Across all respondents, the average cluster size increased from 81 CPUs at initial installation to 426 CPUs today. The starting size for academic clusters averaged 49 CPUs, vs. 126 for biotechs and pharmas. The current average size has grown to 167 CPUs for academic groups and 783 for biopharmas.
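To put those averages in perspective, the growth multiples can be worked out directly. The short Python sketch below uses only the CPU counts quoted above; the dictionary labels and the helper name growth_factor are illustrative choices, not part of the survey.

```python
# Average CPU counts quoted in the survey: (initial installation, current).
# The labels are illustrative; the numbers come from the paragraph above.
average_cpus = {
    "all respondents": (81, 426),
    "academic": (49, 167),
    "biopharma": (126, 783),
}


def growth_factor(initial, current):
    """Return how many times larger the current cluster is than the initial one."""
    return current / initial


# Round to one decimal place for readability.
factors = {
    sector: round(growth_factor(initial, current), 1)
    for sector, (initial, current) in average_cpus.items()
}

for sector, factor in factors.items():
    print(f"{sector}: {factor}x")
# all respondents: 5.3x
# academic: 3.4x
# biopharma: 6.2x
```

By this rough measure, the average biopharma cluster has grown about six-fold, nearly twice the multiple seen among the academic groups.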
Almost half of the clusters in our survey (nine) were originally home-grown systems. Three of these groups opted for a vendor or consultant when it came time to upgrade the system. One academic group that started out with a homemade system and then turned to a vendor for an upgrade at the two-year mark said it is going back to a homemade approach for round three. IBM, VA Linux, and Rackable were the most common choices for vendor-built systems. Firms such as Linux Networx, Blackstone Computing, Microway, and others have also sold a number of Linux clusters in the life science market, but their customers did not respond to the survey.
Keep It Simple (and Cheap)
The home-grown flavor of our sample may explain the surprisingly poor showing of Platform’s pricey LSF among distributed resource management systems. Respondents instead favored home-grown job scheduling software and the open-source Sun Grid Engine (six users each), with PBS (five) and Mosix (four) following close behind.
There were few surprises in the applications category, however. Proving that bioclusters are often dubbed “Blast farms” for a reason, 14 out of 20 groups run some flavor of Blast on their clusters, with the usual suspects of Fasta, HMMer, ClustalW, and RepeatMasker also appearing regularly.
Interestingly, none of the survey respondents opted for a commercial parallel Blast application such as TurboGenomics’ (now TurboWorx) TurboBlast, Paracel Blast, or Blackstone PowerBlast. This, again, may be due to the DIY leanings of the sample group: Four respondents indicated that they had developed their own parallel versions of the bioinformatics workhorse. One user noted that these commercial offerings “are only wrappers around NCBI/Wu-Blast and we are not very happy with them because of the costs or the programs they use.”
The Biocluster Bottom Line
When it came to judging how much bang Linux clusters deliver for their buck, the results were a bit mixed. While more than half the respondents (11) indicated that the price/performance ratio of their cluster beat that of other computational options as well as their expectations, almost half (nine) said that issues such as cooling and maintenance costs bumped the total cost of ownership for the system a bit higher than anticipated. Those who did their homework before installing the cluster — by speaking to other users and investigating all their available options — were confronted with fewer shocks, however.
One user was surprised by “how much heat the new AMD Athlon machines put out,” which led to “a few one-time startup expenses that relate to cooling.” Another simply noted that cooling is a “big deal.”
Many of those who opted to build their own underestimated “the effort of building and administering a cluster by ourselves.” The head of an academic research lab noted that despite the benefits of the cluster, “I am quite dependent on the expertise of one person (the PhD student who built it, who will leave the lab shortly).” Another bemoaned the “time required to customize applications to run on clusters,” while one user wished for “more off-the-shelf cookbooks on how to set up and maintain a cluster.”
Conversely, most who opted for vendor-installed systems seemed pleased with their choice. As one respondent put it: “The cost might have been much less if we had built the cluster ourselves. But this would have resulted in additional headaches in terms of maintenance of the machines. The cluster we have now has been running non-stop and no downtime in the last 12 months!”
While maintenance costs, I/O bottlenecks, and fileserver limitations were listed among the top drawbacks of the technology, for the majority of survey respondents, Linux clusters deliver a combination of low cost, scalability, and speed that far outweighs these inconveniences. One user explained, “we were able to do full human genome analysis in one month using only 16 Intel Pentium machines. Now [our] 26 new machines can do the exact same analysis in two weeks. All 42 machines together should be able to do that same analysis in little over a week. All this, for a cost much less than one mid-range computing system that would have an equal number of processors and comparable computing time.”
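The user’s estimate for the combined 42-machine cluster can be checked with a back-of-the-envelope calculation. The Python sketch below assumes that the analysis divides cleanly across machines, so that the throughput of the two groups simply adds, and that “one month” means roughly 30 days; both are simplifying assumptions, and the function name combined_days is illustrative.

```python
def combined_days(*days_per_group):
    """Estimate days per analysis when several machine groups work together.

    Assumes perfectly divisible work: each group finishes 1/days of the
    analysis per day, and those daily rates add up.
    """
    total_rate = sum(1.0 / days for days in days_per_group)
    return 1.0 / total_rate


old_machines_days = 30.0  # 16 Pentium machines: "one month" (assumed 30 days)
new_machines_days = 14.0  # 26 new machines: "two weeks"

estimate = combined_days(old_machines_days, new_machines_days)
print(round(estimate, 1))  # 9.5
```

Under these assumptions the combined cluster would need about nine and a half days, in line with the user’s “little over a week.” Real-world results would depend on the I/O and fileserver limits that respondents mentioned.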
As another respondent summed up, the equation that describes why Linux clusters are growing so rapidly is very simple: “Need more power: buy more nodes.”