North Carolina began dismantling its statewide supercomputing center over the summer, but that may be the best thing that ever happened to biomedical computing at Duke University. In the wake of the closure of the North Carolina Supercomputer Center (NCSC), Duke genomics and bioinformatics researchers now have a selection of computing architectures that was not previously available to them.
Last month, Duke sought to fill the compute vacuum caused by the closure of the NCSC by unveiling two new high-performance computing systems: a 108-node shared cluster resource and a Sun Fire 12K server with 256 GB of memory. While the systems are available to all university researchers, Duke’s bioinformatics and genomics community has laid claim to a large portion of the available cycle time. The Sun Fire system is housed at the university’s Center for Human Genetics and dubbed “the Genominator,” while Duke’s Center for Bioinformatics and Computational Biology (CB<sup>2</sup>) has reserved 32 nodes of the communal-style cluster farm.
“A lot of the duties, a lot of the practices that were done at the supercomputing center are now being absorbed by the university,” said Bill Rankin, director of the cluster and grid technology group at Duke’s Center for Computational Science, Engineering, and Medicine (CSEM). When the NCSC shut down, some established user groups such as engineering and the physical sciences already had departmental computing resources that they could fall back on. However, Rankin said, “a lot of the computational biology groups had immediate needs and we were in place at the time to help them out.”
The NCSC was established in 1987 as a component of MCNC, North Carolina’s non-profit IT organization. At first it was supported by the state, but funding responsibility shifted to the University of North Carolina in the mid-1990s, and then to MCNC itself through a fee-based access system. Facing a cash crunch earlier this year, MCNC opted to shift to a grid-based computing model using commodity systems as a lower-cost alternative to replacing the high-end IBM, Cray, and SGI supercomputers installed at the center.
MCNC began reallocating the NCSC’s remaining computing resources to universities across the state this summer, with plans to link the entire university system via a statewide grid. The catch, of course, is that the grid is a work in progress, leaving some researchers who relied on the NCSC resources in the lurch.
Duke, however, saw the gap in centralized compute resources as an opportunity to beef up its local computing power. “The grid won’t replace traditional supercomputing,” said Rankin. “It will essentially add to the resources that you have available, but in some cases, you’ll still need local, dedicated resources.”
As a result, bioinformatics and genomics researchers at the university will have access to a variety of architectures, and will be able to choose the best option to suit their biomedical computing needs.
Cluster Farm, Stone Soup-Style
Much like the peddler who starts off a pot of “stone soup” with a stone and a pot of water in order to attract villagers who add ingredients to create enough soup to feed the entire town, Duke is counting on contributions from its research departments to build a university-wide compute cluster. In Duke’s case, the “stone” is a core of 64 Dell 1750 Intel-based compute nodes, and the “pot” is a dedicated machine room and 24/7 support and maintenance. CB<sup>2</sup> was among the first of Duke’s “villagers” to add its own set of 32 nodes to the communal farm, and Rankin expects other university research groups to contribute, as well. Duke expects the cluster to grow to 200-400 nodes by the fall of 2004.
The carrot for research departments is the elimination of expenses associated with cooling, systems administration, and other overhead costs that the CSEM covers as part of the shared system. “Even if we had our own cluster, we wouldn’t be able to hire a full-time person just to look after it,” said Tom Kepler, interim director of the CB<sup>2</sup>.
Rankin said that researchers can opt to run their jobs on either the generic, system-wide pool of available nodes, or on a high-priority queue that will use only their machines and suspend lower-priority jobs run by anybody else. “That was very important,” Rankin said, because “some research [equipment] grants are fairly restrictive if you start buying hardware that is going to be used in a shared resource.”
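The two-tier queuing policy Rankin describes can be sketched as a toy scheduler. The class names, job format, and suspend behavior below are illustrative assumptions, not Duke’s actual batch-system configuration; the point is only that a group’s high-priority jobs run exclusively on its own machines and suspend, rather than kill, shared-queue work already running there.

```python
class Node:
    def __init__(self, owner):
        self.owner = owner        # research group that contributed the node
        self.running = None       # job currently executing, if any
        self.suspended = []       # shared-queue jobs pushed aside by the owner

def submit(nodes, job):
    """Schedule a job dict with 'group' and 'queue' ('shared' or 'high-priority')."""
    high = job["queue"] == "high-priority"
    for node in nodes:
        if high and node.owner != job["group"]:
            continue              # high-priority jobs run only on the group's own machines
        if node.running is None:
            node.running = job    # free node: start immediately
            return node
        if high and node.running["queue"] == "shared":
            node.suspended.append(node.running)  # suspend the shared job, don't kill it
            node.running = job
            return node
    return None                   # no capacity available

# A shared job lands on the first free node; the owning group's
# high-priority job then preempts it on that group's own hardware.
nodes = [Node("CB2"), Node("physics")]
shared_job = {"group": "engineering", "queue": "shared"}
submit(nodes, shared_job)
owner_job = {"group": "CB2", "queue": "high-priority"}
submit(nodes, owner_job)
```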
Craig Henriquez, an associate professor of biomedical engineering and computer science, said that his research team was “struggling to find a solution” following the closure of the NCSC. So far, he said, the cluster farm offers “at least 70 percent of the capability that we had before.” The NCSC “was a little bit better tuned to manage a number of users using the same facility, the queuing system was much more mature, and the data access was much quicker because of the way the file architecture was set up and the nature of the switching and the nature of the network,” he said. However, after experimenting with several national supercomputer centers, a departmental cluster, and several other options following the dissolution of the NCSC, Henriquez determined that the shared cluster was the best alternative available.
Cluster computing is gaining prominence in science and engineering, but, as Henriquez pointed out, “The challenge is that a cluster of 32 machines or 100 machines is not something that people can stick in a corner of the room.” Duke’s collaborative approach should work, he said, because “it’s like an in-house supercomputer center, but people are guaranteed time on those nodes.”
Of course, cluster computing doesn’t work for every bioinformatics application or research problem, so the shared resource will be augmented by the symmetric multiprocessing architecture of the Sun Fire 12K server, which has 32 1.2-GHz UltraSparc processors, 256 GB of memory, and more than 7.9 TB of storage.
“The Sun Fire system is like a big tractor-trailer compared to the shared cluster, which is like hundreds of bicycles,” said Tracy Futhey, Duke’s vice president of information technology and chief information officer. “The bicycles can carry vast numbers of small packages but are useless for moving a huge cargo container. The Sun Fire system can use all 32 processors and all 256 gigabytes of memory for a single task.”
The Sun Fire’s shared-memory system will be ideal for running large-scale biological simulations, Rankin said, noting that the 64-bit architecture of the system is an additional advantage for those applications that can’t run on the Intel 32-bit architecture of the cluster.
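The 64-bit advantage Rankin cites comes down to address space. A quick back-of-the-envelope check (our arithmetic, not a figure from the article) shows why a 32-bit cluster node cannot stand in for the Sun Fire on memory-hungry jobs:

```python
# A single 32-bit process can address at most 2**32 bytes of memory,
# regardless of how much RAM the machine holds.
GB = 2**30
addr_limit_32bit = 2**32
print(addr_limit_32bit // GB)               # 4 GB ceiling per 32-bit process

# The Sun Fire's 256 GB of shared memory is far beyond that ceiling.
sun_fire_memory = 256 * GB
print(sun_fire_memory // addr_limit_32bit)  # 64 times the 32-bit limit
```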
Judy Stenger, an associate research professor at the Center for Human Genetics and the PI on the $3 million NIH grant that funded the purchase of the Genominator, said she plans to use both the Sun Fire and the cluster system for her research.
Think Globally, Act Locally
In addition to the cluster farm and the Sun Fire, bioinformatics and genomics researchers at Duke will soon have a third computing option: the MCNC-led North Carolina BioGrid, which hit a speedbump when MCNC reorganized its resources after shutting down the NCSC. Rankin said that Duke participated in the testbed phase of the project [BioInform 06-24-02], “but it was really more of a proof of concept.” With MCNC back on track, however, Rankin said, “we’re starting to ramp up to get the project progressing a little quicker now.”
MCNC spokesman Scott Yates said that the early success of the BioGrid influenced MCNC’s decision to shift all of its computational resources to a grid-based model. “The BioGrid has been invaluable for us in taking this next step forward in grid computing,” he said. MCNC has committed $6 million over the next three years to deploy the grid infrastructure statewide, he said. “Instead of having these big, monolithic computers, those are evolving really into cluster computing … so it really becomes several clusters that become a grid, and then the grids interconnect with each other.”
The benefits — and even the long-term feasibility — of grid computing have yet to be proven, however, so Duke has taken the necessary steps to ensure that it remains computationally self-sufficient. Currently, Rankin noted, there is no support infrastructure for grid computing similar to that offered by the university’s shared cluster. Furthermore, he said, “there will always be applications and research needs that cannot be met by grid computing,” which will require large shared-memory systems like the Sun Fire. Another drawback of the grid model is that “you need to rely on the good nature of your neighbors to make cycles available to you,” which is not always an optimal situation for researchers on tight deadlines.
Even on the grid, Rankin said, “the only thing you’re guaranteed is what you have direct control over, and that’s your local resources.”