By John S. MacNeil
Cluster computing: Networked computers at a single site working together as one parallel machine. The work is split across multiple computers, which process their shares at the same time.
Distributed computing: A type of system that divides a workload among computers connected to a network. The network may either be enclosed in a room or out in the open, like the internet. SETI@home is one example.
Grid computing: In theory, an approach that aims to make IT infrastructure invisible to the user by applying distributed computing to a shared network of databases, algorithms, and computing resources over the internet.
It used to be a cluster, then it was distributed, now the fashionable term for teaming computers together is grid computing. It’s hard for most non-computer scientists to keep track of how fast the buzzwords evolve — or exactly what they mean.
Not to mention applying the new technology behind the jargon to their research.
Even for a computer scientist like Ernie Retzel, the director of the Center for Computational Genomics and Bioinformatics at the University of Minnesota, actually doing grid computing is a formidable challenge. First of all, there are the startup costs associated with acquiring some sort of computing power to offer others on the grid, then the problem of configuring the compute resource to make it accessible, and, last but not least, the necessity of forcing yourself to trust that fellow researchers won’t screw up your system.
Suffice it to say, for the average user looking to remotely access the data, algorithms, and computing resources they want, when they want them (the ideal manifestation of grid computing), there’s quite a bit of work to do. Putting together a cluster of CPUs or creating a distributed computing network using idle PCs may mean grid computing to some people, but to the purists, grid computing as a shared network of databases, algorithms, and computing resources is still struggling to reach maturity. “You can’t unfortunately get anything that’s just plug and play,” Retzel says.
“If you really wanted to participate, you’d probably bring your own small cluster so you had something to contribute,” Retzel says. “After that, it’s mostly system skills, enthusiasm, and patience, and not necessarily in that order. At this point it’s very much a researchy world.”
Which means that grid computing may sound great in theory, but until academics work out the kinks, doing it properly isn’t something most researchers will have a chance to experience anytime soon. Small-scale, distributed computing networks within one organization — what one might refer to as an in-house grid — are currently feasible, but that’s not necessarily anything new. (Vendors like Oracle and Sun continue to release more powerful software for managing distributed compute jobs, making in-house clusters more efficient.) For most researchers, entering a query from your desktop and having the grid pull together the required data, bioinformatics algorithms, and compute resources necessary to complete the analysis remains strictly a vision of what someday might be.
Furthermore, there’s the question of what types of compute jobs are best suited to a distributed environment — regardless of whether that environment is in-house or spans multiple institutions. One reasonably powerful server may be enough to handle isolated Blast searches or image analysis, but for problems that can be easily broken up into discrete chunks and then reassembled — such as simulating how a large set of drug-like compounds interacts with one or more proteins — applying a distributed computing solution makes more sense.
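The "break it up, farm it out, reassemble" pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the compound names and the `score_compound` function are invented stand-ins for a real docking simulation, and a local thread pool stands in for the separate machines of a cluster or grid.

```python
from concurrent.futures import ThreadPoolExecutor


def score_compound(compound: str) -> int:
    """Toy stand-in for a docking score; a real job would run a simulation."""
    return sum(ord(ch) for ch in compound) % 100


def split_into_chunks(items: list, chunk_size: int) -> list:
    """Break the workload into independent chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def score_chunk(chunk: list) -> list:
    """Work done by one worker: score every compound in its chunk."""
    return [(compound, score_compound(compound)) for compound in chunk]


def screen(compounds: list, chunk_size: int = 2, workers: int = 4) -> dict:
    """Scatter chunks to workers, then reassemble the partial results."""
    chunks = split_into_chunks(compounds, chunk_size)
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands each chunk to a worker and yields results in chunk order
        for partial in pool.map(score_chunk, chunks):
            merged.update(partial)
    return merged


if __name__ == "__main__":
    library = ["cmpd-A", "cmpd-B", "cmpd-C", "cmpd-D", "cmpd-E"]
    print(screen(library))
```

Because no chunk depends on another, the same structure scales from threads on one machine to nodes on a network. A job whose steps depend on each other's results would not split this cleanly.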
Despite the challenges, there are a few efforts to put grid computing into practice. A look at these efforts may offer some clues as to how the field is evolving, and at what point you’ll be able to tap into the network, assuming you decide you want to.
One of the more advanced demonstrations of what grid computing can do is in North Carolina. Established in 2001, the NC Biogrid project combines efforts at NC State, Duke, the University of North Carolina, and several non-profit computing consortia and IT vendors. The project is managed by MCNC, a non-profit organization set up in 1980 by the North Carolina General Assembly to foster the development of high-tech resources in the state.
Wolfgang Gentzsch, a former Sun executive who in April joined MCNC as director of the grid computing and networking services group, sees grid computing as adding more powerful components to the infrastructure of the internet, allowing researchers to communicate, collaborate, and run applications across a distributed computing environment.
So what does this mean in practice? Grid computing efforts in the NC Biogrid project currently take two forms: an R&D-intensive “testbed” initiative that develops grid applications for biosciences by designing middleware to link remote users with applications and compute resources off-site, and the “enterprise” grid, essentially a pay-per-use high-performance compute cluster that’s made available to research groups and businesses when they have problems best suited to a distributed computing solution.
At the moment the testbed grid is predominantly an academic endeavor, says Chuck Kesler, a systems analyst for MCNC. The goal is to take applications and build the appropriate hooks to connect them with grid middleware so that the programs work in the distributed grid environment spanning MCNC and the three universities in the Research Triangle Park area, he says. “I wouldn’t want to set any expectations that this is some sort of resource that just has incredible capabilities that you can’t get anywhere else,” he adds, “but it has been very useful from the standpoint of providing a good platform from which to develop these next-generation applications.”
The enterprise grid, on the other hand, is designed to make some of these technologies useful today. Because the system is essentially a medium-sized cluster (a 64-node cluster with a theoretical performance of 716.8 gigaflops, plus a 32-CPU Linux SMP server running at 166 gigaflops), some would say it doesn’t qualify as grid computing in the purest sense of the concept, but Kesler says it does provide a distributed computing resource that can handle commercial customers’ needs for data security and access controls. Several public and private universities in North Carolina currently use the resource for computational chemistry applications, and Baron Advanced Meteorological Systems uses it for atmospheric modeling and weather forecasting. In the coming months, Kesler’s group will begin deploying many of the biological applications developed on the testbed grid on the enterprise system, he says.
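For a sense of scale, the quoted figures can be broken down per node and per CPU. The sketch below does that arithmetic and also shows the usual way a theoretical figure is computed. The hardware configuration in the last step (two CPUs per node at 2.8 GHz, two floating-point operations per cycle) is purely an assumption chosen because it reproduces the quoted total. The article does not say what the nodes actually contain.

```python
def per_unit_gflops(total_gflops: float, units: int) -> float:
    """Average theoretical performance per node or per CPU."""
    return total_gflops / units


def theoretical_peak_gflops(nodes: int, cpus_per_node: int,
                            clock_ghz: float, flops_per_cycle: int) -> float:
    """Standard peak estimate: every CPU retiring its maximum every cycle."""
    return nodes * cpus_per_node * clock_ghz * flops_per_cycle


# Figures quoted in the article
cluster_per_node = per_unit_gflops(716.8, 64)
smp_per_cpu = per_unit_gflops(166.0, 32)

# Hypothetical configuration (an assumption, not from the article)
assumed_peak = theoretical_peak_gflops(
    nodes=64, cpus_per_node=2, clock_ghz=2.8, flops_per_cycle=2
)

if __name__ == "__main__":
    print(f"cluster, per node: {cluster_per_node:.4f} gigaflops")
    print(f"SMP server, per CPU: {smp_per_cpu:.4f} gigaflops")
    print(f"assumed configuration peak: {assumed_peak:.1f} gigaflops")
```

That works out to about 11.2 gigaflops per cluster node and about 5.2 gigaflops per CPU in the SMP server. These are ceilings, not measurements: real applications typically sustain only a fraction of the theoretical number.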
Expanding the enterprise grid beyond the bounds of one organization is still the sticking point. For one thing, most users, particularly in the private sector, want to have control over how their contribution to a multi-institutional, or global, grid is used by others, Kesler says. “From the perspective of the global grid, the technology is still relatively immature in developing,” he says. “When you talk about grids inside of an organization, inside of an enterprise, the technology is a bit more mature in that respect … but these are problems [with the global grid] that will be solved with time.”
Another attempt to put grid computing principles into practice is being led by Carole Goble, a computer scientist at the University of Manchester, in collaboration with researchers at the European Bioinformatics Institute and other organizations. Goble’s pet project is MyGrid, an attempt to design the middleware necessary for linking various computing applications, primarily in the biological sciences, with a network of remote users. Working with efforts such as Globus’ Open Grid Services Architecture and with other standards groups, Goble and her colleagues recently put together their second prototype version of MyGrid, which takes a Web services approach to providing a standard interface and set of protocols for sharing compute resources, data, and algorithms. The goal, Goble says, is to make the distributed platform appear as a single entity to the end user.
At the moment Goble and her collaborators are primarily developing the middleware infrastructure, and they leave the testing and practical application to others, such as research groups at the San Diego Supercomputer Center and several pharmas, including GlaxoSmithKline. IBM Life Sciences has also assisted with service registry technology, data management, life science identifiers, and other tools.
Academic users have employed MyGrid to set up analytical workflows for problems in yeast comparative genomics, and in studying genes associated with Graves’ Disease and Williams-Beuren Syndrome. At the San Diego Supercomputer Center, researchers are using MyGrid to create a standard procedure for annotating protein structures. “The task is to orchestrate publicly and locally available services such as databases and applications (like Blast), into workflows in order to answer a scientific question,” says Goble. “The high-level goal is to automate these procedures and make them explicitly reusable and shareable.”
Goble anticipates that MyGrid will soon have a somewhat larger following. In about 10 months, when funding for the project runs out, Goble expects to have completed a third prototype of the MyGrid software, a version that will be made available over the internet for any interested bioinformaticists. “We’re revising our portal, we’re revising our information repository, we’re revising our registry, so that we’re able to get those all out of the door by the end of the project,” Goble says. “And then you can just go and download it.”
Goble says that by downloading MyGrid, bioinformaticists can create a customizable portal through which they write and run workflows, register workflows and services to share with others, and manage and link together metadata such as provenance and historical records, among other tasks. What MyGrid doesn’t handle, Goble admits, is the low-level scheduling of compute jobs or a full security framework.
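To make the idea of a registered, reusable workflow with provenance more concrete, here is a toy sketch in Python. It is not MyGrid's interface: the `Workflow` class, the `registry` dictionary, and the two services with their data are all hypothetical, with in-memory lookups standing in for remote databases and tools such as Blast.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Workflow:
    """A named, ordered chain of services (hypothetical, not MyGrid's API)."""
    name: str
    steps: list = field(default_factory=list)       # (service_name, callable)
    provenance: list = field(default_factory=list)  # one record per step run

    def add_step(self, service_name: str, service: Callable) -> "Workflow":
        self.steps.append((service_name, service))
        return self

    def run(self, data):
        """Feed each service's output to the next, logging every step."""
        for service_name, service in self.steps:
            result = service(data)
            self.provenance.append(
                {"service": service_name, "input": data, "output": result}
            )
            data = result
        return data


# Toy services standing in for a remote database and a similarity search.
def fetch_sequence(gene_id: str) -> str:
    toy_database = {"gene1": "ATGGCC", "gene2": "ATGTTT"}
    return toy_database[gene_id]


def find_similar(sequence: str) -> list:
    toy_hits = {"ATGGCC": ["hitA", "hitB"], "ATGTTT": ["hitC"]}
    return toy_hits.get(sequence, [])


registry = {}  # shared lookup so colleagues can find and rerun a workflow

annotate = Workflow("annotate-gene")
annotate.add_step("fetch_sequence", fetch_sequence)
annotate.add_step("find_similar", find_similar)
registry[annotate.name] = annotate

if __name__ == "__main__":
    print(registry["annotate-gene"].run("gene1"))
    for record in annotate.provenance:
        print(record)
```

The provenance list is what makes a result traceable: each record notes which service ran, what it was given, and what it returned. As in the description above, nothing here decides where a job runs or who is allowed to run it.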
At Novartis, grid computing (or at least its version of the concept) is a little less pie-in-the-sky. Like a few other big pharmas such as Bristol-Myers Squibb and Pfizer, Novartis has assembled a distributed computing network that relies primarily on the spare computing power of the company’s many desktop PCs. Under the direction of Manuel Peitsch, Novartis’ global head of informatics and knowledge management, this in-house cluster of 2,700 PCs has been used to screen drug-like compounds for interactions with certain proteins, as well as for other “parallelizable” computing applications, such as text mining, text analysis, and gene discovery applications like Blast, Peitsch says.
As an example of the kinds of computations Novartis’ in-house grid can accomplish, Peitsch and his colleagues published a paper in the Journal of Medicinal Chemistry last year describing how the group discovered a previously unknown inhibitor of the protein kinase CK2, which has a possible role in cancer. Devoting the distributed compute cluster to such compound-screening problems currently represents the best use of the in-house grid, Peitsch says, because these problems have “the most need for additional computing power.” At the moment Novartis’ system has a peak performance of 5 teraflops, and uses Grid MP Enterprise software from United Devices to manage the system.
However, Novartis’ system still has a ways to go before it can truly be called an example of grid computing. For one thing, it uses a command line interface available only to a select group of bioinformaticists, limiting its practical benefits to researchers hoping to submit their own jobs on the fly. (Peitsch says his future plans for the system include a Web-based interface.) Furthermore, data security and system compatibility issues currently prohibit the Novartis grid from hooking up directly with data and computational resources outside the company. Such a system, Peitsch says, is still at least three to five years away from reality.
So where does this leave you?
The upshot, then, is that aside from building a distributed system dedicated only to in-house research, the promise of grid computing remains unfulfilled. IT vendors such as Sun and Oracle have made progress designing better software and databases for distributed computing applications, but connecting these up with a global grid remains a primarily academic endeavor.
Retzel at the University of Minnesota is among those confident that researchers will someday overcome the obstacles to global grid computing, but he’s wary of those who claim that a revolution in computing power and convenience is imminent. “There’s a set of long-term goals in there that I think will be realized, but we’re not there now,” he says. “Right now we’re in that awkward phase [when] you know where you need to get, you know what the tools are, [but] getting there is a lot more than drawing a flow chart on a blackboard.”