By Aaron J. Sender
What if you could call off your hunt for all the data you need? Forget about struggling to run analyses of your favorite gene from your desktop, or wondering whether you’ll have to book time on a supercomputer or ship your job out to a Linux cluster to get it done.
Imagine, instead, simply punching in your query and something called “the grid” would just do it all for you. It would pull together all the bioinformatics applications, shuffle the data from the appropriate databases, and crunch various components of your analyses on just the right arrangement of computer processors. Your job could be like something straight out of Star Trek: “Computer, tell me whether this protein is involved in any pathways related to schizophrenia, give me its 3D structure, the level of expression in all brain cell types, and a list of all known chemical compounds that bind to it, including a toxicological profile of each.”
“Grid nirvana,” as Abbas Farazdel calls this fantasy, is the future for basic biological research, drug discovery and development, and even diagnostics research. Or at least that’s what some academics and industry players such as IBM, Sun Microsystems, HP, SGI, Dell, Platform Computing — to name a few — would like you to believe.
Farazdel, a solutions architect at IBM Life Sciences, is also co-chair of the recently created Life Sciences Grid Research Group of the Global Grid Forum, an organization set up to put in place the standards, technology, and policies to make grid computing a reality. “Grid nirvana is like a supermarket of services,” he explains. “On the Internet right now we buy, sell, and share HTML files. In the grid nirvana we’re going to be selling and buying everything, all resources.”
The grid goal is to make IT infrastructure invisible to the bench scientist. “In bioinformatics, researchers are spending 20 percent of the time just staging data — just finding it, figuring out who owns it, getting a copy of it, and moving it around,” says Philip Werner, VP of product management for grid software vendor Avaki. “So 20 percent is just wasted time with them being forced to be essentially system administrators.”
With the nearly boundless computing and data access the grid vision promises, you could be not only more productive, but also able to tackle new classes of problems, Farazdel says. You could. But will you?
Fantasy vs. Reality
Already, computing vendors flaunt lists of life sciences customers for whom they’ve installed some form of distributed computing or managed to decentralize databases across a network. And regional consortiums, from North Carolina to Japan, are rolling out what they call grids. But true grid nirvana, the ultimate system that Farazdel envisions, remains out of reach.
Getting there will have as much to do with sociology as with computer science. And although these challenges apply to anyone trying to establish a grid, biology is saddled with its own set of unique problems. “The physics folks have been doing grids for years,” says Ernie Retzel, a grid-computing enthusiast and director of the Center for Computational Genomics and Bioinformatics at the University of Minnesota. “The biologists, myself included, haven’t even gotten their hands around all of the problems yet — but we know they’re really big.”
In fact, grid computing is so far from reality that even the fundamental question, “What exactly is a grid?” is a topic of heated debate among grid proponents. “If you put 35 people in a room and ask them what grid computing is, you get 48 answers,” says Walter Stewart, SGI’s global coordinator for grid strategy. That’s because a grid is more an idea of where computing is heading than a particular set of technologies. “People dig themselves into a hole when they look to have the grid defined as though it were some kind of technical plan,” says Stewart. “It’s instead a metaphor for an approach to computing.” Simply put, grid computing means being able to tap into power and data whenever you want it, wherever you want it.
The most popular metaphor cited is the electrical power grid. When you plug a toaster into an outlet you probably don’t think about where the electrons that help get your bread to just the right level of crispiness are generated. Or whether they are derived from coal, hydroelectric, solar, nuclear, or wind power. In computing, relying on the desktop’s processors and memory to provide the required power and data is akin to having a separate generator attached to each home appliance. It’s not only limiting, but extremely inefficient.
Yet it’s easier to get customers to fork over hard cash for a particular technology with immediate and measurable results than it is to sell an abstract idea. Rick Stevens, project director for TeraGrid, an ambitious NSF-funded attempt to create a nationwide network of supercomputers for broad scientific research, points out, “If you’re a company and you’ve got some product and you’re trying to market it and then all of a sudden a concept gets coined that you think is one way to view what you already have, naturally there’s going to be a lot of re-labeling going on.”
As a result, today vendors are selling everything from enterprise software to ways of harnessing idle desktop processors — some call it “cycle stealing” — as complete grid technology products. “It’s like having 10 blind people describe an elephant by touching it,” says Retzel. Each is offering a component that might be part of a larger grid concept. “People tend to look at things that are very narrow and decide, for example, that cycle stealing is the answer,” says Retzel. “And that’s because it’s all they can afford to get their hands around, but it’s not the answer. It’s just part of it. Some folks have to step back far enough to say, indeed, this is not a grid but pieces of the grid.”
Big Grids for Big Pharma
Still, life science organizations, from academic institutions to big pharmaceutical companies, are beginning to experiment with these components, exploring what turning thousands of disparate computers into a single mega-ultra-supercomputer can do for them.
Take Bristol-Myers Squibb, for example. “Grid computing is getting a lot of publicity lately, and there are various flavors and forms,” says Richard Vissa, executive director of global core technologies, informatics at BMS. “The way we view grid computing here at Bristol-Myers Squibb, tactically, is, ‘How do we exploit the power of our desktop computers or PCs?’” After auditioning two distributed PC computing vendors last July, BMS signed up with Platform Computing to start hooking up its systems. Today the NJ-based pharma is running virtual screens of compounds against protein targets on several thousand PCs across its research divisions and is exploring what other kinds of problems are a good fit for the distributed PC model.
But that’s as far as BMS is willing to go toward the grid, at least for now. “We’re taking a more cautious, targeted approach right now,” says Vissa. BMS sees desktop cycle stealing as just one part of its computing infrastructure, set aside for problems that can be chopped easily into small pieces and that don’t need to be in touch constantly with a large, central database. “Some applications need gigabytes of main memory to run on, so you can’t run it on a typical desktop that might have 256 megs of memory,” says Vissa. For those kinds of applications and others, supercomputers and Linux clusters will continue to be necessary. Still, by simply squeezing excess power out of PCs, “we could get a 100-fold increase in compute power,” he adds. A typical PC today runs a processor faster than 2 gigahertz. “So that’s a lot of compute power for under $1,000 — equal probably to a supercomputer of five or six years ago,” he says.
In a sense, BMS’s approach is a direct descendant of commodity computing and Linux clusters — distributing jobs over hundreds or thousands of a single type of processor — which in itself does not a grid make.
True Grid = Compute + Data
“Grid computing is putting together any form of computational resource that one can put his hands on,” says Manuel Peitsch, Novartis head of informatics and research knowledge management. So far, Novartis has linked about 1,600 of its PCs using United Devices’ distributed desktop product. This year it plans to link all 2,700 desktops across its research sites.
But that’s just the beginning. “By the end of the year, I’m looking forward to having all the Linux clusters, all the high-performance Sun and IBM machines, and the PC grid integrated in a single grid,” says Peitsch. In addition, he is also considering tapping into compute power outside the company’s firewall as if it were a utility that lets you pay for what you use.
Sounds a bit more like nirvana. But not quite. There’s still one important piece missing: the data component. A true grid wouldn’t just pool all compute resources together; it would also link all proprietary and public databases so that they appeared to sit on the local drive of any researcher with the rights to access them.
“The compute grid is the first phase because it’s far more advanced and easier to implement,” says Peitsch. “The data grid still has a number of technological challenges.”
Douglas Brown is learning first-hand about the difficulties of dabbling in data grids. A bioinformaticist at North Carolina State University’s Center for Integrated Fungal Research, Brown runs one of a few labs chosen to help work out the kinks of the statewide NC BioGrid, currently in its prototype phase. The BioGrid aims to pool the computing, data, and networking resources of the state’s universities, as well as those of pharma and biotech companies that are members of the North Carolina Genomics and Bioinformatics Consortium.
Brown is already running his comparative genomic annotation application, De!CIFR, on a small grid of about 20 distributed computers within his lab. “For our little fungal genomes, we’ve got 40 megabases of DNA, which doesn’t sound like much,” says Brown. “But by the time you get done analyzing it, those 40 megabases will translate into a couple of gigabytes of information that you’re going to have to handle and track.” And that’s for a small organism. “When you start to look at larger organisms like the human genome, we’re talking about 3.1 gigabases, so it becomes a very large — almost intractable — computational task. And then when you want to look at many genomes simultaneously, combine them all together, it just gets worse.” With his grid, Brown can now automatically annotate an entire fungal genome in less than two days. “It would take up to seven months on a single computer,” he says.
But putting it on the NC BioGrid is another story. First, the various sites contributing to the grid use different middleware. “UNC uses Platform LSF, Duke University uses Sun GridEngine, the supercomputing center uses Avaki, and at CIFR we run PBS Pro,” says Brown.
Another problem: With different computers on the grid running different operating systems the algorithms have to be rebuilt for each one. And De!CIFR needs many different pieces of bioinformatics software in its analyses. “You run the program and it says, ‘I want to run Blast.’ The grid has to go find the right version and put it into the computer,” says Brown.
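The lookup Brown describes can be pictured as a registry keyed by program and platform. Everything below — the paths, version numbers, and platform names — is hypothetical, sketched only to show the kind of resolution step a grid scheduler must perform before dispatching work to a heterogeneous node:

```python
# Hypothetical software registry: (program, platform) -> installed
# binary. A grid scheduler would consult something like this when a
# job says "I want to run Blast" and the target node could be
# running any of several operating systems.
REGISTRY = {
    ("blast", "linux-x86"): "/opt/blast/2.2/linux/blastall",
    ("blast", "solaris-sparc"): "/opt/blast/2.2/solaris/blastall",
    ("hmmer", "linux-x86"): "/opt/hmmer/2.3/linux/hmmsearch",
}

def resolve(program, platform):
    """Return the path to a build of `program` for `platform`,
    or fail loudly if no such build has been registered."""
    try:
        return REGISTRY[(program, platform)]
    except KeyError:
        raise LookupError("no build of %s for %s" % (program, platform))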
This is a general problem in bioinformatics. Unlike physicists, who generally have one program that they want to run in many places, bioinformaticists often want to run many programs in many places. That means moving and tracking huge amounts of data. “We submitted 20,000 jobs to Condor,” a distributed computing system developed at the University of Wisconsin-Madison and heavily used in high-energy physics, says Minnesota’s Retzel. “In bioinformatics it’s not uncommon for one person to submit as many as 500,000 jobs,” he says. “We break Condor because we submit too many jobs.”
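The job counts Retzel cites follow directly from the shape of the work: a genome is cut into windows, and every window is paired with every analysis program, so job totals grow multiplicatively. A back-of-the-envelope sketch, with chunk size and program names invented purely for illustration:

```python
def make_jobs(sequence_length, programs, chunk_size=100_000):
    """Split a genome into fixed-size windows and pair each window
    with every analysis program: one grid job per (window, program)."""
    jobs = []
    for start in range(0, sequence_length, chunk_size):
        end = min(start + chunk_size, sequence_length)
        for prog in programs:
            jobs.append({"program": prog, "start": start, "end": end})
    return jobs

# A 40-megabase fungal genome with four analyses per window
# (program names are illustrative): 400 windows x 4 programs.
jobs = make_jobs(40_000_000, ["blast", "hmmer", "genscan", "repeatmasker"])
# len(jobs) == 1600 at these settings
```

Scale the same arithmetic to a 3.1-gigabase human genome, or to several genomes compared pairwise, and six-figure job counts arrive quickly.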
Despite all the challenges, NC BioGrid hopes to have a fully operational grid within a year. “We’re there as a guinea pig,” says NCSU’s Brown. “They wanted us there because we have serious applications that will test their system. And we wouldn’t get upset when everything kept breaking.”
Being Part of Something Bigger
Even if the technological hurdles to making a global genomics grid a reality are cleared, there are still perhaps even more daunting sociological and logistical issues to deal with. The grid movement traces its roots to the open-source community and follows in the tradition of Linux clusters — a revolution of commodity processors and non-proprietary operating systems. Linux opened the power of supercomputing to the less financially endowed by letting them string together hundreds or thousands of Intel-based processors to generate an awesome amount of compute power.
The grid would take this notion a step further. “Right now there are certain institutions or certain labs that can do anything. And there are many institutions and many labs that are very limited in what they can do,” says Retzel. “And part of the reason I have interest in this is because tools like this will help level the playing field.” It’s a nice utopian idea. But in reality, in a world where patentable research is gold to universities and investigators race for first publication, there needs to be some incentive for the haves to share with have-nots. Why would they give up their competitive advantage?
“Because you become part of something that’s bigger,” says Retzel. “But there’s a selfish motive as well: The ability to garner resources that you didn’t pay for becomes the carrot to participate.” Some small, focused labs or institutions will have specialized knowledge or equipment that better funded ones don’t. Commercial grid participants will likely charge for access to their resources. But the technologies to take care of the resulting billing and accounting issues have yet to be created.
Another issue that has to be worked out is authentication. “You’d like to think that everybody is altruistic, very smart, and well-meaning,” says Retzel. “But the reality is that some people are altruistic, well-meaning, and don’t really understand what they’re doing,” he says. “And that’s very dangerous.” An effective grid would require some sort of review system. “How do you take information that you can derive quickly and make it part of the database, but also give [users] some level of confidence that somebody besides an undergrad at South Florida Community College ran some program and decided you were wrong?” One possibility, says Retzel, would be to introduce a peer-review process in which researchers can get publication credit for, say, interpretation of a particular piece of the genome.
Retzel suggests that scientific journals might play a role in monitoring the quality of data distributed across a grid. “What I’ve tried to do is to start working with NCBI and Plant Physiology, one of the premiere journals in the plant community,” says Retzel, a legume researcher. His proposal, if funded and approved by the editorial board, would use some of the journal’s resources to review entries into databases, such as computational results or annotations. “NCBI is actually very anxious to do it,” says Retzel.
Then there’s security. By most definitions, a grid has no central control. So there would need to be embedded protocols that control who gets access to which data and who has rights to make changes. For example, there has to be some way that collaborators working on a grid can share data without exposing it to the rest of the grid community.
Like the World Wide Web in its early days, grid computing is capturing imaginations and is high on hype. “Grid computing is the next evolution of the Internet,” says IBM’s Farazdel. To be sure, real grids are still futuristic ideals. Standards have yet to be set, protocols must be written, and advanced technologies need to be developed. “What our customers are implementing right now is looking more like a cluster, or a cluster of clusters. They are not really grids,” Farazdel adds.
Nevertheless, to Farazdel and others, grids are inevitable and integral to the future of systems biology. “The kind of problem that grid nirvana can solve is not addressable by any other distributed computing technology,” he says. “We are a long way from it, but we are going in that direction.”