When the bioinformatics team at the University of Minnesota determined that available grid computing tools like Globus weren’t quite mature enough to help them distribute their annotation tasks for the Medicago truncatula sequencing project, they didn’t let that stop them. Instead, Chris Dwan and his colleagues built their own system to distribute a large set of jobs across several clusters on the University of Minnesota campus and beyond, making “a real live grid, except without any of the Globus or Global Grid Forum trappings to it,” Dwan said. “It’s just what a halfway intelligent person who knows Perl and has some friends could do.”
It turns out that Dwan’s friends were just as important as his coding ability in getting the project off the ground. Faced with the task of keeping the annotation for M. truncatula current using only the 20-node cluster in his lab, a process that would take around three weeks for each annotation cycle, Dwan said his group had little choice but to enlist the aid of outside clusters. Drawing on his network of “friends,” Dwan soon gained access to several additional computing facilities on campus, as well as at the Center for Integrated Fungal Research at North Carolina State University, which was collaborating on the M. truncatula annotation project.
Before too long, Dwan had access to the university’s 160-CPU supercomputing center, 12 Apple G4 workstations in a university computing lab, another small cluster of three Apple Xserves, and 20 PCs from the North Carolina group in addition to his own cluster. Recently, the project has run some test jobs on a computing lab in the university library that contains 30 Macs and 90 PCs.
“Right now, to run a complete annotation takes about five days,” Dwan said, which means the M. truncatula genome can be re-annotated once a week instead of once a month.
Dwan said that the “social aspects” of gaining access to his distributed set of computational resources proved to be the biggest hurdle in the task. “Once we have decided that we’re going to collaborate, from the technical side you can do it a number of ways,” he said.
The technical side of the project proved to be relatively straightforward, although there were a few obstacles to overcome. The team wanted to use the Ensembl annotation pipeline, which automatically runs raw sequence data through a series of gene prediction algorithms. All of Ensembl’s code is open source, but it’s written to run on the Sanger Institute’s massive computing infrastructure, so it had to be modified to run on a distributed, heterogeneous system. “What I had in the Ensembl pipeline was a script that submitted a bunch of jobs to one cluster,” Dwan said. “All that we really did was to add a layer of indirection to that so that instead of submitting to just one cluster, I am authorized to use resources on a bunch of different clusters, so we just go round robin.”
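The round-robin indirection Dwan describes can be illustrated with a short sketch. This is not his code, which was written in Perl and has not been released; the function name, cluster labels, and job names below are hypothetical, and a real submission step would go where the print statement is.

```python
from itertools import cycle


def assign_round_robin(jobs, clusters):
    """Pair each job with the next cluster in a fixed rotation."""
    if not clusters:
        raise ValueError("at least one cluster is required")
    rotation = cycle(clusters)
    return [(job, next(rotation)) for job in jobs]


# Hypothetical cluster labels and job names, for illustration only.
clusters = ["lab-cluster", "supercomputing-center", "ncsu-pcs"]
jobs = [f"contig_{i:03d}" for i in range(5)]

for job, cluster in assign_round_robin(jobs, clusters):
    # A real system would hand the job to the cluster's own scheduler here.
    print(f"{job} -> {cluster}")
```

The point of the extra layer is that the pipeline script still submits one job at a time; only the choice of destination changes.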
Dwan opted to forgo Globus and other grid middleware tools in favor of writing his own scripts to distribute the jobs. “My experience has been that Globus right now is really tricky to get working,” he said. “All of my collaborators have existing computational resources that are serving the needs of their owners right now. So for me to walk up and say, ‘Hey, let’s modify the way you give access to those resources (in effect, let’s break your existing system) so I can get some juice out of it,’ that didn’t fly.”
Instead, Dwan wrote a reservation agent that allows him to store user names and passwords for each of the systems he is authorized to use in a file, “and instead of submitting to one, I can submit to any of them,” he said. Once his lab’s cluster is full — generally at about 40 jobs — the scripts redirect jobs from system to system “to their comfort level.”
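The spill-over behavior can be sketched as a first-fit dispatcher: fill the home cluster to its limit, then move down the list of remote systems until each reaches the load its owner is comfortable with. This is only an illustration under stated assumptions. The class and function names are hypothetical, the only limit taken from the article is the roughly 40 jobs on the lab cluster, the remote limits are invented, and the stored credentials are left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    comfort_level: int  # most concurrent jobs the owner is comfortable with
    running: int = 0

    def has_room(self):
        return self.running < self.comfort_level


def dispatch(jobs, clusters):
    """Place each job on the first cluster with room, in priority order.

    Returns (placements, waiting): a dict mapping job to cluster name,
    and a list of jobs that found no free slot anywhere.
    """
    placements = {}
    waiting = []
    for job in jobs:
        target = next((c for c in clusters if c.has_room()), None)
        if target is None:
            waiting.append(job)
            continue
        target.running += 1
        placements[job] = target.name
    return placements, waiting


# The lab cluster comes first; the remote limits here are made up.
pool = [
    Cluster("lab-cluster", comfort_level=40),
    Cluster("campus-g4-lab", comfort_level=12),
    Cluster("ncsu-pcs", comfort_level=20),
]
placed, queued = dispatch([f"job_{i}" for i in range(80)], pool)
print(len(placed), "placed;", len(queued), "waiting")
```

Keeping each owner's limit as data rather than logic fits the social arrangement the article describes: a collaborator can lower the number without anyone touching the scripts.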
One shortcut Dwan arranged involved a bit of diplomacy. To overcome the “data motion problem,” in which quick access to 20 GB of data would be required to start each job, his team offered to provide each of the collaborators’ machines with an additional 80 GB disk, with 20 GB reserved for the M. truncatula data, which is updated nightly. “You could see [the computer lab administrators’] eyes light up,” Dwan said.
Dwan said that he’d like to share the scripts he’s written with other groups, but the software is still too rough around the edges to release without adequate support. The goal, he said, is to eventually set up a community-based project on SourceForge, “but right now, as a bioinformatics lab, our first goal is to serve the people who are paying the bills, and that’s this community of Medicago researchers who want some additional stuff out of the annotation.”
But Dwan was quick to point out that his solution may not be for everybody, particularly those without the budget constraints of an academic lab. “If you have the money, don’t do what I did,” he said. “Buy a big cluster.” Furthermore, he added, the scripts he wrote are basically a short-term solution until better, standardized tools emerge from the grid computing development community. “If somebody could hand me a grid middleware solution to do these things so that I didn’t have to write them, I would happily throw away my last year of work,” he said.