By Meredith W. Salisbury
So you run a lot of Blast jobs. Who doesn’t? But unless you’re at a production-scale genome sequencing center or a similar institution that does a great deal of sequence searching, you probably don’t have an entire compute cluster at your disposal that’s optimized for the Blast algorithm.
Steven Brenner, an associate professor in the plant and microbial biology department at the University of California, Berkeley, says that clusters designed to handle a host of computational problems are becoming more common than the Blast-specific flavor. “Increasingly, at least in our experience, people have general-purpose clusters,” he says. Brenner’s own lab uses such a cluster, and he says that someone might be running Blast one day and collecting synchrotron data the next. “It’s entirely disparate types of things at different times,” Brenner says.
That makes running Blast, a very specialized and somewhat finicky algorithm that prefers highly optimized clusters, a challenge. Using the tool on general-purpose clusters means having to manually chop up your data, submit a huge number of searches, and resubmit whenever a job fails, as certain jobs invariably will. Like so many genome scientists, Brenner’s team used to take this approach to their Blast searches. “We would manually go and see if all the jobs were done,” he says, and for those that failed, “submit those again and again until we got tired of doing them.”
And once they got really tired of it, Brenner and his colleagues began working on a database search tool that would manage this task for them. Known as ANDY — the acronym comes from letters seemingly taken at random from the official “search coordination and analysis” name — and developed by Andrew Smith, John-Marc Chandonia, and Brenner, the tool is aimed at users trying to tame general-purpose clusters into running Blast efficiently. ANDY “automatically goes and divides [your Blast query] into smaller subtasks and sends those to cluster nodes,” Brenner says. “[The result] then goes back to the root node to be integrated.”
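The divide, dispatch, resubmit, and integrate cycle Brenner describes can be illustrated with a minimal Python sketch. This is not ANDY's code: the function names `split_query` and `run_with_retries`, the chunk size, and the retry limit are all assumptions made for illustration, and the "cluster node" is any callable that returns result lines or `None` on failure.

```python
from typing import Callable, Dict, List, Optional


def split_query(sequences: List[str], chunk_size: int) -> List[List[str]]:
    """Divide a list of query sequences into subtasks of at most chunk_size."""
    return [sequences[i:i + chunk_size] for i in range(0, len(sequences), chunk_size)]


def run_with_retries(
    subtasks: List[List[str]],
    run_subtask: Callable[[List[str]], Optional[List[str]]],
    max_attempts: int = 3,  # hypothetical limit, not taken from ANDY
) -> List[str]:
    """Run every subtask, resubmit the ones that fail, and merge results in order.

    run_subtask stands in for a cluster node: it returns a list of result
    lines, or None if the job failed.
    """
    results: Dict[int, List[str]] = {}
    pending = list(range(len(subtasks)))
    for _ in range(max_attempts):
        failed = []
        for index in pending:
            output = run_subtask(subtasks[index])
            if output is None:
                failed.append(index)  # remember it for the next round
            else:
                results[index] = output
        pending = failed
        if not pending:
            break
    if pending:
        raise RuntimeError(f"subtasks {pending} failed after {max_attempts} attempts")
    # Integrate on the "root node": concatenate results in original order.
    merged: List[str] = []
    for index in range(len(subtasks)):
        merged.extend(results[index])
    return merged
```

In practice `run_subtask` would wrap a real search job submitted to the cluster; here it is left abstract so the bookkeeping that Brenner's team once did by hand is easy to see.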
Non-specialized clusters usually run DRMs, or distributed resource management programs, that parcel out nodes and balance their use as jobs come in. ANDY “is designed to be able to interface with all of [the DRMs],” Brenner says. Current modules make ANDY compatible with a few DRMs, and future modules will expand that list.
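One common way to support several DRMs through interchangeable modules is a small shared interface that each module implements. The Python sketch below shows that general pattern only; the class names `DRMModule` and `InMemoryDRM` and the methods `submit` and `is_finished` are hypothetical and are not taken from ANDY's actual module API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class DRMModule(ABC):
    """Hypothetical interface that every DRM-specific module would implement."""

    @abstractmethod
    def submit(self, command: List[str]) -> str:
        """Hand a command to the DRM and return its job identifier."""

    @abstractmethod
    def is_finished(self, job_id: str) -> bool:
        """Report whether the DRM considers the job complete."""


class InMemoryDRM(DRMModule):
    """Stand-in module for illustration: records jobs instead of running them."""

    def __init__(self) -> None:
        self._jobs: Dict[str, List[str]] = {}

    def submit(self, command: List[str]) -> str:
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = command
        return job_id

    def is_finished(self, job_id: str) -> bool:
        if job_id not in self._jobs:
            raise KeyError(job_id)
        return True  # the stand-in treats every recorded job as done


def submit_all(drm: DRMModule, commands: List[List[str]]) -> List[str]:
    """Coordinator code depends only on the interface, not on a specific DRM."""
    return [drm.submit(command) for command in commands]
```

Under this design, supporting a new DRM means writing one more subclass, while the coordinating code stays unchanged.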
He notes that there are plenty of programs targeted at large-scale sequence searching that do much the same thing as ANDY. The tool differentiates itself in two ways: it is completely free and open-access, and it offers a distinctive two-mode feature. In the more efficient mode, ANDY sends a job to each cluster node and your query essentially takes over the cluster until it’s done. There’s also a “fair use” mode that sends out one subtask at a time and, between searches, allows other people’s queries to run before your next search is sent. While it takes a little longer, it’s “much more friendly” for general-purpose clusters that are shared among many researchers.
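The difference between the two modes is easiest to see in a toy model. The Python sketch below assumes a cluster with a single job slot and a hypothetical `schedule` function; it only illustrates the ordering effect described above and says nothing about how ANDY actually implements either mode.

```python
from collections import deque
from typing import List


def schedule(my_subtasks: List[str], other_jobs: List[str], fair_use: bool) -> List[str]:
    """Return the order in which jobs would run on a one-slot toy cluster."""
    mine = deque(my_subtasks)
    others = deque(other_jobs)
    order: List[str] = []
    if not fair_use:
        # Efficient mode: all of our subtasks are queued at once, so they
        # run back to back and everyone else waits until they are done.
        order.extend(mine)
        order.extend(others)
        return order
    # Fair-use mode: send one subtask, then let one waiting job from
    # another user run before the next subtask is sent.
    while mine or others:
        if mine:
            order.append(mine.popleft())
        if others:
            order.append(others.popleft())
    return order
```

With three subtasks and two jobs from other users, efficient mode finishes the query first, while fair-use mode interleaves the two queues, so the last subtask finishes later but nobody else is locked out.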
The tool scales well for both small and large clusters, Brenner says. ANDY uses a master node in the cluster to coordinate job assignments, “which should make it more efficient for large clusters than many other tools,” he says. But he warns that exceptionally sizable clusters may have trouble. If a cluster is big enough, “at some point it will overwhelm the ability of one node to oversee [the job].”
Brenner says his team has used ANDY for many kinds of queries that have a variety of parameters, including structure comparisons and building phylogenies. “We built it to be able to handle repetitive tasks other than Blast,” he says.
Because the program code is readily available through the Berkeley website, Brenner hopes researchers will add modules to make ANDY compatible with, say, other DRMs. “We encourage people to add extensions,” he says, “and contribute [them] back so the whole community can benefit.”