By Meredith W. Salisbury
Determined residents of the do-it-yourself community, Bud Mishra and his New York University labmates find it hard to believe that the cluster they’ve built is all that special. For them, it’s simply a tool: it enables better and more interesting science, and it works so well they sometimes forget it even exists. For your average genomics lab, that’s not just special — it’s a dream come true.
Mishra, a professor and scientist at NYU’s Courant Institute, has made his name developing complex algorithms used for genomics research, such as the enabling informatics tools for David Schwartz’s optical mapping technology.
His lab has collaborators in as many fields as he can handle: cancer research, comparative genomics, evolutionary genomics, and aging disease studies, to name some. In order to accommodate this incredibly detailed research, Mishra and his crew “decided to invest a large amount of our effort understanding the physical properties of the genome,” he says. “If I could walk along the genome, if I could touch it, what would it feel like? Would it be unstable? If I pull off a piece, will it have a secondary structure making microRNA?”
Fascinating questions, but clearly ones that are incredibly compute-intensive. Mishra has plenty of friends who could get him access to major, 200-plus-node clusters, but borrowing time on someone else’s machine still wasn’t as convenient as running the problems in his own lab. As if on cue, the state of New York offered Mishra $750,000 through a faculty retention program, “with the condition that I would lose it if I crossed the state line,” he says. He earmarked $250,000 of that for new hardware, including a cluster.
But a lab full of computer scientists wasn’t about to write a check to a vendor and watch their cluster be installed, Mishra says. In four weeks and with $90,000, his group pulled together an enviable 16-node, dual 2.4-GHz processor Linux cluster with a terabyte of disk storage, connected with a low-latency, high-speed network. The only snag Mishra’s lab encountered was heat: after it was built, the cluster had to be moved to a different room with better cooling capacity.
Building the cluster themselves gives the researchers an edge in using it, too. Knowing the architecture as thoroughly as they do is a major plus when it comes to adding hardware. And if more processors need to be tacked on, the team sends in an order, gets the chips (more cheaply than they would with a vendor-built cluster), and just “stays up, gets a lot of pizza, and pulls it together overnight,” Mishra says. Thanks to a computing environment developed in his lab, the cluster works constantly and is connected in such a way that “if a node fails, you won’t even see it.”
The burst of compute power has changed the way Mishra’s team can approach genome informatics. “When I click on a region, I actually see the difference right after I lift my finger on the mouse,” he says. “Some people argue that this is overkill; you could preprocess and wait three hours and get the same thing.”
But those three hours are valuable to Mishra. Having access to real-time answers from involved computational questions “introduces a sense of playfulness so important in computer science and mathematics and so lacking in biology,” he says. The computer scientist’s equivalent of wondering aloud is almost eliminated “if it takes you six months to write a program and every time you ran it it took 24 hours.” With instant results, Mishra and his team can write algorithms to answer questions they never would’ve had time for before — among his favorites is creating artificial genomes and figuring out what they might look like if they were real.
But the true test of any technology, of course, is its invisibility. Mishra pays it one of the highest compliments: “I’ve gotten used to this cluster so much I don’t think about it,” he says.