By Meredith W. Salisbury
Two and a half years ago, Wu-chun Feng had no idea what Blast was. So it might’ve come as quite a surprise to him back then to hear that by the end of 2004, a parallelized version of Blast that he’d dreamed up would have been downloaded more than 10,000 times by researchers throughout the life science community.
In December, Feng and his crew were busy preparing for a new release of the freely available mpiBlast, a highly parallelized version of the ubiquitous sequence search tool that boasts speeds up to 170 times faster than NCBI’s version of the open-source tool. And last year, mpiBlast won an R&D 100 award from R&D Magazine.
Feng, a team leader and institute fellow at Los Alamos whose research has traditionally focused on high-speed networking, stumbled into the bioinformatics field about two years ago during an attempt to solve a basic computing problem. “Our traditional cluster was failing every three to seven days,” he says. He came up with the idea for an alternative-architecture machine that would be dubbed Green Destiny, a 240-node supercomputer with a five-square-foot footprint that consumes three kilowatts of power, or about the equivalent of two hairdryers, Feng says.
The systems community showed little interest in the machine, noting that its lower-memory-per-node structure wouldn’t win any power contests. But pharmaceutical and bioinformatics companies latched on, regularly asking Feng to give talks. When he asked each audience what their most-used program was, the dominant answer was Blast. Feng went back to his lab with an idea for running Blast in parallel to optimize it for his supercomputer.
While most researchers using Blast on a cluster simply copied the sequence database to every node, that wasn’t an option on the low-memory Green Destiny. Feng asked Aaron Darling, a student interning in his lab, to take a crack at the code to implement Feng’s idea: breaking the sequence database into pieces small enough to fit in the memory of each node on Green Destiny. Partitioning the database was the step most of his bioinformatics audiences had told him would be impossible, and the first version of mpiBlast did return badly skewed scoring statistics, or E-values. But the actual sequences returned proved to be spot on, and Feng figured out how to get the worker nodes to report a correct E-value.
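For readers who want to see the idea in miniature, the sketch below splits a toy database into fragments that fit a per-node memory budget and then rescales a fragment-level E-value to the full database. It is only an illustration, not mpiBlast's code: the function names `partition_database` and `rescale_evalue` are hypothetical, and the rescaling assumes the textbook approximation in which an E-value grows in proportion to the size of the database searched. The article does not say how Feng's fix actually works.

```python
from typing import Dict, List


def partition_database(seqs: Dict[str, str], max_residues: int) -> List[Dict[str, str]]:
    """Greedily pack sequences into fragments of at most max_residues letters.

    Hypothetical helper for illustration. A single sequence longer than
    max_residues still gets a fragment of its own.
    """
    fragments: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    current_size = 0
    for name, seq in seqs.items():
        # Start a new fragment when adding this sequence would exceed the budget.
        if current and current_size + len(seq) > max_residues:
            fragments.append(current)
            current, current_size = {}, 0
        current[name] = seq
        current_size += len(seq)
    if current:
        fragments.append(current)
    return fragments


def rescale_evalue(e_fragment: float, fragment_residues: int, total_residues: int) -> float:
    """Scale an E-value computed against one fragment up to the whole database.

    Assumption: E is proportional to database size, so a hit scored against a
    small fragment looks more significant than it really is.
    """
    return e_fragment * total_residues / fragment_residues


# Toy database: three sequences totalling 30 residues, 20 residues per node.
toy_db = {"a": "ACGT" * 3, "b": "AC" * 4, "c": "A" * 10}
fragments = partition_database(toy_db, max_residues=20)
total = sum(len(s) for s in toy_db.values())
```

In this toy case a hit found in the 10-residue fragment would have its E-value multiplied by three, since the full database is three times larger than the piece that was searched.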
In an early test, a job that ran on one processor using a standard query took 22.4 hours, Feng says; running it on 128 nodes with mpiBlast returned an answer in just under eight minutes. That roughly 170X speedup is superlinear, exceeding the 128X that 128 nodes would ideally deliver, and it comes from keeping the machine performing its fastest steps and avoiding time-consuming processes like going to disk.
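The figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers Feng cites; because the parallel run took "just under" eight minutes, plugging in a full eight minutes gives a slightly conservative speedup. The function names are illustrative.

```python
def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """Ratio of single-processor run time to parallel run time."""
    return serial_seconds / parallel_seconds


def parallel_efficiency(observed_speedup: float, nodes: int) -> float:
    """Speedup per node; a value above 1.0 means superlinear scaling."""
    return observed_speedup / nodes


serial = 22.4 * 3600   # 22.4 hours on one processor, in seconds
parallel = 8 * 60      # upper bound: the run took just under eight minutes
nodes = 128

s = speedup(serial, parallel)          # 168.0, a lower bound on the real figure
eff = parallel_efficiency(s, nodes)    # about 1.31, so better than linear
print(f"speedup >= {s:.0f}x, efficiency >= {eff:.2f}")
```

A run even slightly shorter than eight minutes pushes the ratio toward the 170X the team reports.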
Though mpiBlast was originally written for machines like Green Destiny, the program’s “applicability is much broader than we imagined it to be,” says Darling, now a National Library of Medicine fellow at the University of Wisconsin who still manages the technical side of mpiBlast. “We’re seeing exponential growth in the size of sequence databases,” Darling says, so now “every average compute cluster needs the database to be segmented because the database is growing faster than memory is.”
In the latest release of mpiBlast, Darling says, the program is now able to handle variability in cluster resources, “so you can add and remove hardware and mpiBlast will adapt as you change.” Generating exact E-values is another new feature, as is the concept of streaming results — instead of waiting for the entire query to finish to report results, mpiBlast will start giving back data as it completes each section of the query, Darling says. That last part plays into a future step Feng envisions for mpiBlast: a Web server that could give researchers a resource for on-demand Blast searching.
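The streaming behavior Darling describes maps naturally onto a generator, which hands back each piece of the answer as soon as it is ready. The sketch below is a toy illustration under that assumption, not mpiBlast's implementation: `stream_results` and `toy_search` are hypothetical names, and the "search" is a plain substring match against a two-sequence database.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def toy_search(section: str) -> List[str]:
    """Hypothetical stand-in for a real search: names of sequences containing the section."""
    database = {"seq1": "ACGTACGT", "seq2": "TTTTGGGG"}
    return [name for name, seq in database.items() if section in seq]


def stream_results(
    query_sections: Iterable[str],
    search: Callable[[str], List[str]],
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (section index, hits) as each section finishes, not all at the end."""
    for index, section in enumerate(query_sections):
        yield index, search(section)


# The caller sees the first section's hits before later sections are searched.
for index, hits in stream_results(["ACGT", "GGGG", "CCCC"], toy_search):
    print(index, hits)
```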
Going forward, Feng hopes to give the code a more polished feel. “We’re looking much more seriously right now into really cleaning up the code, enhancing the scalability and usability for the greater community,” he says.