Wu-chun Feng, associate professor in the computer science and electrical and computer engineering departments at Virginia Tech, is in the process of optimizing several genomic sequence search tools, including mpiBlast and the Smith-Waterman algorithm, for massively parallel computer architectures like IBM’s Blue Gene and Cell Broadband Engine.
Feng told BioInform this week that adapting mpiBlast and other bioinformatics codes for parallel platforms is going to become “more and more critical” as scientists obtain increasingly complex data from metagenomics and other large-scale experiments. The problem, he said, is that these codes were written for single-CPU environments.
“You are opening up the floodgates and it is going to be a necessity to figure out how to map [bioinformatics] algorithms” onto multicore, hybrid multicore, Cell, or other emerging parallel architectures, he said.
Feng and his colleagues at Virginia Tech are adapting mpiBlast, an open source parallel implementation of NCBI Blast, for IBM’s Blue Gene architecture.
His group is also exploring the IBM Cell Broadband Engine, a processor jointly developed by IBM, Sony, and Toshiba that was originally designed for the PlayStation 3. However, since mpiBlast would be “tricky” to port to the Cell, Feng said he has decided to map the Smith-Waterman alignment algorithm to the processor instead.
“We first got a two-fold slowdown [for Smith-Waterman on the Cell] relative to our multicore implementation,” he said. Additional optimization, which took several months, led to a 25-fold speed-up.
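For readers unfamiliar with the algorithm being ported, the Smith-Waterman local-alignment recurrence can be sketched in a few lines. This is a minimal, unoptimized scoring-only version for illustration; the scoring parameters are arbitrary, and the actual Cell implementation is far more involved:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local-alignment score between sequences a and b
    using the Smith-Waterman dynamic-programming recurrence."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # DP matrix, zero-initialized
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores are floored at zero
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The anti-diagonal dependency structure of the matrix `H` is what makes the algorithm amenable to the Cell’s SIMD-style parallelism, and also what makes the data movement tricky.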
Breaking Blast – on Purpose
Feng said that before mpiBlast, which he and his colleagues initially released in 2002, attempts to parallelize Blast usually fragmented the query file. “It’s an embarrassingly parallel way to do things and it works well, but you can do better,” he said.
Instead, he decided to not only fragment the queries but segment the database. “This is where they said I would break Blast, because the way the scoring of matches occurs is dependent on the search space inside of the database,” he said.
“But I went ahead and did it anyway and sure enough they were correct — I broke it,” he said, explaining that the approach identified the right genomic sequences but the wrong scores, because the scoring is based on the size of the search space. “Then it was just a matter of doing the math to fix the scores,” he said.
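The “math to fix the scores” comes down to the fact that Blast’s expect values scale with the size of the search space: a hit found against one database fragment must be rescaled as if the whole database had been searched. The following sketch is purely illustrative (the constants and function names are assumptions, not mpiBlast’s actual code), but it shows the linear dependence involved:

```python
import math

def evalue(score, query_len, db_len, K=0.041, lam=0.267):
    """Karlin-Altschul-style expect value; K and lam are illustrative
    placeholder constants, not values taken from any real matrix."""
    return K * query_len * db_len * math.exp(-lam * score)

def correct_evalue(e_fragment, fragment_len, full_db_len):
    """Rescale an E-value computed against a database fragment to the
    search space of the full database: E grows linearly with the
    length of database searched."""
    return e_fragment * (full_db_len / fragment_len)
```

Because the dependence on database length is linear, a hit scored against a tenth of the database simply has its expect value multiplied by ten, which is why the fix was tractable once the problem was understood.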
Once the bugs were worked out, mpiBlast was able to achieve greater-than-linear acceleration, Feng said. On the Los Alamos National Laboratory’s “Green Destiny” cluster, where the first version of mpiBlast ran on 128 processors, “we had 170-fold speed-up,” he said.
For the latest production version, which has not yet been released, he said, his team achieved a 325-fold speed-up for 128 processors.
The “wrinkle,” he said, is that the speed-up doesn’t scale well beyond 128 processors with a single-CPU architecture. Beyond that point, “we start to see enough degradation in the speed-up that when we get up to 240 nodes, we only had a speed-up of 230-fold.”
However, this challenge appears to be surmountable for the Blue Gene architecture. So far, using “research code” that has not yet been “properly engineered,” mpiBlast is delivering close to linear scaling — 93 percent efficiency — on a Blue Gene/P system.
“For 32,768 processors we are getting over a 30,000-fold speed-up relative to a single node,” he said.
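Those figures hang together: parallel efficiency is simply achieved speed-up divided by processor count, so a 30,000-fold speed-up on 32,768 processors is in the low-90-percent range, while the earlier cluster’s 230-fold on 240 nodes was about 96 percent. A quick check (the helper function is editorial, not from the project):

```python
def efficiency(speedup, processors):
    """Parallel efficiency: achieved speed-up as a fraction of
    ideal linear scaling."""
    return speedup / processors

# Blue Gene/P run: ~30,000-fold speed-up on 32,768 processors
print(round(efficiency(30_000, 32_768), 2))  # prints 0.92
# Earlier cluster: 230-fold speed-up on 240 nodes
print(round(efficiency(230, 240), 2))        # prints 0.96
```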
At the moment he cannot devote his full-time efforts to this optimization, he said, but juggles his projects in order to continue work on scaling mpiBlast to Blue Gene. He does not have a timeline for the completion of the project, he said.
Too Hot to Run
The mpiBlast project is a joint undertaking between researchers at IBM, North Carolina State University, and Argonne National Laboratory.
IBM’s Janis Landry-Lane, program manager in the company’s deep computing group, said that the company is looking to ensure that scientific applications can map to a range of IT architectures.
“Whether it’s sequencing or researchers doing molecular dynamics or chemistry, there is a big spectrum of application,” she told BioInform. In addition, she said, IBM offers a “variety of architectures and all of them have a place in the high-performance computing environment for this community.”
Landry-Lane noted that IBM currently markets Cell, pSeries, Blue Gene, and clusters based on Intel, AMD, or Cell processors. “Does this make it complicated for us? Yes. We have to figure out which one works best with which application.”
For some researchers, architecture choice comes down to turnaround time. “Here is my job, run it across as many processors as possible and give the answer as soon as possible,” said Landry-Lane.
The other criterion is one she calls throughput measurement, which involves loading up the machine and continuously sending jobs through the processors to see how much work gets done.
Landry-Lane said that IBM is currently working on a series of throughput and turnaround benchmarking tests aimed at gathering precise information for the scientific community, including bioinformatics researchers.
“We are working on these benchmarks and comparing platforms” using a variety of scientific problem types, including Gaussian calculations, computational chemistry, and fluid dynamics, she said. “Because we have so many platforms we have to do a lot of work back at the ranch with our benchmarking center.”
While the financial industry is a big IBM customer, the life sciences and science in general are of great importance to the company, she said. “Science pushes IBM to think about problems in hardware, software, data management, [and] information life-cycle management in a different way. It pushes IBM to create supercomputers to handle this emerging set of problems.”
IBM recently garnered attention with its petaflop-busting hybrid “Roadrunner” supercomputer. Roadrunner’s architecture includes two Cell blades and one AMD blade, along with one “connectivity blade.”
However, Landry-Lane noted that Roadrunner has a “complicated programming model,” and the company views the Cell architecture as more attractive to the bioinformatics community.
“We are looking at bioinformatics to use Cell, not [the] Roadrunner architecture,” she said.
Selling the Cell
Cell’s main processor core, a 64-bit Power Processor Element, hands out work to eight SPEs, or Synergistic Processing Elements, Landry-Lane explained.
The challenge, said Feng, who is working with IBM on Cell-oriented projects, is that Cell is not easy to program. With its eight SPEs and a main processor, “you have this issue of how do you efficiently move data back and forth and how do you hide data movement overheads,” he said.
Distributing work incurs overhead: a non-negligible amount of time is spent transferring data between the PPE and the SPEs. “If the overhead for you to move information to the Synergistic Processing [Elements] outweighs the benefit of computing it on the PPE, then the overheads are too high, so you have to find some way to deal with the overheads, in moving data around. That’s what makes it difficult to program,” he said.
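The tradeoff Feng describes can be stated as a crude cost model: offloading a chunk of work to an SPE only pays off if the transfer time plus the SPE compute time beats simply computing on the PPE. The function below is an editorial sketch of that break-even test, not code from any Cell toolkit, and the bandwidth figure in the example is an assumed round number:

```python
def worth_offloading(bytes_to_move, bandwidth_bps, ppe_time_s, spe_time_s):
    """Break-even test for offloading work from the PPE to an SPE:
    offload only if moving the data (both directions folded into
    bytes_to_move) plus computing on the SPE is faster than
    computing in place on the PPE."""
    transfer_s = bytes_to_move / bandwidth_bps
    return transfer_s + spe_time_s < ppe_time_s

# Small data, big compute saving: offload wins (assumed 25 GB/s bus)
print(worth_offloading(1_000_000, 25_000_000_000, 0.010, 0.001))      # prints True
# A thousand times more data for the same saving: transfer dominates
print(worth_offloading(1_000_000_000, 25_000_000_000, 0.010, 0.001))  # prints False
```

Hiding that transfer cost, for example by overlapping DMA transfers with computation through double buffering, is the programming burden Feng is alluding to.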
The lesson, said Feng, is that not all bioinformatics applications can be expected to scale well or easily.
SIDEBAR: ‘Wonkavision’ for Scientific Data
In addition to its work on improving the computational performance of mpiBlast, Feng and his colleagues are also exploring novel approaches to storing large bioinformatics data sets.
Last fall, Feng’s team won the Supercomputing 2007 Storage Challenge for a project called “ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing.”
Feng described ParaMEDIC as “Wonkavision for scientific data,” because it allowed the team to shrink a petabyte of data so that it could be transferred to a remote site and then re-expanded.
The framework at the heart of this “Wonkavision” technique is the ParaMEDIC storage transfer approach, which decouples computation and input/output in order to reduce I/O overhead, Feng explained.
Feng said that the storage effort grew out of a project his team was conducting that used mpiBlast for an all-against-all comparison on 567 microbial genomes. The computation took two weeks running on 12,000 processors on nine supercomputers distributed across seven US sites, and the results were nearly a petabyte of data.
The team decided to store the data at the Tokyo Institute of Technology, “because that was the only place that had enough space to store the information,” Feng said. However, shipping the dataset via a shared Gigabit Ethernet link would have taken “on the order of half a year to several years to transfer,” he said.
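The back-of-the-envelope arithmetic behind that estimate is simple: a petabyte pushed over a Gigabit Ethernet link takes about three months even at full line rate, and a shared link delivering only a fraction of that easily stretches past half a year. The sketch below makes the calculation explicit; the 25-percent effective share in the example is an assumption for illustration:

```python
def transfer_days(bytes_total, link_bps, effective_share=1.0):
    """Days needed to move bytes_total over a link of link_bps,
    of which only effective_share is available to this transfer."""
    seconds = (bytes_total * 8) / (link_bps * effective_share)
    return seconds / 86_400  # seconds per day

PETABYTE = 10**15  # bytes
print(round(transfer_days(PETABYTE, 10**9)))        # prints 93  (full line rate)
print(round(transfer_days(PETABYTE, 10**9, 0.25)))  # prints 370 (busy shared link)
```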
In response, he and his colleagues developed ParaMEDIC in order to circumvent that tremendous bottleneck by shrinking the data, sending it to Tokyo, and then re-expanding it to the full petabyte, he said.