Pushing the envelope on analyzing mountains of biological data, two research teams have worked out new algorithmic methods, scheduling tools, and input/output approaches to map genomic-analysis software to IBM’s BlueGene supercomputer platform and other parallel architectures.
At the Supercomputing 2008 conference last month in Austin, Texas, scientists from Washington State University and Virginia Tech presented papers on the use of BlueGene’s massively parallel architecture for heavy-duty bioinformatics tasks.
In one paper, Ananth Kalyanaraman and Changjun Wu of the School of Electrical Engineering and Computer Science at Washington State presented a method for parallelizing protein identification in metagenomic data sets for the BlueGene/L system. In the other, Wu-chun Feng of Virginia Tech and colleagues from several other institutions described how they mapped Feng's mpiBLAST sequence-search software to the BlueGene/P platform.
Memory and Metagenomics
Kalyanaraman’s paper focused on new methods to circumvent the lack of available tools for metagenomics data sets. “There seems to be a general excitement over the application of supercomputing in a new, emerging area like metagenomics,” he told BioInform in an interview.
As metagenomics studies churn out more and more sequence data, there is “an immediate need” to determine which open reading frames match up to known proteins in order to determine which ORFs are novel in a given data set, he and his colleagues wrote in their paper.
But, they added, all-versus-all pairwise similarity searches are “computationally prohibitive” for metagenomics analysis, and the complexity of the comparison can compromise output quality. Instead, they decided to leverage the memory and compute power of large-scale distributed memory computers.
Kalyanaraman and colleagues decided to follow in the footsteps of Craig Venter’s Sorcerer II Global Ocean Sampling project, which “is the only work that implements a large-scale methodology for protein family identification in metagenomic data,” they wrote. The GOS researchers developed a three-step process for metagenomic analysis: removing redundant sequences; generating a graph of the set of non-redundant sequences; and then analyzing the graph for so-called “dense” subgraphs, which indicate protein families.
The Washington State scientists noted, however, that this approach poses several challenges. The first step can be distributed across multiple computers for parallelization fairly easily, but the next two steps "require creation of adjacency list and random access," and must therefore be implemented on large-memory symmetric multiprocessing (SMP) nodes.
“Both our method and the GOS method reduce the protein family detection problem to the problem of finding dense subgraphs in graphs,” Kalyanaraman said. The main difference is that the GOS method requires the entire graph be stored in memory before detection. “That is the reason [the GOS project] had to use very large-memory SMP machines in their project,” he said. “We, on the other hand, have shown that this problem can be solved without having to store the graph.”
The Washington State approach relies on the idea of "connected components," which can be dynamically generated in parallel. "As dense subgraphs cannot span across multiple connected components, it is sufficient to subsequently distribute the connected components across multiple processors in a load-balanced fashion and analyze each connected component without requiring to communicate," the authors wrote.
This concept “provides a new way to efficiently break the problem size down significantly into numerous tiny sub-problems, and thereby can allow the use of conventional distributed memory machines,” Kalyanaraman said.
Distributed memory machines scale better than shared-memory machines, Kalyanaraman said. “Shared memory machines are great for sequential algorithms that have large memory requirements, but provide only a limited amount of parallelism,” he said. “Distributed memory machines provide both time and memory advantages but typically are harder for [the] design of algorithms.”
For their study, they used a BlueGene/L supercomputer for the redundancy-removal and connected component-detection steps, and a Linux cluster for the subgraph steps, a choice made primarily for logistical reasons because the BlueGene/L was located at Iowa State University.
Since access to the BlueGene/L was limited, "we used it to solve the most [computationally intensive] phases of our approach," he said. That reduced the magnitude of the challenge, allowing subsequent work on the cluster. "In practice, we could have done all phases in either of the machines," he said.
The code the scientists wrote is in C and MPI. “So it will work on any workstation-based commodity cluster,” Kalyanaraman said. They tested their approach on BlueGene/L so they could scale up to 1,024 nodes. “In fact, we also regularly run our code on a local 24-node Linux cluster,” he said.
He said he appreciates BlueGene’s high-speed network interconnectivity, but noted that the memory per node is limited. “Each node has two processors with only 512 MB [of] RAM to share between them,” he said. By contrast, commodity clusters typically ship with about 2 gigabytes of RAM per node. “This means, for an algorithm to be able to use BlueGene/L effectively, it has to be highly memory-efficient,” he said.
For redundancy removal, the Washington State scientists used a standard pattern-matching heuristic. For generating bipartite graphs, the team modified a parallel sequence clustering algorithm they had previously developed called PaCE. Once the graphs were generated and distributed to the processors, the team applied an algorithm called Shingle that IBM developed for identifying subgraphs within very large graphs.
Kalyanaraman developed PaCE, for Parallel Clustering of ESTs, while he was at Iowa State University working on maize-genome assembly. “For this project we use it, with minor code modifications, to detect connected components because the problems are equivalent in a computational sense for sequence input data.”
Shingle was originally developed for detecting web communities, he said. “We reduced our problem to this problem, and therefore were able to use this algorithm as well for protein family detection.”
In a test run using a 160,000-ORF data set from the Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis database, the Washington State pipeline completed redundancy removal and connected component detection in three hours and 20 minutes, with redundancy removal accounting for 90 percent of the run time, the scientists said.
The connected component-detection phase is faster than the redundancy removal because successful alignments lead to merging of clusters, “and thereby result in drastic reductions in alignment work,” they wrote.
“This corresponds to a 99-percent [reduction] in work when compared to any scheme that deploys an all-versus-all approach,” they wrote.
The downside, however, is that the connected component-detection phase scales “poorly”: because the clustering heuristic eliminates the majority of promising pairs, the work generated by the worker processors was “too aggressively filtered out on the master node,” leaving hardly any alignment work to be redistributed to the worker nodes.
This is both good and bad: only a fraction of the generated pairs were actually aligned, which “drastically” reduced the overall time to solution, but the same reduction in work limited parallel scaling.
Going forward, Kalyanaraman said he and his team want to perform qualitative validations of their approach to protein identification by using, for example, “high-quality benchmarks, basically protein families,” he said. They also plan to parallelize the Shingle algorithm to address memory challenges and see if the approach can scale to much larger datasets.
“The scale of problem we solved in the SC08 paper is really small when compared to what is out there in the CAMERA databases,” Kalyanaraman said, noting that the paper was primarily a proof of concept for the approach.
“While the ideas are fundamentally new and technically sound, our implementation now has to be further fine-tuned before the code can be broadly distributed,” he said.
Porting to Petascale
Virginia Tech’s Feng said that the method he presented at SC08 is also a work in progress. “It’s a hacked prototype, but we know that the techniques work because we have tested them in our prototype and they generated answers that we are looking for,” he told BioInform.
In the paper presented at SC08, Feng and colleagues demonstrated that they could deliver “nearly linear scaling,” or 93-percent efficiency, for mpiBLAST on up to 32,768 cores. They also demonstrated that they could use mpiBLAST to search a microbial genome database against itself, a task previously considered computationally intractable, in 12 hours on a BlueGene/P.
Feng and his co-authors note in the paper that parallel sequence search applications such as mpiBLAST work well on computer systems with hundreds to thousands of nodes, but as these systems reach the petascale, with tens of thousands of nodes, challenges crop up such as irregular run times and increases in I/O time.
“Even if the computation is efficient, the output can easily become a bottleneck,” the scientists wrote, adding that processing the output data and writing it to a file system efficiently are both “critical to achieve high performance” in sequence search applications.
To tackle these challenges, the researchers developed an improved scheduling algorithm and “an innovative technique to do input/output,” Feng said.
The original architecture of mpiBLAST relied on a master-worker design to partition the tasks and a supermaster that delegates those tasks. The challenge with that architecture, Feng said, is that it does not scale well to tens of thousands of processors, so in porting the code to BlueGene/P, the researchers adopted a “more scalable” master-worker paradigm.
Feng explained that if all the workers report to the master for each batch of queries, and all workers must finish before the next batch can be fetched, it is as if all the workers in a factory reported directly to the CEO.
“You can imagine the kind of chaos the CEO has to deal with in terms of bookkeeping and interacting with all the employees. It doesn’t scale to large numbers,” he said.
Therefore, the scientists developed a data-management approach that assigns “an arbitrary ratio of the number of masters to the number of workers,” giving each inner-partition of nodes one master for an arbitrary number of workers.
Under this new structure, each inner-partition can have multiple copies of the database, and each master “can handle multiple worker groups simultaneously, allowing for a more fine-grained and flexible ratio for the total number of masters in the system to the total number of workers.”
Merge the Workers
The method created a challenge that the team had to address: concurrent output to the result file. They evaluated three I/O strategies in a “stress test” using the National Center for Biotechnology Information’s nt database, which includes more than 6 million sequences and has a raw size of 23 gigabytes. Sequences randomly sampled from nt were used as query sequences.
They tested two existing I/O strategies, WorkerIndividual and WorkerCollective, and a third that they developed, called WorkerMerge. In WorkerIndividual, the method used in the original version of mpiBLAST, the workers receive write offsets of buffered output blocks from the master and “go ahead and issue write requests to the shared file system,” the scientists wrote. This model avoids synchronization overhead, the researchers stated, but it “may be inefficient” as a process.
For WorkerCollective, “workers coordinate their write efforts into larger requests.” This system, however, may “incur frequent synchronization” which is “undesirable” for parallel sequence searches.
WorkerMerge “performs asynchronous, decentralized writes with merged I/O requests,” the scientists wrote. In this arrangement the master finishes one result merge for a query and “appoints one of the workers to be the writer for this query.”
The advantage of this method, the researchers said, is that it “takes advantage of collective I/O and removes the synchronization problem.” It collaborates with the scheduling algorithm such that “a large number of workers can be supervised by one master,” they said.
In their evaluation run, the team found that WorkerIndividual “scales poorly,” and both WorkerCollective and WorkerMerge scale well.
Feng noted that while running a problem in the massively parallel computing environment doesn’t immediately solve biology problems like finding missing genes, “the run enables us to pare down and filter what we need to look at to find missing genes.”
Like Kalyanaraman, Feng said that the BlueGene architecture shouldn’t be a requirement for his team’s parallel code. The techniques are applicable to other computer architectures, even “a gaggle of smaller clusters, if you know what you are doing,” he said.
“One of the big problems we had with mpiBLAST has been that people don’t have the computer savvy to know how to install, administer, and configure the application software on the system and how to run it appropriately and partition the job,” he said. “If you don’t know how to do that, the performance you get is not going to be very good.”
Feng said that the rapid increase in sequence data, driven by new technologies, will likely push biologists toward parallel systems. “Right now sequencing is getting faster and faster and our sequence search is getting slower and slower because databases are growing faster than our ability to compute on them,” he said.
“The challenge of doing bioinformatics is suddenly not just going to get exponentially harder but is going to pop out of the ceiling and be astronomically difficult to tackle,” he said.
Feng and Kalyanaraman have been discussing ways to combine their methods, which both researchers described as complementary.
Feng said that Kalyanaraman chose a particularly “hard problem” to tackle and praised the “algorithmic innovation” in his work.
Kalyanaraman, meanwhile, said that Feng’s project yielded “amazing results” and noted that his team could “benefit from Blast being able to run fast.”
Both teams are “taking a complementary approach to a problem that is going to be difficult to solve computationally in the coming years,” Feng said.