By John S. MacNeil
In some respects, cluster computing has become the default solution for a wide array of problems in the computational biology community. From distributing compute jobs across a stable of in-house or remote PCs to utilizing a single piece of hardware designed to split up a calculation across a number of nodes, cluster technology has often been deemed an efficient approach to quickly solving compute problems in life sciences. “Around the turn of the century, people really focused on ‘clusters, clusters, clusters,’” says Dan Stevens at Silicon Graphics.
But there may be alternatives better suited to the problem at hand. At Pacific Northwest National Lab and Oak Ridge National Lab, researchers are engaged in a project to test out different types of computer architecture in an effort to find the specific applications they best serve. The BioPilot project, as the initiative is known, is seeking to define the cases where shared memory systems such as the Cray X1 supercomputer and SGI Altix 3000 are more efficient at addressing the needs of biologists.
While such supercomputers may cost more than a run-of-the-mill cluster setup, researchers at PNNL and ORNL believe making better use of the computing resources they have on hand would prove a boon to computational biologists. In addition to cluster computing resources, both ORNL and PNNL have access to 128-processor SGI Altix machines, and ORNL has a 59-teraflop Cray X1. Part of the BioPilot initiative will involve adapting code for both cluster and shared memory architectures, and comparing the computing performance.
With $2 million split evenly between PNNL and ORNL from the Department of Energy’s Office of Advanced Scientific Computing Research, the BioPilot project has already identified three areas that may benefit most from shared memory system architectures, says T.P. Straatsma, PNNL’s associate division director for computational biology and bioinformatics, and principal investigator on the BioPilot project. “The objective of this project is to look into how shared memory architectures can help us in three different areas of computational biology — mainly proteomics, biological network analysis, and molecular modeling — and to compare and explore computer architectures for use in these particular areas,” he says.
Unlike cluster computing, which is designed to tackle problems that can be split into small pieces and doled out to individual processors — so-called “embarrassingly parallel” problems — a shared memory system like the SGI Altix or Cray X1 makes the most sense for problems that require a large amount of memory on hand to complete the calculation, and where solving individual pieces of the problem requires the results of other processors’ tasks. More precisely, a shared memory system occupies the middle ground between a true supercomputer, in which every processor is seamlessly interconnected with its neighbors, and a cluster architecture, in which each node has only its own local memory and must exchange data with other nodes by passing messages over an interconnect.
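The distinction can be made concrete with a schematic sketch (invented for illustration, not BioPilot code). In the shared-memory version every worker reads any element of one common array directly; in the cluster-style version each “node” owns only its slice and must ship its partial result as an explicit message. Python threads stand in for processors here; on a real cluster the message-passing half would run across machines with something like MPI.

```python
import threading
import queue

def shared_memory_sum(data, n_workers=4):
    """Workers share one address space: each reads its slice directly."""
    chunk = len(data) // n_workers
    partials = [0] * n_workers            # also lives in the shared space
    def work(i):
        lo = i * chunk
        hi = len(data) if i == n_workers - 1 else lo + chunk
        partials[i] = sum(data[lo:hi])    # direct access, no messages
    threads = [threading.Thread(target=work, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)

def cluster_style_sum(data, n_nodes=4):
    """Each 'node' sees only its own slice; results travel as messages."""
    chunk = len(data) // n_nodes
    mailbox = queue.Queue()
    def node(rank, local_slice):          # local_slice is the node's private memory
        mailbox.put(sum(local_slice))     # explicit send to the collector
    threads = []
    for r in range(n_nodes):
        lo = r * chunk
        hi = len(data) if r == n_nodes - 1 else lo + chunk
        threads.append(threading.Thread(target=node, args=(r, data[lo:hi])))
    for t in threads:
        t.start()
    total = sum(mailbox.get() for _ in threads)   # explicit receives
    for t in threads:
        t.join()
    return total
```

Both functions compute the same answer; the difference is that the cluster-style version must serialize and move data between private memories, which is exactly the overhead that grows painful when pieces of a problem depend on each other’s intermediate results.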
Straatsma says it’s hard to generalize about what types of biological computations would benefit from shared memory systems. Most algorithms are built with a shared memory architecture in mind, says SGI’s Stevens, but only in the specifics of a particular calculation does it become clear whether computer scientists should optimize the algorithm for a particular hardware configuration. “So much of it comes down to case-by-case considerations,” says Andrey Gorin, an ORNL researcher involved in the BioPilot project. “But we tried to select applications that are very painful for clusters.”
Proteomics calculations are good candidates for shared memory systems, says Gorin, because identifying a peptide fragment from mass spectrometry data typically involves large numbers of “look-ups” in a database of theoretical protein fragments. Commercial protein mass spectrometry software packages, such as Thermo Finnigan’s Sequest and Matrix Science’s Mascot, apply this approach when attempting to identify the parent protein to which a peptide fragment belongs.
This strategy for identifying proteins from mass spectrometry data works well when the fragmentation patterns are fairly clean, Gorin says, but problems arise when the protein to be identified contains a potentially important mutation not found in the database. An algorithm developed in Gorin’s lab addresses this issue, and it is notable, he says, that his group’s de novo peptide sequencing algorithm succeeds without the aid of high-end mass spectrometers like the Fourier-transform instrument.
Shared memory architectures give each processor instant access to the indices that can dramatically accelerate identification of unexpected peptides in proteomics samples. For example, an index of all possible amino acid combinations (up to a certain peptide length, e.g., eight amino acids) instantaneously provides information about all present and absent variants of a peptide, says Gorin. “The analysis of the potential point mutation can be done much more rapidly by the faster assessment of the potential peptide variants,” he adds.
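The idea can be sketched in a few lines (the function names and the toy database below are invented for illustration, not BioPilot’s code): once every peptide substring up to some length is held in one large in-memory index, checking which point-mutation variants of an unmatched peptide do exist in the database becomes a series of constant-time lookups — feasible precisely because a shared-memory machine lets every processor reach the one big index.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_index(proteins, max_len=8):
    """Index every peptide substring of length 1..max_len in the database."""
    index = set()
    for seq in proteins:
        for i in range(len(seq)):
            for j in range(i + 1, min(i + max_len, len(seq)) + 1):
                index.add(seq[i:j])
    return index

def mutation_variants(peptide):
    """Generate every single point mutation of a peptide."""
    for pos in range(len(peptide)):
        for aa in AMINO_ACIDS:
            if aa != peptide[pos]:
                yield peptide[:pos] + aa + peptide[pos + 1:]

def explain_unmatched(peptide, index):
    """For a peptide absent from the database, list the point-mutation
    variants that ARE present -- candidate unmutated parent peptides."""
    if peptide in index:
        return [peptide]
    return [v for v in mutation_variants(peptide) if v in index]
```

For a peptide of length n there are 19n single-mutation variants, so each “unexpected” peptide costs only a few hundred set lookups against the prebuilt index, rather than a rescan of the database.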
Gorin says the advantages of this approach should result in significantly higher rates of positive protein identifications. In a typical complex protein analysis by mass spec, up to 75 percent of the peptides in a sample remain unidentified, but de novo methods such as the approach developed by Gorin’s group should push the level of unidentified peptides down to just 25 percent, he adds.
The dynamic modeling of cellular networks represents another application of computational biology that could take advantage of shared memory computing architecture, because the disparate cellular processes under investigation occur on widely varying time scales, PNNL’s Straatsma says. Stochastic modeling of these networks requires combining processes that occur on short time scales with those that occur on much longer time scales, so that results from one part of the simulation can feed back into a different process. Pulling this off will undoubtedly require shared memory systems, Gorin adds, as the number of interacting processes grows with the complexity of the cellular model.
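A standard approach to this kind of stochastic modeling is Gillespie’s stochastic simulation algorithm, sketched minimally below (the toy two-species gene-expression network is invented for illustration, not a BioPilot model). The sketch also shows why separated time scales are expensive: fast reactions fire many times between each slow event, so every step of the simulation depends on the current state of all the others.

```python
import random

def gillespie(x0, propensities, stoich, t_end, rng=random):
    """Simulate one trajectory of a stochastic reaction network.
    x0: initial molecule counts; propensities(x): rate of each reaction at
    state x; stoich: per-reaction change vectors."""
    t, x = 0.0, list(x0)
    trajectory = [(t, tuple(x))]
    while t < t_end:
        a = propensities(x)
        a_total = sum(a)
        if a_total == 0:
            break                              # no reaction can fire
        t += rng.expovariate(a_total)          # time to next reaction event
        r = rng.random() * a_total             # choose which reaction fires,
        k = 0                                  # weighted by propensity
        while r > a[k]:
            r -= a[k]
            k += 1
        for i, dx in enumerate(stoich[k]):
            x[i] += dx
        trajectory.append((t, tuple(x)))
    return trajectory

# Toy network: slow gene expression feeding a fast protein turnover cycle.
def propensities(x):
    mrna, prot = x
    return [0.5,           # gene -> mRNA            (slow)
            20.0 * mrna,   # mRNA -> mRNA + protein  (fast)
            0.1 * mrna,    # mRNA decay
            1.0 * prot]    # protein decay
STOICH = [(1, 0), (0, 1), (-1, 0), (0, -1)]
```

Because every event updates a global state that the next event depends on, the algorithm resists the embarrassingly parallel decomposition clusters favor — which is the crux of the argument for running such models on shared-memory hardware.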
Gorin sees the BioPilot project as an essential exercise to advance computer science — and, by extension, computational biology. “When hardware stops developing because of inherent limitations, then human creativity will have to discover smart ways to move forward,” he says. “Biology is really an area where [optimizing] software and hardware architecture will be needed.”
Computing Architecture in a Box
In contrast to a traditional supercomputer, where proprietary, specialized processors and memory are seamlessly interconnected with their neighbors, a shared memory, or symmetric multiprocessing, system consists of tightly connected, standards-based microprocessors and memory. Like traditional supercomputers, shared memory systems have a single memory address space in which multiple processors and I/O nodes can access all the data in the system’s memory. In contrast to both, a cluster computing scheme relies on distributed memory, where memory is local to each processor node.
Mixed Architecture: SGI’s Project Ultraviolet
Another alternative to cluster computing waiting in the wings is Silicon Graphics’ Project Ultraviolet, a beta-stage effort to produce a data-intensive hardware architecture that packs off-the-shelf processors together with field-programmable gate arrays or other types of application-specific hardware under one centrally coordinated global memory space.
The advantage to this “multi-paradigm” computing, according to Dan Stevens, life, chemical, and materials business development manager for SGI, lies in its ability to switch easily between calculations that are best handled by application-specific hardware — such as molecular modeling or image processing tasks that require huge amounts of processor time — and more mundane calculations that are best served by the off-the-shelf processors also located within the central kernel.