Two DOE Labs Buck the Biocluster Trend to Test-Drive a Few Architecture Alternatives

Cluster-based architectures have quickly become the standard in bioinformatics and other computationally intensive research domains, but some scientific computing experts are beginning to doubt the broad applicability of this approach.

In an effort to address this concern, the US Department of Energy has initiated a one-year exploratory research project to evaluate the performance of several emerging bioinformatics applications on a range of computational architectures. DOE’s Office of Advanced Scientific Computing Research awarded $1 million each to Pacific Northwest National Laboratory and Oak Ridge National Laboratory to carry out the so-called BioPilot project, which was formally launched earlier this fall.

“The one thing we know about these science applications is that one size does not fit all, and one architectural type does not fit all,” said Gary Johnson, program manager for the Advanced Computing Research Testbed program at the DOE’s ASCR. “So it’s very useful to have a variety of architectures available.”

The primary goal of the project is to determine which architectures are best suited for scientific applications in an area the DOE has dubbed “data-intensive” computing. DOE is still “going through a process of defining what we mean by data-intensive computing,” according to TP Straatsma, PNNL’s associate division director for computational biology and bioinformatics and co-PI on the BioPilot project. Nevertheless, he said, the working definition “is when you have a large amount of data and you don’t have one specific model with which you can extract the information out of that data.”

Computational biology offers a number of examples of this situation, Straatsma said. In the case of proteomics, for example, “you’re looking at a large amount of data from which you try to extract things like the networks that govern signaling, or gene expression, or the whole metabolism of cells.” Unlike some scientific fields that rely on large data sets, but tend to have only one particular method to extract information from that data, proteomics “requires a lot of different methods and techniques to be applied to that data in order to get that knowledge out of it,” Straatsma said.

Under BioPilot, the PNNL and ORNL researchers will explore these challenges of proteomics data analysis, in addition to network reconstruction and molecular modeling — the three areas within computational biology that the DOE identified as the most data-intensive.

BioPilot is only one half of a larger DOE program to evaluate data-intensive scientific applications. Johnson said that a similar project is underway for high-energy physics at the Stanford Linear Accelerator Center. BioPilot overlaps only “conceptually” with some of the computational biology projects that the DOE is supporting under its Genomes to Life Initiative, Johnson said. “This activity is not a GTL activity, but it is one we’re pursuing because the Office of Science has a continuing interest in being able to solve hard problems in biology,” he said.

Johnson said that once the one-year pilot phase is over, the DOE may support a follow-on project, but there are no definite plans to do so. “There are a lot of options, and we’re very anxiously awaiting some results from this pilot to see what we could do next and see what kind of budget situation we have to work with,” he said.

Distributed vs. Shared Memory

Commodity clusters have risen to HPC prominence fairly rapidly. The biannual Top500 supercomputing ranking, which included only 43 cluster-based systems three years ago, listed 294 such systems in its most recent iteration [BioInform 11-15-04].

But despite the popularity of these systems, most of them rely on a distributed memory architecture that comes up short for some research applications. Distributed memory is well-suited to “embarrassingly parallel” applications that break large problems into many independent tasks — the classic Blast farm, for example — but isn’t suitable for applications in which all of the pieces of the problem depend on the results of other tasks. Such applications require a level of inter-processor communication that would bring a typical cluster to a grinding halt.
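
To make the distinction concrete, here is a minimal sketch in Python of the kind of workload a distributed-memory cluster handles well: a farm of fully independent jobs that never need to exchange results. The task names and sizes are invented for illustration.

    from multiprocessing import Pool

    def score_query(query):
        # Stand-in for one independent search task, e.g. a single Blast query.
        return sum(ord(c) for c in query) % 97

    if __name__ == "__main__":
        queries = [f"SEQ{i:04d}" for i in range(1000)]   # hypothetical query set
        with Pool(processes=8) as pool:
            # No task needs another task's result, so there is essentially no
            # inter-process communication -- the case where clusters shine.
            results = pool.map(score_query, queries)
        print(len(results), "independent tasks completed")

When tasks instead depend on results computed elsewhere, this map-style decomposition forces constant message passing between nodes, which is the bottleneck described above.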

One of the goals of BioPilot is to explore the use of shared-memory systems such as SGI’s Altix and Cray’s X1 for certain bioinformatics applications. These machines are more expensive than commodity clusters, but offer some benefits for data-intensive applications, Straatsma said. “They have the advantage that all the processors have access to all of the memory. So if we have large data sets that we need to keep in memory and all of the processors on these parallel machines need to be able to access all of that data, it would be a lot more efficient if we had the shared memory or addressable memory systems.”
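
A minimal sketch of the shared-address-space idea Straatsma describes, assuming Python 3.8+ and the standard multiprocessing.shared_memory module: one large in-memory data set, a hypothetical stand-in for spectral data, is written once and then read at random by every worker without being copied or shipped over a network.

    import numpy as np
    from multiprocessing import Pool
    from multiprocessing import shared_memory

    def lookup(args):
        # Each worker attaches to the same block of shared memory by name and
        # reads directly from it; no per-process copy of the data set is made.
        name, shape, idx = args
        shm = shared_memory.SharedMemory(name=name)
        data = np.ndarray(shape, dtype=np.float64, buffer=shm.buf)
        value = float(data[idx])
        shm.close()
        return value

    if __name__ == "__main__":
        big = np.random.rand(1_000_000)    # stand-in for a large spectral data set
        shm = shared_memory.SharedMemory(create=True, size=big.nbytes)
        view = np.ndarray(big.shape, dtype=big.dtype, buffer=shm.buf)
        view[:] = big                      # the single copy lives in shared memory
        try:
            tasks = [(shm.name, big.shape, i) for i in range(0, 1_000_000, 100_000)]
            with Pool(processes=4) as pool:
                print(pool.map(lookup, tasks))
        finally:
            shm.close()
            shm.unlink()

On a single multiprocessor node this works out of the box; machines like the Altix and X1 extend the same principle across far more processors in hardware, which is what makes them attractive when the whole data set must stay resident and randomly accessible.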

PNNL and ORNL each have 128-processor SGI Altix machines that they will use for the project, and the labs will also use a 59-teraflop Cray X1 at ORNL. Straatsma said that the labs will deploy their algorithms on these machines along with the cluster systems they are already using “so that we can do a true comparison between the two different architectures, and [learn] what that means in terms of the efficiency of the codes we have to do these experiments.”

Most of the software tools that the labs will evaluate are already in use, Straatsma said. “We just need to change them in order to make efficient use of the architecture on which we run them.”

Benefits for Proteomics

One of these software tools is a de novo method developed at ORNL for identifying peptides in mass-spectrometry experiments. The approach uses characteristics of the raw spectra themselves to identify peptides, rather than mapping the peaks to a sequence database, as is done by common methods such as Sequest and Mascot. This approach could make protein identification as straightforward as reading an electropherogram in DNA sequencing, according to Andrey Gorin, an ORNL researcher working on the BioPilot project. Gorin said that standard peptide mass fingerprinting tools are useful for identifying known proteins, “but if it’s not in the database, how do you prove that you’re really finding something real?”
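
As a toy illustration of the general de novo idea only, not of ORNL's actual algorithm, the sketch below reads residues straight from the mass differences between consecutive fragment-ion peaks, with no sequence database involved. The peak list is invented and the residue-mass table is abbreviated.

    # Toy de novo read-out: map gaps between adjacent fragment-ion peaks to residues.
    RESIDUE_MASSES = {            # monoisotopic residue masses in daltons (subset)
        "G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841,
        "L": 113.08406, "D": 115.02694, "E": 129.04259, "F": 147.06841,
    }

    def call_residues(peaks, tol=0.02):
        """Assign each gap between consecutive peak masses to the closest residue."""
        calls = []
        for lo, hi in zip(peaks, peaks[1:]):
            gap = hi - lo
            best = min(RESIDUE_MASSES, key=lambda r: abs(RESIDUE_MASSES[r] - gap))
            calls.append(best if abs(RESIDUE_MASSES[best] - gap) <= tol else "?")
        return "".join(calls)

    # Hypothetical b-ion ladder for the invented peptide "GASV" (running residue
    # sums plus ~1 Da for the ionizing proton).
    peaks = [58.03, 129.07, 216.10, 315.17]
    print(call_residues(peaks))   # -> "ASV": the gaps spell the remaining residues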

Gorin said that ORNL’s de novo method already works on a distributed-memory architecture, “but shared memory will allow us to dramatically accelerate what we could do, and that is addressing the problem that the proteome sample includes probably 20,000 individual spectra, and for each of those spectra you have to do a lot of look-ups” — on the order of billions or even trillions, he estimated — “when you suspect it is a spectrum corresponding to a mutated protein.”

Straatsma said that currently, even though there are dependencies among spectra, “we treat them independently because each of the processors on the cluster computers is dealing with a number of spectra, but there is not a lot of communication between the processors, because that’s very inefficient.”

Straatsma said that a typical proteomics experiment identifies only 15 to 20 percent of the peptides in the sample, “and in order to increase that percentage, we need to be looking at the dependency between the spectra, and we can only do that efficiently if we have a different type of architecture — at least that’s the idea that we have that we want to explore within this particular project.”

Straatsma said that PNNL’s proteomics data currently falls within the terabyte range, but the lab expects to begin generating petabytes of proteomics data within a couple of years.

“We want to be able to do comparative analyses from different species, and if we do proteomics experiments for a number of species, and for each species we do this under different environmental conditions, and then for each experiment we have to do replicates in order to get statistical accuracy, we’re very quickly getting into the petabyte range once we do the real interesting experiments,” he said.
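
The arithmetic behind that claim is easy to sketch. Every number below is a hypothetical placeholder rather than a PNNL figure; only the species-times-conditions-times-replicates structure comes from Straatsma's description.

    # Back-of-envelope scaling of proteomics data volume. All inputs are
    # hypothetical; only the multiplicative structure follows the article.
    species    = 50     # hypothetical number of species compared
    conditions = 20     # hypothetical environmental conditions per species
    replicates = 5      # hypothetical replicates per experiment
    tb_per_run = 0.5    # hypothetical raw data per proteomics run, in terabytes

    total_tb = species * conditions * replicates * tb_per_run
    print(f"{total_tb:,.0f} TB (~{total_tb / 1000:.1f} PB)")
    # 50 * 20 * 5 * 0.5 TB = 2,500 TB, i.e. roughly 2.5 PB: the combinatorics
    # alone push terabyte-scale experiments into the petabyte range.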

Over the next year, he said, “we hope to make enough progress that we can show [the DOE] that we can continue this work to get production-quality analysis tools that others can use.”

— BT
