Parabon Computation is testing the limits of its distributed computing platform in a protein folding study being conducted by the University of Maryland’s department of chemistry and biochemistry.
The researchers, led by Devarajan Thirumalai, will use Parabon’s Frontier computing platform to perform protein folding simulations.
Parabon is providing access to Frontier free of charge as part of its Compute Against Cancer program, an initiative to donate computation cycles to non-profit cancer research programs. In turn, Parabon will gain a better understanding of the platform’s ability to handle the complex data analysis associated with protein folding studies.
“The type of programs that Dr. Thirumalai is running across our system represent some pretty significant computing challenges,” said Mark Weitner, vice president of sales and marketing at Parabon. “It has helped us understand the depth of our solution and to continue to build upon the platform.”
The intense computational power demanded by protein folding simulation has given rise to a number of projects focused on accelerating the process. IBM chose protein folding as the first “grand challenge” to address with its Blue Gene supercomputer project, which is expected to be capable of more than one quadrillion operations per second when completed in 2004.
Two other protein folding projects are using a distributed approach similar to Parabon’s, which works by dividing large computing jobs into small tasks and sending them over the Internet to thousands of computers. Folding@Home (http://foldingathome.stanford.edu), run by Vijay Pande of Stanford University’s department of chemistry, uses multiple processors to compensate for the 1,000-fold gap between the nanosecond timescale of computer simulations and the microsecond timescale at which the fastest proteins fold. And Folderol (www.folderol.org), an independent project led by ex-gamer Scott LeGrand, uses distributed power to improve statistical analyses of theoretical structure prediction techniques.
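The scatter/gather pattern these platforms rely on can be sketched in a few lines. This is an illustrative toy, not code from Frontier, Folding@Home, or Folderol; a local thread pool stands in for the Internet-connected volunteer machines, and the task function is a placeholder.

```python
# Illustrative sketch of the distributed pattern described above: a large
# job is divided into small, independent tasks, farmed out to many workers,
# and the partial results are combined. A thread pool stands in here for
# volunteer machines on the Internet.
from concurrent.futures import ThreadPoolExecutor

def simulate_task(task_id: int) -> int:
    """Stand-in for one small unit of work (e.g., a short simulation run)."""
    return task_id * task_id  # placeholder computation

def run_distributed_job(num_tasks: int, num_workers: int) -> int:
    # Scatter: hand each independent task to whichever worker is free.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(simulate_task, range(num_tasks)))
    # Gather: combine the partial results into one answer.
    return sum(partial_results)

print(run_distributed_job(num_tasks=1000, num_workers=8))  # 332833500
```

The key property making this viable over a slow, unreliable network is that the tasks are independent, so workers never need to communicate with each other, only with the coordinator.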
Now that Linux clusters are becoming the standard, Pande said, distributed computing is considered “bleeding edge research” for computationally complex problems.
Thirumalai himself admitted to being skeptical when Parabon offered to run his algorithms on its Frontier platform. “I had known about Folding@Home, but I didn’t really think about getting into this [for protein folding] because my own approach has been involved more in thinking about the theoretical aspects of the problem than purely computational,” Thirumalai said.
After Parabon ported the code from his Alpha cluster to Frontier and successfully ran a sample folding problem, Thirumalai was pleasantly surprised with the results. “I’m quite impressed with what has been done so far,” he said. “Some computations which normally take about a month or two on dual processors can now be done in a day.”
While Parabon converted Thirumalai’s code to Java in order to run the algorithms on Frontier, Folding@Home’s Pande opted for C and Fortran because Java can run more slowly when spread across thousands of processors, he said. With 16,992 registered users, Pande has been able to drastically speed his analysis. He has already submitted a paper to a leading scientific journal, which he said would be the first published protein folding research based on a distributed computing approach.
But while distributed computing appears to be a promising method to address the complexities of protein folding, Folderol’s LeGrand cautioned that “this problem has been unsolved for over 40 years and there’s a good reason for that.”
He said that many questions remain regarding the effectiveness of the molecular dynamics simulations of Folding@Home and Thirumalai’s project. “The hope that everyone has pinned on this is if you take one of these simulations out to the one-millisecond range, you’ll fold a protein, but of course that question has never been answered,” he said.
Rather than recreating the folding process, he said, his project uses techniques that performed well in Lawrence Livermore National Laboratory’s CASP (Critical Assessment of Techniques for Protein Structure Prediction) competition, in which structure prediction techniques are judged by their similarity to known structures obtained through crystallography and nuclear magnetic resonance.
LeGrand relies on the distributed system more for its statistical potential than its computational power, he said. “I’m using distributed computing to generate vastly more structures than I could normally generate so I can perform a more thorough statistical analysis on what’s generated. One of the things that you really want when you run an analysis is a lot of instances so you have a distribution.”
Further questions center around how these distributed efforts will stack up to IBM’s Blue Gene project. Both Pande and Thirumalai, however, think that IBM will be open to incorporating some lessons learned from their respective projects.
Pande suggested that a hybrid method that combines clustering and distributed computing might be the best approach to the problem. Each 1,000-processor chunk would serve as a single processor in a distributed system of 1,000 units, he said, effectively scaling to the million processors planned for Blue Gene.
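The arithmetic behind Pande’s suggestion can be made explicit. The numbers are those quoted above; the function name and structure are illustrative, not from any real system design.

```python
# Toy arithmetic for the hybrid layout Pande suggests: each tightly coupled
# cluster behaves as one "super-processor" in the wider distributed system,
# so total capacity is the product of the two levels.

def effective_processors(num_clusters: int, processors_per_cluster: int) -> int:
    """Total processors in a two-level cluster-of-clusters design."""
    return num_clusters * processors_per_cluster

# 1,000 distributed units, each a 1,000-processor cluster, reach the
# million processors planned for Blue Gene.
print(effective_processors(1000, 1000))  # 1000000
```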
Said Thirumalai, “Perhaps this new avenue will have taught us a number of lessons about the real serious limitations or requirements of the problem itself, so that will serve very well for the IBM people to avoid some of the mistakes they may initially make in terms of applications.” He recommended that IBM make its intermediate architectures available to researchers to evaluate before Blue Gene reaches the petaflop level.
Until Blue Gene is up and running, however, distributed approaches may be the best option for solving the compute-intensive challenges of protein folding and other biological simulations. Thirumalai’s most recent job, for example, took six days at a sustained rate of 22 gigaflops, equivalent to a 250-node cluster, according to Parabon’s Weitner, who noted that a similar job would take four years on a single machine. LeGrand said that the 10,000 simulations he needs to run for each complete folding experiment take a month with Folderol but would take 13 years on a single machine, and Pande said the work Folding@Home has done so far would have taken 1,500 years on a single machine.
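The quoted figures can be sanity-checked with simple division; since the article’s numbers are round, the resulting ratios are only approximate.

```python
# Rough sanity check on the speedups quoted above.

def speedup(single_machine_days: float, distributed_days: float) -> float:
    """Ratio of single-machine runtime to distributed runtime."""
    return single_machine_days / distributed_days

# Thirumalai's job: ~4 years on one machine vs. 6 days on Frontier,
# consistent with the 250-node-cluster equivalence Weitner cites.
print(round(speedup(4 * 365, 6)))    # 243

# LeGrand's experiments: ~13 years vs. about a month with Folderol.
print(round(speedup(13 * 365, 30)))  # 158
```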
Parabon will build on the protein structure experience gained through its work with Thirumalai in its next offering, a genetic algorithm-based framework for protein threading that will run on its Frontier platform. The application should be available in the second quarter of the year.
Pande is currently gearing up to release Folding@Home 2.0, and has recently launched a new effort, Genome@home, to design new protein sequences. LeGrand and Thirumalai both intend to submit the results of their work for publication in the next few months.
Thirumalai said Parabon’s system offered him “the possibility of entering into problems that I might not have thought about doing — really trying to computationally fold longer proteins, which I might not have done if the computational resources were not available to me.”
“I’m actually quite optimistic that there’s a considerable future in this game,” he said.