Distributed computing — often touted as a breakthrough for the computational demands of protein-folding simulation — may not be as effective a solution as originally thought, according to several recent peer-reviewed papers.
In October 2002, Alan Fersht of the Cambridge University Center for Protein Engineering first noted some “inherent problems” of the approach in a paper in the Proceedings of the National Academy of Sciences [PNAS 99(22): 14122-14125]. Last month, two separate papers confirmed Fersht’s doubts: a computational study by Emanuele Paci and colleagues from the biochemistry department at the University of Zurich in PNAS; and a wet lab experiment from Jan Kubelka and colleagues from the National Institute of Diabetes and Digestive and Kidney Diseases in the Journal of Molecular Biology.
At issue is the way the distributed approach splits up the protein folding problem: A single molecular dynamics simulation on the time scale of tens of microseconds would take thousands of years of CPU time, so distributed projects perform tens of thousands of very short simulations, on the scale of tens of nanoseconds, via screen-saver programs on PCs spread across the globe. The problem is that the early steps of the folding mechanism are not typical of the process as a whole: the protein squirms a bit as it takes its final shape, with the most twisting and turning at the beginning. Splitting the process up into short bits causes atypical folding pathways to be overrepresented in the final simulation, because the few rare instances where the protein snaps into its final form within a few nanoseconds can bias the overall result.
“It’s like if you want to know what the roads between Chicago and New York are, so you take a million people and you study how they go back and forth,” Paci explained. “But if you only leave 20 minutes to go from one to the other, you will only see the two or three lucky ones who take a plane, so you will get a pathway that exists, but doesn’t give you any idea of the landscape between the two cities…It won’t be the path that the majority of people would take.”
Paci and his colleagues tested the distributed computing approach on GS, a 20-residue beta-sheet peptide short enough to simulate on its natural, or equilibrium, time scale. Using a 100-processor Beowulf cluster, his team took 500 days to run the equilibrium folding simulation along with 14,300 one-nanosecond simulations to mimic the distributed approach.
They found that the estimated folding time using the distributed approach was accurate only when the simulation times were around 100 times shorter than the average equilibrium folding time. Shorter trajectories, 100 to 500 times shorter than equilibrium, led to errors due to “the peculiar behavior of the fast folding events that consist of atypical sequences of conformational transitions, not representative of the major folding pathways,” Paci and colleagues noted in their PNAS paper.
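To make the sensitivity to trajectory length concrete, here is a minimal Python sketch. It is a toy illustration and not the model used by Paci's group or by any distributed project. It assumes the common short-trajectory estimator, in which the folding rate is taken as the fraction of trajectories that fold divided by the trajectory length. That estimator is only valid if folding behaves as a single exponential step. The two-step model, the rate constants, and all function names are hypothetical and chosen only so that both models have the same mean folding time of 1.0 in arbitrary units.

```python
import math


def estimated_folding_time(fraction_folded, window):
    """Short-trajectory estimate: rate ~ fraction_folded / window,
    so folding time ~ window / fraction_folded.
    Assumes single-exponential kinetics (an assumption, not a given)."""
    if fraction_folded <= 0:
        raise ValueError("no folding events observed in this window")
    return window / fraction_folded


def fraction_folded_one_step(k, t):
    """Expected fraction folded by time t for a one-step process U -> F."""
    return 1.0 - math.exp(-k * t)


def fraction_folded_two_step(k1, k2, t):
    """Expected fraction folded by time t for a sequential process
    U -> I -> F with distinct rates k1 and k2 (hypothetical toy model)."""
    if k1 == k2:
        raise ValueError("this closed form needs k1 != k2")
    return 1.0 - (k2 * math.exp(-k1 * t) - k1 * math.exp(-k2 * t)) / (k2 - k1)


# Hypothetical rates: both models have a true mean folding time of 1.0.
ONE_STEP_RATE = 1.0        # mean = 1 / 1.0
K1, K2 = 1.5, 3.0          # mean = 1 / 1.5 + 1 / 3.0

for divisor in (10, 100, 500):
    window = 1.0 / divisor  # trajectory length as a fraction of the mean
    one = estimated_folding_time(
        fraction_folded_one_step(ONE_STEP_RATE, window), window)
    two = estimated_folding_time(
        fraction_folded_two_step(K1, K2, window), window)
    print(f"window = mean/{divisor}: one-step estimate {one:.3f}, "
          f"two-step estimate {two:.1f}")
```

In this toy, the estimate stays close to the true value when folding really is a single step, but it drifts further from the true value as the trajectories get shorter when an intermediate step is involved. The direction and size of the error here belong to the toy model only and should not be read as the paper's result.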
Kubelka’s experimental approach resulted in a similar conclusion, with an additional, biologically significant finding: A substitution event in a subdomain of the protein villin, predicted by Stanford’s Folding@Home project to increase the folding rate, was shown to have no effect on the actual folding rate.
This finding is bound to stir up discussion about Folding@Home and other distributed protein folding projects, which have attracted abundant interest from the bioinformatics community as well as the general public. Folding@Home (http://www.stanford.edu/group/pandegroup/folding/), launched in 2000, now boasts 227,486 registered users, and the Distributed Folding project at the Samuel Lunenfeld Research Institute (http://www.distributedfolding.org/) claims 20,680 registered users. Even Intel and Google have jumped on the bandwagon, pitching in compute cycles for Folding@Home and generating a fair bit of press coverage for the project.
Distributed computing may not be perfect, Paci said, but there are very few alternatives right now — the tiny peptide his lab used is not typical of the much larger molecules of interest to most research groups. “It’s obvious that one linear simulation can’t be done” using current approaches, he said. Even IBM’s ambitious Blue Gene project has shifted from a linear approach to a hybrid linear/distributed approach, he noted.
“Any discussion of how to use massively parallel computers today is good,” Paci said. “I think the distributed computing approach is very interesting because it triggered this debate.”