By John S. MacNeil
When Charles Brooks began taking an interest in CASP, the Critical Assessment of Techniques for Protein Structure Prediction, it didn’t take long before the Scripps Research Institute computational biologist realized he might need some help.
Brooks’ problem with CASP, a biannual contest to compare various approaches to predicting protein structure from amino acid sequence, was less theoretical than practical: Like many researchers who have tackled the CASP challenge, Brooks and his group members found the computational requirements formidable. Typically, he says, algorithms for protein structure prediction work by searching the high-dimensional space that encompasses all the possible folding patterns for a given protein, and then applying energy minimization and scoring functions to pull out the structures most likely to occur in nature.
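The sample-then-score strategy Brooks describes can be sketched in a few lines of Python. The conformation representation and the scoring function below are toy stand-ins for illustration only, not his group's actual force fields or potentials:

```python
import random

def sample_conformation(sequence):
    # Toy representation: one (phi, psi) backbone dihedral pair per residue.
    return tuple((random.uniform(-180, 180), random.uniform(-180, 180))
                 for _ in sequence)

def score(conformation):
    # Placeholder standing in for energy minimization plus a scoring
    # function; lower is treated as more native-like.
    return sum(abs(phi) + abs(psi) for phi, psi in conformation)

def predict(sequence, n_samples=10_000, n_keep=10):
    # Search the space of possible folding patterns by random sampling,
    # then pull out the candidates the scoring function ranks best.
    candidates = [sample_conformation(sequence) for _ in range(n_samples)]
    return sorted(candidates, key=score)[:n_keep]

best_models = predict("MKTAYIAKQR")  # hypothetical 10-residue sequence
```

The expense is plain from the sketch: honest coverage of the conformational space demands enormous numbers of samples, which is exactly where the computational requirements become formidable.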
In the two most recent CASP competitions, Brooks says, his team developed effective scoring functions but had only a limited ability to sample all the possible structures a protein sequence could take. His group had access to several hundred processors in Linux Beowulf clusters through a shared facility during CASP4 and CASP5, but a retrospective analysis indicated that the effort was resource-limited, he says.
So instead of investing in heavy-duty hardware or renting time on a supercomputer, Brooks considered a more creative solution. Why not try an Internet distributed computing solution, modeled after the SETI@home project, which employs thousands of individual PCs across the Internet to search for signs of extraterrestrial life? Brooks had avidly followed a related project designed to probe the mechanics of protein folding, called Folding@home, and now he wanted to see whether he could use a similar approach to predict the final structure of specific protein sequences.
What made the approach feasible, Brooks says, was the development of an off-the-shelf program to serve as the interface between worker PCs and the administrative server responsible for collecting and collating the computational results. Earlier projects like SETI@home were forced to write their own middleware, but in 2002, researchers led by David Anderson at the University of California, Berkeley, developed the Berkeley Open Infrastructure for Network Computing, known as BOINC, and Brooks jumped on the opportunity to participate in a pilot project to try it out.
Brooks' investment in hardware was relatively minimal. With the help of postdoc Michela Taufer, a graduate student, and two undergraduates in the computer science department, he set up a 3-GHz dual-processor Linux box equipped with a terabyte RAID system, at a cost of about $10,000, to serve as the administrative server. Next he began recruiting volunteers to download the software and individual problems to their PCs.
Since beginning the project in early June, Brooks has assembled 5,000 users from 80 countries, thanks to word of mouth and a link on the Berkeley site that administers BOINC. As of mid-July, Brooks' effort had accomplished the equivalent of about 4 million CPU hours with only a few glitches, such as database crashes and power failures, he says. "These are to be expected," he says, "[but] we have had no real problems with data quality. We distribute work with homogeneous redundancy so that we can verify all findings."
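The redundancy check Brooks mentions can be illustrated with a minimal quorum test: the same work unit goes to several volunteer hosts, and a result is accepted only when enough of them agree. The function and values below are illustrative, not BOINC's actual validator API:

```python
from collections import Counter

def validate_results(results, quorum=2):
    # Count how many hosts returned each distinct result for one work
    # unit; accept the most common result only if it meets the quorum.
    value, votes = Counter(results).most_common(1)[0]
    return value if votes >= quorum else None

# Three volunteer hosts returned results for one hypothetical work unit:
assert validate_results(["-42.7", "-42.7", "-41.9"], quorum=2) == "-42.7"
# With no agreement, the work unit would be reissued rather than accepted:
assert validate_results(["-42.7", "-41.9"], quorum=2) is None
```

Sending each work unit to machines of the same platform, as homogeneous redundancy does, makes such bitwise agreement checks meaningful despite floating-point differences across architectures.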
Ultimately, Brooks hopes to use the Predictor@home experiment to understand how to package "how-to" scripts that other computational biologists without extensive computer science expertise can employ to set up similar projects. "It's my hope that we can utilize this type of infrastructure to provide service-level assistance to biologists for problems of structure prediction, loop modeling, and homology modeling in the context of our NIH Research Resource once CASP is done," Brooks says. In other words, for certain problems, finding an Internet distributed computing solution may be easier than you think.