The Department of Energy awarded a group of researchers at the University of Washington 2 million hours of supercomputing time last week to study how proteins fold and how amino acid sequences predict protein structure.
The award was one of three given by the DOE as part of the Innovative and Novel Computational Impact on Theory and Experiment program, which was introduced in July 2003. The program aims to select a small number of computationally intensive, large-scale research projects that can make scientific advances through the use of a substantial allocation of computer time and data storage at the DOE’s National Energy Research Scientific Computing Center in Berkeley, Calif.
Valerie Daggett, a professor of medicinal chemistry who heads the research group at the University of Washington, said using the supercomputer will allow her group to run hundreds of protein-folding simulations, using up to 1,000 processors at a time.
“We are excited about the massive resources we will have access to now,” said Daggett. “Using the clusters of computers we have now [in our lab], there’s no way that we would have been able to do this in five years. We will be putting much more information into the [protein-folding] algorithms than there currently is — that should bootstrap us to better structures.”
Daggett said with this project, her group plans to study 1,130 different proteins to see how they unfold from their native structure to a denatured structure.
“Structure prediction remains one of the elusive goals of protein chemistry,” Daggett wrote in her INCITE proposal. “It is necessary to successfully predict native states of proteins, in order to translate the current deluge of genomic information into a form appropriate for better functional identification of proteins and drug design.”
Daggett’s group has already simulated the folding of about 30 proteins, which represent about 50 percent of all known folds. By studying those proteins, the group has identified some general principles of folding.
“We’re finding that at the sequence level, different amino acids have different kinds of secondary structures in denatured states. Different side chains tend to find each other even in the denatured states, which helps get the folding process started,” said Daggett.
Daggett said that additional data from the 1,130 proteins should help map protein folding beyond the starting point to intermediate and transition states, and should help identify the crucial interactions that drive the folding process along.
“We can already map in great detail for the folding process for individual proteins, but now what we want to see is what’s more general, and to take that to the sequence level,” said Daggett.
One of the biggest challenges of the project will be managing the amount of data that it produces, said Daggett.
Daggett estimated that the project’s data will take up about 77 terabytes of space. The group hopes to eventually turn the data into a publicly accessible database that can be searched.
“It would be like the protein database, but with much more information, with movies on how proteins move and things like that,” said Daggett. “Then people could, for example, do searches with coordinates to find out what’s the prevalence of a certain type of protein interaction.”
Making the data publicly accessible will be challenging because all the initial processing is done behind a firewall, Daggett said. The group will have to take measures to keep the original data safe, ideally without duplicating massive amounts of data.
The research group is currently in the process of setting up accounts with the DOE to use the supercomputer. The 2 million hours of computing time should be used up by the end of November, Daggett said.
“Then the real fun starts with the science,” she said. “With trying to cull out the general principles of folding, having more information than we’ve ever had before.”
In addition to providing insight into the dynamic structures of proteins, Daggett envisions that the new study will also provide insight into numerous diseases related to protein unfolding and aggregation, such as Alzheimer’s disease and bovine spongiform encephalopathy, also known as mad cow disease.
Daggett’s proposal was one of 23 submitted to the 2005 INCITE program. The other two awards went to a project by Sandia National Laboratories in Livermore, Calif., to gain insight into ways of reducing pollutants and increasing the efficacy of combustion devices, and to a project by a group from the University of Chicago to study how stars and solar systems form.
The Livermore project was awarded 2.5 million computing hours, while the Chicago project was awarded 2 million computing hours.
“The level of pent-up demand for dedicated time on supercomputers highlights the fact that computational science is playing an increasingly important role in advancing scientific and technical research at national laboratories and universities,” said Secretary of Energy Spencer Abraham, following the announcement of the DOE awards. “The quantity and quality of proposals for this year’s INCITE program clearly shows the need for increased supercomputing resources to address issues that affect all of us.”
Daggett said the 2 million hours of computing time will probably allow her group to simulate about half of the 1,130 proteins she wants to study. She plans to first go through results of those simulations before applying to the DOE for more supercomputing time.
“We’ll see how the first round goes, and if it’s looking promising then we’ll try to work our way all through the list,” she said.