A new approach developed by researchers at the Massachusetts Institute of Technology and elsewhere promises to enable protein-folding simulations for very large proteins with unknown structures in minutes on a single CPU — a process that would otherwise require hundreds of thousands of CPU-hours with all-atom molecular dynamics simulations.
The approach, built on a method called ensemble modeling, first samples the complete conformational landscape of a large protein using sequence data alone, and then builds a "coarse-grain" representation of the protein's energy landscape that is used to model intermediate folding states and, subsequently, the folding process.
The researchers, from MIT's Computer Science and Artificial Intelligence Laboratory — in conjunction with collaborators at McGill University, Boston College, and MIT's biology department — have been working on the approach for several years. They recently developed an improved version of the method for beta-sheet proteins, called tFolder, which they will present at next week's International Conference on Research in Computational Molecular Biology. In addition, they will present a paper that describes a version of the approach for use in amyloid proteins at the Intelligent Systems for Molecular Biology conference in July.
Classical molecular dynamics methods, which simulate protein folding on an atom-by-atom basis, are the gold standard in terms of accuracy, but are so computationally intensive that they are limited to small proteins. Other approaches, so-called motion-planning methods, compute potential intermediate structures in order to speed the simulation time, but they require the three-dimensional structure of the protein's native state and therefore do not work for proteins with unknown structures.
The ensemble approach "reconciles the MD and motion-planning approaches for studying folding pathways," according to its developers, by simulating large proteins with unknown structures in a reasonable amount of computing time. While the method is not as accurate as MD approaches, it "greatly expands the number of proteins whose folding pathways can be studied," they note in the RECOMB paper.
The approach is particularly promising for beta-sheet proteins, which are very difficult to simulate because they are "stabilized by inter-strand residue interactions, and thus the folding and assembly of these structures is largely influenced by long-range interactions and global conformational rearrangements," they said.
Charles O'Donnell, a doctoral student in the department of electrical engineering and computer science at MIT who developed the method along with Jérôme Waldispühl, an assistant professor of computer science at McGill University, told BioInform that ensemble modeling generates "representations that are flexible so that one can choose the best trade-off between accuracy and speed for a particular question."
In other words, the ensemble technique provides a "high-level description of a protein's shape" without incorporating specific details such as which atoms are located in what positions and what each atom's velocity is, he said.
The method relies on a statistical-mechanical approach where "the idea is that you describe a protein's potential conformational space and calculate the energy of each state to come up with the partition function," O'Donnell explained. "From that you can find out interesting statistics about the ensemble," for example, the likelihood of any given protein structure occurring.
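The statistical-mechanical idea O'Donnell describes can be illustrated with a toy calculation. The sketch below is not the researchers' code: the three state names and their energies are hypothetical values chosen only for illustration, and the function name is our own. It assigns each state a Boltzmann weight, sums the weights to get the partition function, and divides to get the likelihood of each state.

```python
import math

# Hypothetical toy ensemble: state label -> energy in kcal/mol.
# These values are illustrative assumptions, not data from the paper.
ENERGIES = {"unfolded": 0.0, "hairpin": -1.2, "native_sheet": -2.5}

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def boltzmann_probabilities(energies, temperature=300.0):
    """Return (Z, probabilities) for a discrete set of states.

    Z is computed relative to the lowest-energy state, which keeps the
    exponentials well behaved and leaves the probabilities unchanged.
    """
    beta = 1.0 / (K_B * temperature)
    e_min = min(energies.values())
    weights = {
        state: math.exp(-beta * (energy - e_min))
        for state, energy in energies.items()
    }
    z = sum(weights.values())
    probabilities = {state: w / z for state, w in weights.items()}
    return z, probabilities


if __name__ == "__main__":
    z, probs = boltzmann_probabilities(ENERGIES)
    print(f"relative partition function: {z:.4f}")
    for state, p in sorted(probs.items(), key=lambda item: -item[1]):
        print(f"{state:>12}: {p:.3f}")
```

A real ensemble method would enumerate or sample a vastly larger set of conformations, but the relationship between energies, the partition function, and state likelihoods is the same.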
As to any concerns about the trade-off in accuracy for faster results, O'Donnell pointed out that the goal is to give a "quick approximation" of a potential protein pathway or a likely structure that researchers can test experimentally — "not an end all where you run this through and you have a final solution."
Indeed, O'Donnell and colleagues note in the RECOMB paper that tFolder "complements the use of MD simulations as the MD can be used to explore the nuanced structural interactions that certainly occur near a transition highlighted by tFolder."
So far, O'Donnell said, the researchers have applied the methods to predict multiple protein structure states; identify important sequence mutations that control structural variation; perform comparative analysis of multiple proteins by simultaneously aligning protein sequences and predicting their structures; and predict "folding dynamics by calculating ensemble structure states, modeling protein folding as a Markov process, and using a master equation to simulate population dynamics over time."
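The last item, treating folding as a Markov process governed by a master equation, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not tFolder itself: the three coarse states, the transition rates, the step size, and the function name are all hypothetical, and simple forward Euler integration stands in for whatever solver the researchers use.

```python
# Hypothetical coarse-grained states and transition rates (per unit time).
# RATES[i][j] is the rate of moving from state i to state j.
# All numbers are illustrative assumptions.
STATES = ["unfolded", "intermediate", "folded"]
RATES = [
    [0.0, 0.5, 0.0],
    [0.1, 0.0, 0.4],
    [0.0, 0.05, 0.0],
]


def simulate_master_equation(p0, rates, dt=0.01, steps=5000):
    """Integrate dp_i/dt = sum_j (k_ji * p_j - k_ij * p_i) with forward Euler.

    p0 is the initial population of each state and should sum to 1.
    Returns the populations after `steps` steps of size `dt`.
    """
    p = list(p0)
    n = len(p)
    for _ in range(steps):
        dp = [0.0] * n
        for i in range(n):
            for j in range(n):
                if i != j:
                    flux = rates[i][j] * p[i] * dt  # population moving i -> j
                    dp[i] -= flux
                    dp[j] += flux
        p = [p_i + d_i for p_i, d_i in zip(p, dp)]
    return p


if __name__ == "__main__":
    # Start with the whole population unfolded and follow it over time.
    final = simulate_master_equation([1.0, 0.0, 0.0], RATES)
    for state, population in zip(STATES, final):
        print(f"{state:>12}: {population:.3f}")
```

Because every bit of population that leaves one state enters another, the total stays at 1 throughout, and with these rates most of the population ends up in the folded state.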
MIT said in a statement that when the method was used to predict amyloid structures, its results matched currently available data with 81 percent accuracy, compared with similar methods, whose results matched only 42 percent of the time.
The upcoming paper describes what software was used for the comparison, although O'Donnell noted that these results may vary to some extent because little is known about amyloid structures. Furthermore, he said, most structure prediction tools have "slightly different goals."
In the RECOMB paper, the authors evaluated the accuracy of tFolder by comparing it to software designed for only one aspect of the method — inter-strand residue contact prediction. They noted that tFolder performed "comparably" to the SVMcon and BETApro algorithms designed for this task, though the specialized tools perform better in some cases.
The ensemble methods have been implemented in a suite of tools called Partifold, available through MIT and McGill, and the team plans to make tFolder and the amyloid tool available on MIT's and McGill's sites, respectively, once the papers have been presented.