AT A GLANCE
NAME: Ajay Royyuru
POSITION: Manager, Structural Biology Group, IBM Research Computational Biology Center
As the CASP5 showdown approaches, the editor of ProteoMonitor’s sister publication BioInform spoke with Ajay Royyuru, an IBM researcher whose group is developing new strategies for using sequence data to predict protein structure.
So you’re the CASP guy?
I am. It’s a little bit uncomfortable to be labeled as the CASP guy. I’d rather be seen as a biologist attempting to understand protein structure and compute things related to protein structure. I prefer that label. CASP is really good for the field, I think. It certainly motivates us as well, but I certainly don’t see it as our only reason for existence. The science is a lot more motivating. CASP is just one of the checkpoints to establish progress, both in our own hands and relative to how others are doing in the field.
When we last spoke, you had discovered the hydrophobicity ratio [a technique used as a scoring function to assess structure predictions based on hydrophobicity/hydrophilicity of interior and exterior residues] and you expected that to really push your work along. I’m curious to find out what progress you’ve made since then on that aspect of your work.
We discussed that just as [IBM researcher] Dave Silverman’s paper came out in PNAS. Dave’s paper described the hydrophobicity profile as a quantitative concept, and his paper showed how the profile behaves for 30 hand-picked proteins. We really wanted to do this across all known proteins and see how well it behaves. Dave’s statement was, “Oh it will, it will.” I liked his optimism, so I took on the task of putting the code together, streaming every known protein through it, and basically processing all of PDB.
In one round we looked at all the single chains from PDB, but we were very cognizant of the fact that many of these single chains might actually be multiple domains on the chain. So in the second round we actually took each domain from PDB. What we found is that in the majority of proteins — and I need to qualify this majority — we are looking at proteins that are 70 amino acids or more, and we are looking at proteins that are in the soluble globular SCOP domains, which excludes membrane proteins, small peptides, modeled proteins and so on. We find the profile very well-behaved amongst those.
The profile is computable for other proteins smaller than 70, but with the math that is involved in the nature of the computation, you don’t accumulate enough signal for smaller proteins. One thing that we’ve now learned by looking across all proteins in SCOP is that we seem to be characterizing an attribute of proteins that behaves better and better as the protein size grows, which is quite different from other scoring functions that are typically a summation of pair-wise terms. Because you’re summing more and more terms, your potential energy surface tends to get increasingly rough as you move to bigger proteins. So for people who do, for example, threading potentials based on pair-wise terms or things of that sort, bigger proteins are a challenge, because with misfolded or even grossly misfolded proteins you often cannot distinguish them using pair-wise potentials. I’m not saying all of them, but many of them have this sort of a drawback because you’re summing up a huge number of terms and looking for a small difference.
Those sorts of potentials tend to degrade with size, whereas we seem to have a potential that improves with size because we accumulate more signal as the protein grows. We are characterizing something that is somewhat complementary or orthogonal to other characterizations. One does not eliminate the need for the other, it just means that you need to use things in conjunction. It’s a good thing to have two or more such potentials in conjunction because then you’re sensing two different aspects of the protein. It’s better to be right by two independent measures than to be wrong by just one measure.
So that’s our hope, that we have a different way of assessing whether the protein is well-folded or not.
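The interior/exterior idea behind a hydrophobicity-based score can be illustrated with a toy calculation. The sketch below is not the published hydrophobicity profile; it is a simplified stand-in that assumes one coordinate per residue and a caller-supplied hydrophobicity value per residue, and the function name `burial_contrast` and the `interior_fraction` parameter are hypothetical.

```python
import math

def burial_contrast(coords, hydrophobicity, interior_fraction=0.5):
    """Toy score: mean hydrophobicity of interior residues minus exterior residues.

    coords: list of (x, y, z) tuples, one per residue (assumed representation).
    hydrophobicity: list of per-residue values on any consistent scale.
    interior_fraction: share of residues, closest to the centroid, treated
    as "interior" (an arbitrary choice made for this illustration).
    """
    if len(coords) != len(hydrophobicity) or len(coords) < 2:
        raise ValueError("need matching lists with at least two residues")
    n = len(coords)
    # Geometric center of the structure.
    centroid = tuple(sum(c[i] for c in coords) / n for i in range(3))
    # Residue indices ordered from most buried to most exposed.
    order = sorted(range(n), key=lambda i: math.dist(coords[i], centroid))
    n_in = max(1, min(n - 1, round(n * interior_fraction)))
    interior, exterior = order[:n_in], order[n_in:]

    def mean(indices):
        return sum(hydrophobicity[i] for i in indices) / len(indices)

    # Positive when hydrophobic residues sit in the core, as in a folded globule.
    return mean(interior) - mean(exterior)


coords = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (-10, 0, 0)]
buried_core = [4.5, 3.8, -3.5, -3.9]    # hydrophobic residues near the center
exposed_core = [-3.5, -3.9, 4.5, 3.8]   # hydrophobic residues on the outside
print(burial_contrast(coords, buried_core) > burial_contrast(coords, exposed_core))
```

Because the score averages over all residues rather than summing pairwise terms, more residues give the averages more data to work with, which is in the spirit of the size behavior described above.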
So what does this mean for your work going forward? How are you combining this method with pairwise methods and other structure prediction methods?
From a structure prediction point of view, we understand that this method in the best circumstances will possibly detect folds, which is really good news for us if it works that way. CASP will teach us whether it indeed works that way, and also other tests that we’re doing along with CASP. So the jury’s out, but in the best circumstances I think we will be able to detect folds. [However], once we land in the fold, we will not be able to sense minor variations in the structure, so we need other ways of being able to tell how a two ångström structure is better than a three ångström structure. That’s broadly the strategy that we are pursuing at this point. Given all possible alignments of sequence and structure, we pick the ones that score well by the hydrophobicity profile (we are not necessarily limiting ourselves to one such good profile), and then we rank or score those structures and pick the ones that are better by some other measure.
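The two-stage strategy described here, shortlisting by one score and then ranking the shortlist by a second, independent score, can be sketched as follows. This is a minimal illustration under stated assumptions, not the group's implementation: the `Candidate` fields, the score directions (higher profile score is better, lower detail score is better), and the `keep` cutoff are all hypothetical.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    profile_score: float  # coarse fold-level score; higher is better (assumed)
    detail_score: float   # fine-grained, energy-like score; lower is better (assumed)

def select_models(candidates, keep=3):
    """Shortlist by the coarse score, then rank the shortlist by the fine score."""
    # Stage 1: keep several candidates that look right at the fold level.
    shortlist = sorted(candidates, key=lambda c: c.profile_score, reverse=True)[:keep]
    # Stage 2: order the survivors by the more detailed measure.
    return sorted(shortlist, key=lambda c: c.detail_score)


candidates = [
    Candidate("aln_a", 0.9, -10.0),
    Candidate("aln_b", 0.8, -30.0),
    Candidate("aln_c", 0.2, -50.0),  # best detail score, but fails the fold-level filter
    Candidate("aln_d", 0.7, -20.0),
]
for model in select_models(candidates):
    print(model.name, model.detail_score)
```

Keeping more than one candidate in the first stage reflects the point that the fold-level score is not expected to separate near-identical structures on its own.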
Are you developing your own methods for this secondary process?
Yes, we are developing some methods. I feel hesitant to get into too much detail on that, first because a lot of it is not published, and second because we ourselves are not confident about how well it works. It might very well change.
But we are certainly trying to exercise this orthogonality of scoring systems. We want a secondary scoring system that is as detailed as possible and uses pairwise potentials as fully as possible. That’s the combination we’re looking for.
Did you use this hydrophobicity scoring function at all for CASP4?
No, no, we weren’t at this point at all last time. Dave’s work on this happened mostly before CASP4. We had no notion of hydrophobicity playing such a crucial role in how to quantitate that.
I expect that we should actually be able to do a lot, but I’m sure we’ll make new mistakes this time. That’s the nature of this whole exercise. You recover some of your previous mistakes and then you go figure out how to make some new mistakes!
I guess the problem is when you keep making the same mistakes.
That’s why I appreciate the CASP exercise. Forget about the competitive aspect. In my mind that doesn’t figure at all. For me, this is what we did last time: very clearly we placed proteins in the right fold bucket, but we had large RMSD. To me that seems like a really big blunder. That’s the thing that I want to correct this time. I don’t think it will be 100 percent, but certainly in greater measure than last time, when we are in the right fold bucket, I want a lower RMSD structure, and that’s primarily a scoring function issue. Of all the possible ways you can get into that particular fold bucket, I want to be able to tell the one that has the lowest RMSD.
And that for me is comparing my performance or ability to predict structures this time with respect to what I did last time, regardless of what everybody else is doing. CASP is nasty from that point of view. Everybody puts the focus on the ranking and scoring of you versus everybody else, but in my mind, that doesn’t figure very heavily actually, because I have a measurement or a checkpoint that we established last time, and I know that these are the things I want to correct. I’m approaching this from a purely instructive point of view. If we understand what we did last time and attempt to correct what we did wrong last time, we should be able to do better, specifically on those things that we’re trying to work on, compared to ourselves last time.
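For readers unfamiliar with the measure, RMSD (root-mean-square deviation) between a model and a reference structure can be computed as in the sketch below. It assumes the two structures have the same number of atoms in the same order and have already been superimposed; real comparisons normally perform an optimal superposition first, which is omitted here. The function names and the toy coordinates are illustrative.

```python
import math

def rmsd(model, reference):
    """Root-mean-square deviation between two equal-length coordinate lists.

    Assumes the structures are already superimposed (no fitting is done here).
    """
    if len(model) != len(reference) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(model, reference))
    return math.sqrt(total / len(model))

def lowest_rmsd(models, reference):
    """Return the name of the model closest to the reference by RMSD."""
    return min(models, key=lambda name: rmsd(models[name], reference))


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
models = {
    "two_angstrom": [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)],    # every atom off by 2 Å
    "three_angstrom": [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)],  # every atom off by 3 Å
}
print(lowest_rmsd(models, reference))
```

In a blind prediction the reference is not available, which is why a scoring function has to stand in for this comparison when choosing among candidates in the same fold bucket.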
Where does this work fit in within the larger scheme of the computational biology group in IBM Research?
It fits very well. [Joe Jasinski, head of IBM’s computational biology group] has this really nice way of stating what computational biology at IBM is about. I subscribe to it wholly. Basically, CBC is about doing exploratory and very basic research at the interface of information technology and biology. The work we do in structure prediction is just one piece of that. We have all this information about sequences, structures in PDB, protein classifications in SCOP and so on, so we try to leverage all of that to gain more knowledge in biology. And that helps us, actually, provide knowledge, tools, techniques to people that we have a dialogue with — customers or collaborators. Structure prediction is an activity that many people in the field view as an essential post-genomic activity. We have sequences, but no knowledge of what these sequences might be up to, and any means of being able to derive information from the sequence is very useful in understanding what the protein is up to.
As we have dialogues with customers and collaborators and so on, we often hear that that’s a challenge they face. An example is our structural genomics partners. They’re out there trying to solve structures. For them, any sequence is game. What they’d like to know is which ones have a known fold, and then they’d like to be able to extrapolate function from that structure or sequence; they are doing that for new folds. That’s a necessity in the time window of the next five years as the fold space becomes populated. All methods of being able to predict structure are immensely useful within this time frame.