Name: Lars Jensen
Position: Staff scientist, Peer Bork group at European Molecular Biology Laboratory, 2006 to present
Background: Scientist, Peer Bork group at EMBL, 2005 to 2006; Postdoc work, Peer Bork group at EMBL, 2003 to 2004; PhD, Technical University of Denmark, 2002
Lars Jensen and collaborators from the European Molecular Biology Laboratory, the Samuel Lunenfeld Research Institute of Mt. Sinai Hospital in Toronto, and the Massachusetts Institute of Technology have created a computational method, dubbed NetworKIN, to help in the identification of kinase substrates.
The approach combines accepted sequence motifs with protein-association networks to predict which protein kinases target experimentally identified phosphorylation sites in vivo.
Their method in part depended on phosphoproteomic applications that identified such sites. For proteomics researchers, NetworKIN can help in the interpretation of data and allow them to go from the high-throughput data coming out of proteomic experiments to specific testable hypotheses that can be researched in low-throughput studies afterward, Jensen said.
Below is an edited version of a conversation ProteoMonitor had with Jensen this week.
Describe your approach.
What people have been doing for a long time is to look at protein sequences and try to identify so-called sequence motifs, that is, characteristic sequences that are targeted by certain modification proteins, called kinases. A lot of these phosphorylation sites are now identified by mass spec studies.
The problem with that is they identify the sites, but they have no idea which of the more than 500 human kinases is actually responsible for the phosphorylation of a given site. That’s what we are trying to predict.
What most people would do is simply to use these motifs to say, ‘Is this site most like something that would be phosphorylated by this kinase or by that kinase?’
It becomes pretty much a guessing game because the information content of the site is essentially too low, so the amount of information in the local sequence around this site is not sufficient to tell you which kinase is responsible.
Is there too much similarity among the phosphorylation sites to accurately infer which kinase is responsible?
Yes, they are very similar to each other and they are very weak. So for example, if you are looking at a phosphorylation site for a cyclin-dependent kinase, first of all, there are many cyclin-dependent kinases and the only recognition sequence that you typically have is a serine or a threonine followed by a proline.
And of course, that’s such a simple sequence that you’re going to have it occurring all over the place. Also, if you have a serine followed by a proline, not only could any one of these cyclin-dependent kinases phosphorylate the site, any one of the mitogen-activated protein kinases could also do it.
So the sequence of the site itself is, in many cases, insufficient to tell you which kinase is in play.
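To illustrate how little a serine or threonine followed by a proline narrows things down, here is a minimal Python sketch that scans a sequence for that two-residue pattern. The peptide and the function name `find_sp_tp_sites` are made up for illustration and are not part of NetworKIN.

```python
import re

def find_sp_tp_sites(sequence):
    """Return 1-based positions of every Ser/Thr that is immediately followed by Pro."""
    # The lookahead keeps the match on the phosphorylatable residue only.
    return [m.start() + 1 for m in re.finditer(r"[ST](?=P)", sequence)]

# Made-up peptide, purely illustrative (not a real protein).
toy_sequence = "MSPKTPLLSAPTPEESPK"
sites = find_sp_tp_sites(toy_sequence)
print(sites)
```

Even this 18-residue toy peptide contains several matching positions, and the pattern alone says nothing about which kinase acts on any of them.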
Of course, it’s obvious [that] information must be in there, because somehow, the cell is able to do it right. And the way the information is there in the cell is what we would describe as context. To have a kinase interact with a substrate, there might be other proteins involved that help it interact. It might be limited [in terms of] which kinases can do it depending on where in the cell they are, in which tissues they are expressed, so on and so forth.
So a large number of different circumstances, or context, are responsible for determining which kinases phosphorylate a given site.
What we’re doing is trying to build a context network where we use protein interaction information from high-throughput screens like yeast 2-hybrid screens and so on, but also more functional interaction evidence such as co-mention in the literature that gives you some idea that these two proteins probably have something to do with each other. You can also look at co-expression in microarray studies. So we put together a large number of different types of evidence to try to build a protein network that tells us which proteins seem to have something to do with each other.
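One simple way to picture merging several evidence types into a single association score is sketched below. It assumes each evidence channel gives an independent probability-like score between 0 and 1 and combines them so that any one channel can support the link. The interview does not say how the scores are actually combined, so the formula, the channel names, and the numbers are all illustrative assumptions.

```python
def combine_evidence(scores):
    """Combine per-channel scores, assuming the channels are independent.

    The result is the probability that at least one channel is right:
    1 - product of (1 - score) over all channels.
    """
    remaining_doubt = 1.0
    for score in scores.values():
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie between 0 and 1")
        remaining_doubt *= 1.0 - score
    return 1.0 - remaining_doubt

# Made-up scores for one hypothetical protein pair.
evidence = {
    "yeast_two_hybrid": 0.5,
    "literature_co_mention": 0.4,
    "co_expression": 0.2,
}
combined = combine_evidence(evidence)
print(round(combined, 2))
```

Under this assumption, several weak lines of evidence add up to a stronger link than any single one gives on its own.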
Now what we do is, if we have a certain substrate, a particular phosphorylation site, and we want to predict which kinase is most likely to phosphorylate this site, then first of all, we look at the site itself, and we see, for example, that it is an ‘s’ followed by a ‘p,’ which tells us it’s likely a cyclin-dependent kinase or a MAP kinase. That brings us down to the family level, possibly, as in this case, to multiple different families. Now we have to determine which of the many kinases belonging to these families are most likely.
Now we go into the protein network and we say, ‘Can we find a kinase of the appropriate type, which based on the network, seems to be functionally related to the substrate?’ Either we have a direct relation between them or [an indirect relation].
You’re somehow looking at a network neighborhood around the substrate. You’re looking for a kinase of the type that you’re after. That’s the algorithm.
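The two steps described above (motif first, network neighborhood second) can be sketched in Python as follows. This is a toy illustration, not the NetworKIN implementation: the family names, kinase names, motif table, network scores, and the rule of multiplying scores along an indirect path are all assumptions made for the example.

```python
import re

# Hypothetical motif table: both families recognize Ser/Thr followed by Pro.
FAMILY_MOTIFS = {"family_A": r"[ST]P", "family_B": r"[ST]P"}
# Hypothetical family membership.
FAMILY_MEMBERS = {
    "family_A": ["kinase_A1", "kinase_A2"],
    "family_B": ["kinase_B1", "kinase_B2"],
}
# Toy undirected context network with made-up confidence scores (0 to 1).
NETWORK = {
    ("substrate_X", "adaptor_Y"): 0.9,
    ("adaptor_Y", "kinase_A2"): 0.8,
    ("substrate_X", "kinase_B1"): 0.4,
}

def build_adjacency(network):
    """Turn the edge list into a symmetric neighbor lookup."""
    adj = {}
    for (a, b), score in network.items():
        adj.setdefault(a, {})[b] = score
        adj.setdefault(b, {})[a] = score
    return adj

def association(adj, source, target):
    """Best score over a direct link or a path through one intermediate protein."""
    best = adj.get(source, {}).get(target, 0.0)
    for middle, first_hop in adj.get(source, {}).items():
        second_hop = adj.get(middle, {}).get(target, 0.0)
        best = max(best, first_hop * second_hop)
    return best

def predict_kinases(substrate, sequence, position):
    """Rank kinases for the site at 1-based `position` in `sequence`."""
    # Step 1: the motif narrows the search to one or more kinase families.
    window = sequence[position - 1:position + 1]
    families = [f for f, pat in FAMILY_MOTIFS.items() if re.fullmatch(pat, window)]
    # Step 2: the network picks family members linked to the substrate.
    adj = build_adjacency(NETWORK)
    ranked = []
    for family in families:
        for kinase in FAMILY_MEMBERS[family]:
            score = association(adj, substrate, kinase)
            if score > 0:
                ranked.append((kinase, family, score))
    return sorted(ranked, key=lambda item: item[2], reverse=True)

predictions = predict_kinases("substrate_X", "AASPKK", 3)
for kinase, family, score in predictions:
    print(kinase, family, round(score, 2))
```

In this toy network the motif alone cannot separate the two families, while the network ranks `kinase_A2` (reached indirectly through `adaptor_Y`) ahead of `kinase_B1` and drops the kinases with no link to the substrate.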
It sounds like your method is based on some principles or other methods that have existed before.
We’re combining several existing things. The context network comes out of a database that I’m one of the developers of, the STRING [Search Tool for the Retrieval of Interacting Genes/Proteins] database. That database integrates a large number of different types of evidence such as literature mining, yeast 2-hybrid assays, and so on.
On the other side, we have the motifs, and those we also get from existing resources such as the NetPhosK [database] and the Scansite database. So we’re taking two motif resources plus a protein network resource and those we’re using as a starting point.
When it comes to predicting phosphorylation events, no one has used a network before. The whole idea of using a protein network to give the context, no one has done that before. So far, every prediction method that has been developed has completely ignored the context.
I don’t really know why people have ignored it. It’s kind of strange because it’s been known to quite a lot of people that context is important. I guess part of it is just that it is fairly difficult to make the network. It’s easy for me to use it because I was already developing a database with that kind of data.
For other people not working on protein interaction networks, it would be a whole lot to grapple with to be able to do that.
What we were able to do that others couldn’t do before was two things. First of all, existing methods could only really bring you to a class of kinases. You could say ‘This is likely to be phosphorylated by a cyclin-dependent kinase,’ but you couldn’t say which one. Whereas by using the network, we are able to pinpoint which specific kinase we are talking about.
The other thing is that when it comes to just predicting at the class level whether it’s a cyclin-dependent kinase or something else that is involved, we have much higher accuracy. The accuracy of our method is something like two-and-a-half fold higher than any method before.
How sure can a researcher be that they’ve identified the right kinase using your method?
It depends a lot on which class of kinase. Some kinases have stronger motifs than others, and if you have a strong motif to start with, then you can do really well. In the best case examples, like some of the kinases involved in DNA damage response, we can get to predicting the class of kinase with 80 percent accuracy. For some other classes where we do less well, you’re looking at something on the order of 50 percent.
In terms of getting the specific kinase, it’s difficult to say how precisely we can do that because experimentally it’s very difficult to show which precise kinase is responsible for the phosphorylation. There’s fairly little data where the truth is known, so it’s difficult to benchmark.
Our guess is that of the cases where we get the class right, we’re probably getting the right specific kinase in between one-third and one-half of the cases.
Are you in the process of testing this method and testing how accurate and specific you can get?
We’re working on extending the method to cover many more kinases. Also, there are a lot of predictions that we make with the existing method that are currently being followed up in the lab.
What are the applications for this method?
It helps us to dramatically speed up the pace at which you can deduce the signal transduction networks that are really run by kinases. The problem so far has been that the high-throughput methods can only identify phosphorylation sites, but they can’t tell you which kinase is responsible.
The methods for figuring out which specific kinase phosphorylates a given site are very much low-throughput methods, so you have to have a very good guess at which kinase [phosphorylated] a particular site in order to validate it. You can’t go out and systematically say, ‘OK, I’m interested in this site. Let’s try all 518 kinases,’ just because of the amount of work involved.
So by making better guesses computationally about which kinases are likely to be the ones responsible, you can then validate them experimentally. And of course, the more accurate predictions you have, the greater success you’re going to have in the lab.
Once you’ve achieved that, where would you take the research further?
In signal transduction, you have so-called upstream and downstream events. Upstream events, that’s the phosphorylation, so you have a kinase phosphorylating a particular site.
The so-called downstream events involve particular types of domains, perhaps the best known being the SH2 domain, that bind to particular sites only once they have become phosphorylated. You have a kinase phosphorylating a site, and subsequently another protein binds to it.
Now in principle, it should be possible to use the same strategy for predicting what’s going to happen downstream. If you have motifs for which types of sequences are recognized by different classes of SH2 domains and other phospho-binding domains, then combine that again with the same context network, then using a very similar strategy, you should be able to deduce the downstream signaling events that would stem from the phosphorylation.
That would really allow us to determine not just kinase-substrate networks, but actually whole signal transduction pathways because we can say, ‘When kinase A phosphorylates protein B, then protein C subsequently binds to it.’