At IBC’s Drug Discovery Technology conference in August, Frank Brown, senior research fellow and chemoinformatics team leader at Johnson & Johnson Pharmaceutical Research & Development, discussed a method his team developed for “virtual” high-throughput screening based on the idea of screening fewer, information-rich compounds.
BioInform recently caught up with Brown to find out more about this method, and to get his thoughts on the interface between cheminformatics and bioinformatics.
The virtual screening method you discussed at DDT wasn’t an in silico docking program, but a way to screen fewer compounds to get a higher hit rate. What is the role of informatics in this approach?
The issue really was, we had all these millions of data points and management wanted me to be able to mine it and tell them all kinds of information. And we started to do that on quite a few assays, and what we found was that on the first run of HTS there were no models we could build that would actually give you any signal to noise at all. Then we noticed that if we took just the confirmed data, which is about 2 percent of the overall data (so let's say 2,000 data points out of 100,000), and we looked at those data points, then we could actually use the [structure-activity relationship] models we built from that information, and then we could go back and pull out all the actives from the 100,000 without a problem. So we proved that you could get more information out of less data. So data itself is not information.
Now why is HTS at 100,000 [data points] not more information-rich than at a couple thousand data points? The issue there is if you have a 1 percent hit rate, which is a fairly high hit rate for HTS, and you have an assay that is 99 percent certain, which is a tremendously good assay — so I’m taking the best of all worlds — then I have one percent error and a one percent hit rate. I have an equal amount of misinformation as information.
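The arithmetic behind that answer can be checked with a short sketch. It uses the figures quoted above (100,000 compounds, a 1 percent hit rate, a 99 percent accurate assay) and assumes, for illustration only, that the 1 percent error applies equally to actives and inactives; the function name is hypothetical.

```python
def primary_screen_counts(n_compounds, hit_rate, assay_accuracy):
    """Estimate true and false positives in a primary screen.

    Assumption (not from the interview): the assay misclassifies
    actives and inactives at the same rate, 1 - assay_accuracy.
    """
    true_actives = n_compounds * hit_rate
    inactives = n_compounds - true_actives
    error_rate = 1.0 - assay_accuracy

    true_positives = true_actives * assay_accuracy   # real actives flagged active
    false_positives = inactives * error_rate         # inactives flagged active
    precision = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, precision


tp, fp, precision = primary_screen_counts(100_000, 0.01, 0.99)
print(f"true positives:  {tp:.0f}")
print(f"false positives: {fp:.0f}")
print(f"fraction of flagged compounds that are real: {precision:.2f}")
```

Under these assumptions the two counts come out at about 990 each, so roughly half of the apparent actives from the primary screen are noise, which is the "equal amount of misinformation as information" described above.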
So we took those supposed actives from primary [screen], and then found out which ones confirmed. And if they confirmed, we called them a 1 for active, and if they didn’t confirm, then we called them a 0, and we could model that. What we showed was that you could take a sampling of the database — we happened to sample it by clustering into chemical families. … We described it three different ways and clustered it three times, and then picked representatives out of those. … And if you think about a Venn diagram, that allows us to say that if one cluster hits in one method, it will point us back to the other cluster that’s missed in another one.
Then we were able to show that this consensus of clusters was much more adept at finding actives, to the point where we regularly find about 75 percent of the actives [by] testing about one-third or a little bit more than one-third of the collection.
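The consensus-of-clusters idea can be illustrated with a toy sketch. This is not the J&J implementation: the compound IDs, the three cluster assignments, the rule for picking a representative (first member in sorted order), and the set of confirmed actives are all invented for illustration.

```python
from collections import defaultdict


def group_by_cluster(assignment):
    """Turn {compound: cluster_label} into {cluster_label: [compounds]}."""
    clusters = defaultdict(list)
    for compound, label in sorted(assignment.items()):
        clusters[label].append(compound)
    return dict(clusters)


def pick_representatives(clusterings):
    """Pick one representative per cluster in every clustering.

    Assumption: the representative is simply the first member in
    sorted order; a real system would use a chemically motivated rule.
    """
    reps = set()
    for assignment in clusterings:
        for members in group_by_cluster(assignment).values():
            reps.add(members[0])
    return reps


def expand_hits(clusterings, rep_hits):
    """For each representative that hit, pull in its whole cluster
    under every description (the Venn-diagram step)."""
    follow_up = set()
    for assignment in clusterings:
        clusters = group_by_cluster(assignment)
        for hit in rep_hits:
            follow_up.update(clusters[assignment[hit]])
    return follow_up


# Three hypothetical descriptions of the same six compounds.
by_fingerprint = {"c1": "A", "c2": "A", "c3": "B", "c4": "B", "c5": "C", "c6": "C"}
by_shape = {"c1": "X", "c3": "X", "c2": "Y", "c5": "Y", "c4": "Z", "c6": "Z"}
by_property = {"c1": "P", "c4": "P", "c2": "Q", "c6": "Q", "c3": "R", "c5": "R"}
clusterings = [by_fingerprint, by_shape, by_property]

confirmed_actives = {"c5", "c6"}           # invented ground truth

reps = pick_representatives(clusterings)   # compounds actually screened
rep_hits = reps & confirmed_actives        # representatives that confirm
follow_up = expand_hits(clusterings, rep_hits)

print("screened:", sorted(reps))
print("follow-up set:", sorted(follow_up))
```

In this toy data, compound `c6` is never screened as a representative, but it lands in the follow-up set because its fingerprint cluster-mate `c5` hit. Six compounds are far too few to say anything about the one-third and 75 percent figures quoted above; the sketch only shows the mechanics.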
Is this type of informatics capability generally available with most HTS systems, or is it something that pharmaceutical companies typically design on their own?
The components are available. So there’s clustering software, there is mathematical modeling software, there are web interfaces. But what we were able to put together was a simple interface where they type in the number of their assay, and what they say is active, and from that point on, everything is just done. So we’ve built a very simple application that allows them to do what I call one-click science.
So it sounds like you added the glue to hold these components together.
We had to integrate them. That's a little bit of a simplification, in the sense that we had to do good science to say how each one of these components gets put together. That's why we had to do it in a scientific way, by stepping through it and figuring out what was the best process at each point. But once we knew that, we could assemble it.
Is that kind of approach common in cheminformatics? Compared to bioinformatics, it seems that there are a lot more commercial offerings for individual applications, but is there still a lot of in-house development when it comes to integrating them?
There is. There are two ways to do it: You can either build your own framework to plug them all in, or you can buy frameworks. But that’s just a philosophical difference, and I think it’s also the difference between big pharma versus biotech. A lot of biotechs will build their own, but a lot of big pharma likes to buy versus build.
Is it ever more than a philosophical difference? Does it matter in terms of productivity or quality?
There are cost advantages and disadvantages, there are capability advantages and disadvantages, and it’s six of one, half a dozen of the other. I think there is no way to prove either one completely wrong or completely right.
You said at DDT that you’re moving into the area of chemogenomics now. What are you doing in that area, and how are you using biological information in the context of the chemical information?
What we’ve been able to show is that, as we’ve collected millions of data points in our database, it’s very difficult to sift through those unless you have some 10,000-foot view of it. So what do you use? What I use is — instead of using [the terms] chemogenomics or systems biology anymore — I call it biological classification.
So the idea is that any time you have too much data you need to put them into stacks that are organized in some way you can make sense of them. And then that stack can be further divided, into smaller stacks and smaller stacks and smaller stacks, until you get to the point where you have a bird’s eye view on the data. So that’s really what I think chemogenomics and so forth is doing for us. It’s taking tons of data and sorting it either into a biological classification or a pathway … and associating lots of data into that group.
One example where I used classification is in an area like kinases. You may have 20 assays in each [type of kinase] in your organization. So this thing would scan every classification that we have, and we have several — maybe 75 or so different classifications — and it would say OK, you have 5,000 hits on 500 molecules, and of those 5,000 hits, 1,000 of them were in tyrosine kinases. Then you might use your list of actives, and say tell me which assays these hit in. Now I take all the assays underneath the umbrella of tyrosine kinases, and look at those. Maybe I have 15 that are active in four. So then you might say, for these four, how similar are their sequences? Now I’m into bioinformatics. I’m diving from chemistry all the way back to the sequence. So that’s really chemogenomics.
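The drill-down described in that answer (hits rolled up by biological classification, then broken out by assay within one class) might look roughly like the following. The assay names, the classification map, and the hit list are all hypothetical; the interview does not describe the actual schema.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from assay ID to biological classification.
ASSAY_CLASS = {
    "TK-01": "tyrosine kinase",
    "TK-02": "tyrosine kinase",
    "SK-01": "serine/threonine kinase",
    "GP-01": "GPCR",
}

# Hypothetical (molecule, assay) pairs where the molecule was active.
HITS = [
    ("mol-1", "TK-01"),
    ("mol-1", "TK-02"),
    ("mol-2", "TK-01"),
    ("mol-2", "GP-01"),
    ("mol-3", "SK-01"),
]


def hits_per_class(hits, assay_class):
    """The 10,000-foot view: count hits in each classification."""
    return Counter(assay_class[assay] for _, assay in hits)


def assays_hit_in_class(hits, assay_class, wanted):
    """Drill down: within one classification, which molecules hit which assays."""
    per_assay = defaultdict(set)
    for molecule, assay in hits:
        if assay_class[assay] == wanted:
            per_assay[assay].add(molecule)
    return dict(per_assay)


summary = hits_per_class(HITS, ASSAY_CLASS)
detail = assays_hit_in_class(HITS, ASSAY_CLASS, "tyrosine kinase")

print(summary.most_common())
print(detail)
```

The first call gives the per-classification totals, and the second narrows to the assays under one umbrella, which is the point where sequence comparison of the targets would begin.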
And maybe only one of those four has a crystal structure in the public domain, but the sequences we know, so we can then do homology modeling and see how maybe a one-residue change in the active site makes one molecule active and another one not active. We're to the point now where you can get right down to those molecules. What we haven't done is jumped into the sequence world. However, we are close, and hope to do that soon.
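As a very rough sketch of the sequence-side question (where do two related targets differ within the active site?), the comparison could start from two already-aligned sequences. The sequences and site positions below are made-up toy strings, not real kinases, and the function name is hypothetical; real work would use a proper alignment and a structural model.

```python
def differing_site_residues(seq_a, seq_b, site_positions):
    """Return (position, residue_a, residue_b) for each active-site
    position where two pre-aligned sequences differ.

    Assumption: seq_a and seq_b are already aligned and equal in length,
    and site_positions are 0-based indices into that alignment.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, seq_a[pos], seq_b[pos])
        for pos in site_positions
        if seq_a[pos] != seq_b[pos]
    ]


# Toy aligned fragments (invented), differing at one position.
target_a = "GKGSFGTVYKA"
target_b = "GKGSFGMVYKA"
active_site = [4, 5, 6, 7]  # hypothetical active-site positions

differences = differing_site_residues(target_a, target_b, active_site)
print(differences)
```

A single differing residue found this way is only a candidate explanation for a selectivity difference; it would still need to be examined in a homology model, as described above.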
So this is something that would still involve some collaboration between the bioinformatics and cheminformatics worlds?
Right, and I think that’s exactly where the junction starts — when it involves mapping assay data to the sequence — because now they’re the masters of the sequence world and we’re the masters of the chemical world. We understand how to manipulate molecules and break them down into their smaller pieces. They really know how to join and cluster and bring together sequences. So the worlds start coming together when you want to map how a molecule might actually interact in that context.
So in some sense, it’s back to your original question. [My talk at DDT] was about the chemical description itself, and not docking it to an active site. There’s a reason for that. We do [docking] quite a bit, but it’s not as successful as what we’re doing now in HTS. And that’s because no one can quickly give you a good score for how well that molecule docks into that site in a particular pose. That’s the failure of docking; it doesn’t work real well. It allows you to find the molecules that will fit in the box, but we can’t really do a great job of predicting exactly how it fits in the box. There are a whole bunch of molecules that just won’t fit in the box, so it’s a great success in eliminating 90-some percent of your collection, but it’s not very good at telling you exactly which ones are the best ones.
But we’re doing new things in that area that are even faster, so that is something we’re interested in. And if you then compare one kinase to another kinase in the sequence world, then you’re really starting to bridge bioinformatics and cheminformatics — when you need to bridge chemical families to structural families.
So is that a technology bridge, or more of a collaborative and social bridge?
It has to start as a collaborative and social one so that we understand each other’s space a little bit better. And then it becomes a software nightmare, because cheminformatics doesn’t even interact together very well, let alone jumping right into their space. So we need to find a way to bridge those components together as well.