BOSTON — Pharmaceutical researchers working in the area of computer-aided drug design are coming up against the limitations of current methodologies, prompting some to question the industry’s reluctance to share data.
During a panel discussion at CHI’s Virtual Screening and Structure-Based Drug Design Conference, held here last week, several researchers from pharmaceutical firms said that additional structural data for proteins and inhibitors will be required to refine current computational approaches. However, with very little of this information available in the public domain, these pharma groups are currently limited to what they can generate behind their own walls.
“There must be some way for us to share that data without legal restrictions,” said Prabha Karnachi of the CADD group at Johnson & Johnson Pharmaceutical Research and Development. “Pharmaceutical companies have a lot of that information just sitting there, but it’s not easily accessible.” Pooling that information, Karnachi said, would be “the next step” toward advancing current methods.
Dan Cheney, principal scientist in the CADD group at Bristol-Myers Squibb, said that “sharing data with collaborators has been a nightmare” for his group. Pharmaceutical companies tend to “err on the side of being cautious” when it comes to data sharing, he said, in some cases requiring outside collaborators to come into the company to work directly on BMS computers. “We need some kind of overarching infrastructure that facilitates the sharing of data,” Cheney said. “We need to figure out a way of partnering across company lines and across egos.”
Experimental structural data from X-ray crystallography and NMR is a crucial component for improving virtual screening approaches. A particular stumbling block for developers now is moving beyond so-called “rigid” docking methods to account for the flexibility of proteins and ligands. Additional examples of real-world protein-ligand complexes would be one way for algorithm developers to improve the accuracy of their methods, but these structures are hard to come by.
Richard Friesner, a chemistry professor at Columbia University and developer of the Glide docking algorithm marketed by Schrödinger, estimated that a “large, robust database of 100 receptors and 100 ligands per receptor” would give software developers a much better idea of how well their methods perform. “If there was a way to work together to generate that data set, that would be good, but I guess we’re on our own,” Friesner said.
It’s clear that current computational methods could use some improvement. Friesner noted that even in “self-docking” experiments — which use a ligand that is known to bind in a particular way to a protein — Glide and other current docking algorithms still get it wrong about 10 percent to 15 percent of the time.
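In self-docking benchmarks of the kind Friesner describes, success is typically judged by how far the top-ranked predicted pose lies from the crystallographic pose. A minimal sketch of that evaluation is below; the `rmsd` helper and the 2.0 Å cutoff are assumptions reflecting common practice, not figures from the panel.

```python
import math

# Hypothetical helper: heavy-atom RMSD between a predicted ligand pose and
# the crystal pose. Each pose is a list of (x, y, z) coordinates with atoms
# in the same order.
def rmsd(pose_a, pose_b):
    assert len(pose_a) == len(pose_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(pose_a, pose_b))
    return math.sqrt(sq / len(pose_a))

# A self-docking run "succeeds" if the top-ranked pose falls within a cutoff
# of the crystal pose (2.0 A is a common convention, assumed here).
def self_docking_success_rate(predicted_poses, crystal_poses, cutoff=2.0):
    hits = sum(1 for p, c in zip(predicted_poses, crystal_poses)
               if rmsd(p, c) <= cutoff)
    return hits / len(crystal_poses)
```

A 10-to-15-percent failure rate in this framing means the top-ranked pose lands outside the cutoff for roughly one complex in seven to ten.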
Sandor Vajda, professor of biomedical engineering at Boston University, provided another example of the limitations of current approaches, pointing out that a computational model on Accelrys’ website used to illustrate its Insight II software is “totally and absolutely wrong.” The model, which illustrates how the MCSS (Multiple Copy Simultaneous Search) algorithm maps the binding site within the protein, shows a number of functional groups docked at different binding pockets. The problem, Vajda said, is that these molecules have been experimentally proven to bind at the same site in the protein. They should overlap in the computational model, but they don’t.
Vajda said that his team has developed a new protein-mapping algorithm called CS-Map (computational solvent mapping of proteins) that generates “more meaningful binding pockets” than MCSS, GRID, or other approaches. The method uses “better sampling” than these algorithms, Vajda said, and also ranks clusters of small molecules and functional groups, rather than individual conformations. His lab is in the process of a “large-scale mapping project” using the software that ultimately aims to map the binding sites of a number of key enzymes. The data from the project will be released through the PRECISE (Predicted and Consensus Interaction Sites in Enzymes, http://precise.bu.edu/) database hosted by his lab. The current version of the database is based on interactions extracted from PDB structure files.
Overcoming the Dearth of Data
In the meantime, researchers are getting around the lack of available structural data in different ways. Alexander Hillisch, global head of computational chemistry at Bayer HealthCare, discussed the use of homology modeling to generate structures for proteins that have not yet been experimentally determined.
Hillisch noted that the TrEMBL protein sequence database contains around 1.2 million entries, but the Protein Data Bank only contains around 28,000 structures. While conceding that homology models are generally “inferior” to experimental structures, they are useful for “bridging the gap,” he said.
The key for drug discovery, Hillisch noted, is the sequence identity between a target protein and its closest homolog in the PDB. Generally, he said, at least 50 percent sequence identity is required for most drug-discovery applications. [For more details on these requirements, see a Q&A with Hillisch in BioInform 1-17-05.]
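The threshold Hillisch describes is a screening step over an alignment of the target against its closest PDB homolog. A minimal sketch of that check, assuming the alignment (with '-' for gaps) has already been produced by a tool such as BLAST:

```python
# Percent identity over the aligned, ungapped columns of a pairwise
# alignment. The aligned strings are hypothetical inputs; in practice
# they would come from an alignment program.
def percent_identity(aligned_target, aligned_template):
    assert len(aligned_target) == len(aligned_template)
    cols = [(a, b) for a, b in zip(aligned_target, aligned_template)
            if a != '-' and b != '-']
    matches = sum(1 for a, b in cols if a == b)
    return 100.0 * matches / len(cols)

# Flag whether the template clears the ~50 percent identity rule of
# thumb Hillisch cites for drug-discovery applications.
def suitable_template(aligned_target, aligned_template, threshold=50.0):
    return percent_identity(aligned_target, aligned_template) >= threshold
```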
As an example of this approach in practice, Hillisch described a project at Schering, his former employer, in which the structure for the ER-beta protein was derived through homology modeling using ER-alpha as the template. Molecular dynamics simulations indicated that the binding pockets of the two proteins differed only in the flexibility of two amino acids. The Schering team used this knowledge to design compounds that were selective for each of the two proteins — the first example of ligands designed using a homology model, according to Hillisch.
BMS’s Cheney offered another approach that accounts for variability in protein structure to improve the accuracy of docking. The approach, called “protein ensemble docking,” docks a set of ligands against multiple conformations of the same protein, and combines the results into one large data set that is then rescored and reranked.
In an experiment with several conformations of the CDK2 protein and a set of 92 chemotypically diverse ligands, Cheney said that the ensemble-based approach had a success rate of 77 percent, while the success rate using a single protein conformation was 30 percent.
Cheney admitted that one “bottleneck” in the approach is that protein ensembles are difficult to generate when only one crystal structure is available for a protein. However, he noted, as a “proof of concept,” it illustrates the limitations of docking into a single rigid protein conformation.
Columbia’s Friesner suggested that even if there is only one crystal structure available for a protein, Cheney’s approach could still work using so-called “induced-fit” docking, which uses the structure of an active ligand to determine the shape of the protein binding pocket. “If you have multiple actives, you could use them to generate induced-fit structures that would generate the [protein] ensemble,” he said.
Friesner’s advice in the case when there is only one protein crystal and one active? Avoid computational approaches altogether and “do high-throughput screening.”