NIGMS Allots $5M for New Database to House Protein-Ligand Data; Pharma to Contribute

Continuing its push to expand open-access computational chemistry and cheminformatics tools, the National Institute of General Medical Sciences has set aside funding for a new database of protein-ligand information.
NIGMS said that the database, called the Community Structure-Activity Resource, will help advance computer-aided drug design methods and software development. A key aspect of the resource will be the participation of pharmaceutical firms, which will contribute unpublished data, NIGMS said.
The resource will include data from academic labs and pharma that will help scientists study a complex component of drug development: the structure and binding properties of protein-ligand complexes.
CSAR will receive an estimated $5 million over five years, the agency said. The project is in line with a goal that NIH has pursued since it launched its Molecular Libraries Program several years ago to expand publicly available cheminformatics tools and resources.
“There is an effort to try to engage the NIH-funded community more in the small-molecule world,” NIGMS director Jeremy Berg told BioInform.
Medicinal chemist Heather Carlson of the University of Michigan's College of Pharmacy will oversee the creation and operation of CSAR. Her work focuses on physical chemistry and cheminformatics, including developing methods and computational approaches and tools to study atomic-level details of protein-ligand binding. She and her team have also curated Binding MOAD, or Mother of All Databases, which contains over 11,000 protein-ligand complexes.
‘Sort of Stuck’
The intended users of CSAR are scientists who develop docking and scoring routines for structure-based drug design, Carlson told BioInform in an e-mail. “Our goal is to push the field to make significant improvements to existing approaches and spur new approaches.”
Berg said the idea arose in part from discussions between computational scientists in academia and pharma over the last several years, including several NIGMS workshops on challenges in docking and virtual screening. Representatives from multiple companies openly admitted at these meetings that “they were sort of stuck,” he said.
According to Berg, computational chemists in pharma companies would like to be able to take a database of millions of compounds and “say something intelligent about which ones were likely to bind to a target, and how well.” But methods to do that task effectively have been lacking.
“They wanted to try to get the academic community more aware of their problems and more engaged in trying to improve the methods,” Berg said.
As the 2005 workshop summary indicates, participants acknowledged that structure-based ligand discovery “had reached a plateau,” and the report for the following year’s meeting notes that progress toward computational tools for molecular docking and in silico screening “would be significantly faster” if research groups had access to common, high-quality data sets. Those datasets would be used for benchmarking algorithms and further research in docking and scoring.

After the firms “did their homework within their companies,” said Berg, they unearthed datasets from both successful and shelved drug development series of many compounds bound to a target for which they had resolved the crystal structures. “These datasets were exactly what the academic community was lacking, of having a set of 10 or 20 different compounds bound to the same target; those are things that are very hard to get funded in academia to do,” he said.
The discussions between academic and commercial computational chemists, Berg said, led NIGMS to solicit proposals to create an academic center that would identify and obtain “the best dataset available for the computational chemistry community to really chew on and see if they could improve the methods.”
While NIGMS expects pharmaceutical firms to participate, the agency has not disclosed the names of any firms that plan to donate their data. Janna Wehrle, NIGMS program officer in cell biology and biophysics, said in an e-mail to BioInform that “many [companies have] announced their enthusiasm,” and that Carlson’s team is only beginning efforts to bring companies on board for CSAR.
According to summaries of the NIGMS workshops, industry participants included representatives from Bristol-Myers Squibb, GlaxoSmithKline, Merck, Oxford Bioscience Partners, Structural GenomiX, and Vertex Pharmaceuticals. BioInform contacted computational chemists at several of these companies, but they either declined to comment or did not return calls before press time.
CSAR's progress will be measured in several ways, said Berg. The first is whether companies follow through: “Now that the resource has been funded, do they really contribute the datasets?” Over the longer term, he is keen to see whether the resource leads to new methods and new publications based on these “richer datasets.”
Ready to Jump
Carlson’s team will be working with Shaomeng Wang, another computational chemist at the University of Michigan and the originator of PDBbind, a database similar to Binding MOAD that houses experimentally derived binding affinity data for protein-ligand complexes in the Protein Data Bank.
Carlson said that because she and Wang had experience putting together other similar datasets, “it was natural for us to propose expanding in this new direction.”
“I believe CSAR has the potential to be a valuable resource for researchers aiming to raise computational methods of drug design to the next level,” said Michael Gilson, CSAR advisory board member and a researcher at the Center for Advanced Research in Biotechnology at the University of Maryland Biotechnology Institute.
“My understanding is that the pharmaceutical industry will be releasing hitherto proprietary data sets for incorporation into CSAR. If so, these releases will complement and expand existing public collections of protein-small molecule binding data,” he said. 
Gilson said that CSAR is in line with a broader trend “toward expanding public access to protein-small molecule interaction data.” 
Indeed, publicly available cheminformatics resources have flourished in the past few years, such as the National Center for Biotechnology Information’s PubChem, the University of Alberta’s DrugBank, and a resource developed in Gilson’s lab called BindingDB, a database of measured binding affinities with a focus on protein-ligand interactions.  
CSAR was designed from the start to have industry input. Carlson said that several computational chemistry researchers at pharmaceutical firms serve on CSAR’s advisory board and will be involved in developing the resource. “Once we have developed a routine, we will contact them to join the participation. It really will be a grass-roots effort of me e-mailing and calling people,” she said.
The resource is not going to solve all drug-discovery and -development headaches but stands a chance of opening some of the bottlenecks, according to Carlson. “This is intended to improve codes for drug design; designing an inhibitor is just one small step in making a drug molecule,” she said.
Scientists in the pharmaceutical industry, she said, do not have time to develop new computational code. “We all need better software, pharma included, and we need the community to come together to pool their data.” 
The name CSAR is a play on quantitative structure-activity relationship, or QSAR, a classic computational chemistry technique in which researchers use physicochemical information about ligands, together with their measured affinities, to model binding to a particular target.
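For readers unfamiliar with QSAR, the idea can be illustrated with a minimal sketch. The example below is not drawn from CSAR or from Carlson's work; it assumes a single hypothetical descriptor per ligand (for instance, a lipophilicity value) and made-up affinities expressed as pKd, and fits a straight line by ordinary least squares. Real QSAR models typically use many descriptors and more careful validation.

```python
# Toy QSAR sketch: relate one physicochemical descriptor to binding affinity.
# All numbers are invented for illustration; they are not CSAR data.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Predict affinity (pKd) for a ligand with descriptor value x."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training set: descriptor value and measured pKd per ligand.
descriptors = [1.0, 2.0, 3.0, 4.0]
pkd_values = [5.1, 5.9, 7.1, 7.9]

model = fit_line(descriptors, pkd_values)
estimate = predict(model, 2.5)  # predicted pKd for an untested ligand
```

The fitted line can then be used to estimate the affinity of a ligand that has not been measured, which is the basic use QSAR models are put to.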
CSAR is going to use data to propel software development. “CSAR will be data, lots of data,” Carlson said. “Our field has kind of hit a wall and improvement has slowed. We believe we are missing the data necessary to improve our algorithms, equations, and models for drug design.”
Existing software needs to “better represent the high-quality data we will provide,” she said. While users might end up depositing software code in CSAR, she believes that improving existing code “will probably be the first line of attack, rather than scrapping everything and starting over.”
Within the CSAR project, the plan is also to run evaluations and competitions to set benchmarks in the field. “This is one of the most fun aspects for my lab. By drafting interesting datasets, we can ask some important questions about how data affects algorithmic development,” she said.
The Mother of It All
Some of the initial information for CSAR will come from a set of MOAD structures, Carlson explained. Binding MOAD contains over 11,000 complexes that have been manually curated from the Protein Data Bank and the literature. “We've read about 10,000 crystallography papers to add binding data and classify ligands as valid binders or invalid crystallographic additives.” 
Most recently, Carlson’s team has used text mining techniques to help pre-sort the content in a method they call guided reading. “This data is too complex to pull out solely from text mining,” she said. Collaborator Peter Dresslar, from bioinformatics firm TorreyPath, wrote an interface that preprocesses the crystallography papers and highlights appropriate phrases about binding data.
CSAR will not be as inclusive as MOAD but rather aims to be “highly selective,” Carlson said. “It will be a highest quality set, focused on [dissociation constants] and will contain information about some protein-ligand structures, but much more about ligands without structural data.”
Much of the information for CSAR will be experimentally derived, she said, as she and her team generate their own dissociation constants for select protein-ligand systems. “It is the most accurate way to measure how well a molecule binds to a protein,” she said.
Carlson also plans to verify some of the data deposited by industry using isothermal titration calorimetry, a “very detailed” method that most companies do not use but that is “arguably the best way to measure” dissociation constants, she said.
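Dissociation constants matter to developers of scoring methods because they map directly onto a binding free energy through the standard thermodynamic relation ΔG° = RT ln(Kd), with Kd expressed in mol/L relative to a 1 M standard state. The short sketch below shows that conversion; the function name, the temperature, and the example Kd are illustrative assumptions, not values from CSAR.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 1.987204e-3

def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy in kcal/mol from a dissociation constant.

    Uses dG = R * T * ln(Kd), with Kd in mol/L (1 M standard state).
    Tighter binders (smaller Kd) give more negative values.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)

# Hypothetical example: a 1 nM binder at 25 degrees C.
dg_nanomolar = binding_free_energy(1e-9)
```

Under these assumptions a 1 nM ligand corresponds to roughly -12.3 kcal/mol, and each tenfold change in Kd shifts the value by about 1.4 kcal/mol at room temperature.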
While many facets of the project are still evolving, MOAD has already taught the team that the current backend for the existing databases will not be sufficient for CSAR, mainly due to software unwieldiness, Carlson said.
“We went way too complicated on the original construction of MOAD,” she said. MOAD is based on the Java 2 Platform, Enterprise Edition, using an open-source JBoss Application Server, Enterprise JavaBeans, and a MySQL database backend. 
“We're starting over from scratch and building on the success of the National Resource for Proteomics & Pathways, which is developed by a colleague at [the University of] Michigan, Phil Andrews,” she said, explaining that this community effort has been able to effectively pull together new tools for processing proteomic data.
“We hope to see the same success, but what tools arise and where it evolves is hard to predict at this time,” she said.
