The Protein Data Bank said this week that it will stop accepting theoretical models on Oct. 15 in an effort to draw clearer distinctions between experimentally determined protein structures and those predicted using computational methods.
The decision, the result of a workshop that the PDB hosted last fall, highlights some of the current limitations of computational structural biology.
While ab initio and homology modeling methods have made great strides in recent years, the phase-out of these models from the biological community’s premier protein structure database underscores that computational approaches remain no substitute for experimental methods.
The PDB’s decision also raises a number of questions regarding the best mechanisms for storing and managing theoretical protein models. While one outcome of the fall workshop was a recommendation that a new portal be created to host these models, there are no obvious candidates among existing resources for taking on that additional responsibility.
Furthermore, there is little consensus in the structural biology community regarding assessment methods, curation policies, and data standards for theoretical models — all of which would need to be resolved before creating a centralized resource.
Nevertheless, many in the field view the move as a step in the right direction. Kevin Karplus, who heads the protein-structure prediction group at the University of California, Santa Cruz, noted that the PDB will “do far more for the field by improving the amount and quality of the experimental data they have, and eliminating theoretical models from their database.”
Karplus, in an e-mail interview with BioInform, explained that “many bioinformatics applications use all of PDB as a data source, whether to know what proteins have already had their structures solved or to determine statistics about structures.” Theoretical models, when mixed in with experimentally derived structures, are “contaminants” and a “nuisance,” he said.
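Karplus’s “contaminant” concern can be illustrated with a minimal sketch. The entry records and method names below are invented for illustration — real PDB entries record the experimental technique in format-specific fields — but the point is the same: any application computing statistics over the whole archive must be able to filter out entries that were not experimentally determined.

```python
# Hypothetical sketch: why theoretical models mixed into an archive of
# experimental structures skew downstream statistics. Entry records and
# method strings here are invented, not the PDB's actual schema.

entries = [
    {"id": "1ABC", "method": "X-RAY DIFFRACTION", "resolution": 1.8},
    {"id": "2DEF", "method": "SOLUTION NMR", "resolution": None},
    {"id": "9ZZZ", "method": "THEORETICAL MODEL", "resolution": None},
]

# Methods considered experimental for this sketch.
EXPERIMENTAL_METHODS = {"X-RAY DIFFRACTION", "SOLUTION NMR",
                        "ELECTRON MICROSCOPY"}

def experimental_only(records):
    """Keep only experimentally determined structures."""
    return [r for r in records if r["method"] in EXPERIMENTAL_METHODS]

filtered = experimental_only(entries)
print([r["id"] for r in filtered])  # theoretical model excluded
```

With the theoretical model sequestered (or removed outright), a consumer of the archive can trust that every entry reflects a physical sample, which is exactly the boundary Berman describes below.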
In fact, the PDB already took steps to sequester theoretical models from experimental structures in 2002. Since then, the main archive has only included structures determined using experimental methods — currently 38,198 structures. Theoretical models are kept in a separate location in the PDB’s FTP archive, with a separate search interface, and currently number 3,968.
Even so, the PDB’s policy regarding these models remained “ambiguous” and required clarification, according to a report of the fall workshop, published in the Aug. 16 issue of the journal Structure.
Helen Berman, director of the Research Collaboratory for Structural Bioinformatics, the consortium that manages the PDB, said that the workshop, held at Rutgers University Nov. 19-20, 2005, “brought together people from the modeling community, people who use models, such as the cryo-[electron microscopy] community, and people in structural genomics where they’re going to generate large numbers of models from the experimentally determined structures.” The goal of the workshop, she said, was to explore “what should and should not be in the PDB.”
The workshop resulted in three primary recommendations (see sidebar), with the key proposal being the elimination of theoretical models from the resource. In a nutshell, Berman said, “If the coordinates of a structure were derived from a physical sample, they belong in the PDB; otherwise they do not belong in the PDB.”
Now What?
The PDB workshop participants also called for the creation of a centralized “portal” for accessing peer-reviewed theoretical models. Berman told BioInform that this portal would be “intimately linked with the PDB.”
As described in the Structure paper, the portal wouldn’t be a centralized repository so much as a “collection of descriptions of resources, and pointers to those resources.” Even so, the proposed portal would require a data standard, and “each model should be accompanied by an estimate of its accuracy.”
The paper also recommends that “authors who use models in their publications (either created by themselves or obtained from a modeling site) deposit these models in a publicly available archive (to be established) to ensure access for peer review.” This archive would also be accessible from the proposed portal.
Finally, the paper’s authors note, “Each model submitted to the model archive will be curated. Models and metadata will be checked for proper nomenclature and quality assessment requirements. Each model will be issued a stable, unique identifier that can be included in the publication.”
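The curation workflow the authors describe — validate metadata, check quality-assessment requirements, then issue a stable identifier — can be sketched as a simple record-validation routine. The field names and the “MA-000001”-style identifier scheme below are assumptions for illustration only; the workshop report does not specify a format.

```python
# Hypothetical sketch of the proposed model-archive curation step:
# check required metadata and issue a stable, citable identifier.
# Field names and the ID scheme are invented for illustration.

import itertools

REQUIRED_FIELDS = {"title", "sequence", "method", "accuracy_estimate"}

_counter = itertools.count(1)  # stand-in for a persistent ID registry

def curate(model):
    """Validate a submitted model's metadata and assign a unique ID."""
    missing = REQUIRED_FIELDS - model.keys()
    if missing:
        raise ValueError(f"missing metadata: {sorted(missing)}")
    model["id"] = f"MA-{next(_counter):06d}"
    return model

submission = {
    "title": "Homology model of a hypothetical kinase",
    "sequence": "MKT...",
    "method": "homology modeling",
    "accuracy_estimate": "predicted RMSD 2.5 A to template",
}
curated = curate(submission)
print(curated["id"])
```

Note that a submission lacking an accuracy estimate would be rejected outright — mirroring the report’s requirement that “each model should be accompanied by an estimate of its accuracy.”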
But it is still unclear who would be responsible for developing or maintaining such a portal. Several databases for computationally derived protein structures exist, such as ModBase at the University of California, San Francisco, but there is currently no clear equivalent to the PDB in the computational structural biology community.
A potential alternative may be on the horizon, however. Late last year, the National Institute of General Medical Sciences issued a request for applications for a program called “Structural Genomics Knowledgebase” under its Protein Structure Initiative, which moved into a five-year production phase last July.
The goal of PSI-2 is to experimentally solve “about 4,000 unique structures that will be used as templates for homology modeling,” according to the NIGMS. Therefore, a key role for the proposed structural genomics knowledgebase would be to support computational modeling of protein structures.
In the RFA, NIGMS urges applicants to “collaborate with the modeling community to develop filters and assessment criteria that only allow the best available models to be selected for posting. These models should have designation of quality/confidence levels and potential utilities.”
Furthermore, “The applicant should have plans for continually reevaluating and improving model assessment methods and promoting the most promising models,” and the final knowledgebase should also “provide links to modeling tools and servers developed by other groups, allowing users to generate models using these tools, and provide computational model quality evaluations and uncertainty estimations to the users.”
But Jiayin Li, program director in the Division of Cell Biology and Biophysics at NIGMS, said that the scope of the proposed structural genomics knowledgebase would likely be a bit narrower than what the PDB workshop participants proposed.
“One task for the structural genomics knowledgebase is to develop the capability to support and promote computational modeling of protein structures, but the RFA didn’t give very specific instructions in terms of the scale and extent of the user community they should be serving,” Li said. “The structural genomics community, in comparison to the structural biology community, is much smaller, and as the title [of the RFA] suggests, the knowledgebase should be primarily supporting and serving the structural genomics community.”
While Li noted that the phase-out of computational models from the PDB has created a “need for a resource to replace the role that PDB has played in the past,” he added that “whether that will be the structural genomics knowledgebase or a different resource is unclear.”
Li said that computational methods “have the capability to produce millions of models in a relatively short period of time, and the real question is how useful, how accurate and reliable those models are.” Another issue, he said, “is how to develop better quality-assessment standards that will facilitate the evaluation and annotation of the models.”
Currently, “there’s no clear alternative for the computational protein structure model database – where and when that will be established,” Li said. “A lot of people are interested in this topic, but we don’t have a clear solution yet.”
The PDB’s implementation plan for phasing out theoretical models by Oct. 15 is available on the PDB website.