Users of the Protein Data Bank may not even be aware of some of the biggest changes the resource has undergone recently — and that’s exactly how the PDB staff wants it.
Just a few months after completing a major overhaul that included reformatting all the current data holdings, the Research Collaboratory for Structural Bioinformatics is readying the revamped PDB for a June alpha release, said PDB co-director Phil Bourne.
But while major changes are taking place under the hood of the PDB, users have had uninterrupted access and have witnessed no slowdown in deposited structures.
Bringing the 30-year-old project up to date unobtrusively wasn’t an easy task for Bourne and the other members of the RCSB, a consortium composed of Rutgers University, the San Diego Supercomputer Center, and the National Institute of Standards and Technology. When the RCSB took over management of the PDB from Brookhaven National Laboratory in 1998, Bourne said, the resource began as a “patchwork quilt” of software and databases from the three separate sites, which was not an optimal arrangement for a project hoping to provide a streamlined and easily accessible portal for structural bioinformatics research.
A further challenge came in the form of the PDB format used for data files since the inception of the database, a format that worked well enough for the 80-character punch cards it was designed for but cannot represent complete structural data in machine-readable form. While an alternative format, mmCIF (Macromolecular Crystallographic Information File), was easy to decide upon, reprocessing almost 30 years’ worth of PDB legacy data and reformatting it as mmCIF files wasn’t. Additionally, only a limited number of structural bioinformatics software packages are mmCIF-friendly, Bourne said, so widespread adoption of the new format could take a while, even though the PDB is mmCIF-based internally.
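As a rough illustration of what a punch-card-era format means for software, the sketch below reads one atom record by column position. The column ranges follow the published PDB ATOM record layout; the sample line and the function name parse_atom_record are made up for this example and are not taken from the PDB's own tools.

```python
def parse_atom_record(line):
    """Slice one fixed-width PDB ATOM record into named fields.

    Every field is identified only by its column position, so a value
    that outgrows its columns has nowhere to go.
    """
    line = line.ljust(80)  # pad short lines to the full 80 columns
    return {
        "record": line[0:6].strip(),      # columns 1-6
        "serial": int(line[6:11]),        # columns 7-11 (five digits at most)
        "atom_name": line[12:16].strip(), # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain_id": line[21].strip(),     # column 22 (a single character)
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }


# Hypothetical sample record, invented for illustration only.
sample = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67"
atom = parse_atom_record(sample)
print(atom["atom_name"], atom["res_name"], atom["x"], atom["y"], atom["z"])
```

Because the serial number has only five columns and the chain identifier only one, very large structures strain a layout like this, which is one way to read the article's point about machine-readable completeness.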
Keeping Everybody Happy
Added to the list of challenges is the fact that the PDB must meet the needs of the two very different communities it serves: the X-ray crystallographers and NMR spectroscopists who deposit their structures, and the biologists who use them. Depositors wanted a fast and accurate pipeline to make their structures public; users wanted an intuitive interface and fast querying capabilities; the PDB staff wanted ease of maintenance and improved consistency; and everyone wanted better data quality.
Somewhat surprisingly, the multi-site approach to managing the PDB worked in the project’s favor as it took on these challenges. Each RCSB institution has its own role in the project: Rutgers handles incoming data processing, the SDSC is responsible for data distribution and querying, and NIST maintains the physical archive of paper and tapes and plays a role in data uniformity. Even so, the three sites were able to communicate effectively with each other and with the depositor and user communities to ensure a smooth transition to the improved version of the site. The collaboratory approach is “unique in that it works,” quipped Helen Berman of Rutgers, director of the PDB. “You have to know when to do something together and when not to. Cleaning up the data and making the new database required the whole enterprise to move together.”
Berman said that one of the group’s main concerns for the PDB was speeding the process of getting the data in, validated, and out to the community. The average turnaround time for this process is now around two weeks, she said, down from around 120 days several years ago. The PDB has also made a number of software tools available through the www.pdb.org site, including CIFTr, a program that can translate files between mmCIF format and PDB format.
Berman said that the redesigned database and all the tools the RCSB is developing for the resource are built upon the mmCIF dictionary (http://deposit.rcsb.org/mmcif), an ontology of crystallographic terms. The dictionary approach is expected to improve data unification within the PDB as well as help integrate PDB files with other data sources. With this goal in mind, Bourne said the PDB also plans to publish its API so other databases can interface with it.
Bracing for the Data Blitz
The move to mmCIF may appear premature given that current software doesn’t mesh well with the format. But the PDB’s current holdings of over 17,000 structures pale in comparison to the more than 30,000 structures, including some very complex ones like the ribosome, expected to come out of structural genomics research over the next few years, so the move couldn’t come at a better time.
In addition, the project is already working on its next big improvement to handle the increase in data — automating the deposition pipeline so that journal publication and PDB publication occur simultaneously. “[Journal] publication is currently the rate-determining step in the process,” said Bourne. “Why not do both at the same time?” Berman said that the PDB is already working on the software for this stage of the project, which she expects to see online in some form by this time next year.
Also in the works are plans to improve the portability of the resource. The PDB is still too complicated for most organizations to install in house, Bourne said — a fact that limits its practical use for researchers hoping to compare their own structural data to PDB holdings.
In the meantime, however, the PDB staff continues to fine-tune the beta site based on user suggestions as it prepares for its proposed June alpha release, at which time, Bourne said, “we’ll begin the whole testing phase again.”