The Protein Structure Initiative, a public-private venture that began in 2000 to determine the three-dimensional atomic structures of all proteins, has launched a new web-based resource intended to complement existing databases by providing a single access point to all information associated with high-throughput structure-determination experiments.
The PSI Structural Genomics Knowledgebase, which aims to help scientists cross-mine a variety of information compiled by a dozen PSI research groups, went live last month. With a total budget of $2.5 million, it will enable all researchers to study proteins of interest or a particular aspect of structural genomics.
To date, PSI scientists have solved 2,800 protein structures and deposited them in the Protein Data Bank, the primary public database for protein structural information, which now holds 49,000 experimentally determined structures. The PSI Knowledgebase, for its part, will expand beyond the PDB by focusing on the experimental processes that led PSI researchers to obtain their structures, Helen Berman, bioinformaticist and chemist at Rutgers University and director of the PDB, told BioInform.
“The PSI scientists produce, of course, structures, but they also produce methodologies for doing the structures that are very, very rapid,” said Berman, who helped devise the PSI Structural Genomics Knowledgebase and is its director. The idea now is to organize the way information can be shared and mined prior to structure resolution and thus help avoid bottlenecks and scientific cul-de-sacs.
The PSI had always comprised a large number of centers “hacking along” and producing structures, said Bernhard Rupp, founder and CEO at crystallography firm QED Life Science Discoveries and one-time head of drug target crystallography at Lawrence Livermore National Laboratory. However, every center placed its data in a different location, sometimes in a different format, with different identifiers, which was “a complete mess if you wanted to do large-scale data-mining,” he said.
The idea behind the PSI Knowledgebase is to place this data, which is still “relatively incoherent,” into one location so that it can be mined more effectively, said Rupp, who participates in a PSI-supplementing technology grant for protein crystal-harvesting robots but is not part of a PSI center. “That is the way to gain knowledge: you data-mine and you analyze the data.”
The Knowledgebase is expected to be very useful in the experimental design process. For example, Berman said, sometimes a protein resists crystallization. “Through Knowledgebase, someone can type in a sequence, compare what two people have done thus far, compare the approaches and get hints” about ways to address the crystallization problem, she said.
This background information can help optimize a complicated and lengthy experimental process that includes DNA cloning, protein expression, crystallography, and data analysis. “We think it is extraordinary for scientists to … tell others about their work as they are doing it,” said Berman.
“It is one thing they have learned in the PSI: There is a great deal of feedback information analyzing what worked and what didn’t,” said John Norvell, PSI director.
The PSI effort involves twelve research centers and hundreds of scientists across the country, so the new portal is expected to help avoid duplication of research efforts, he said.
So far, PSI researchers have used their individual websites to disseminate information on experimental methods for producing and purifying proteins, software for data analysis, information on robotics, and details of crystallographic systems.
With the launch of the PSI Knowledgebase, this information is now available from one centralized location, Norvell said. Information exchange is especially important in a large-scale project that is building a pipeline to accelerate, even semi-automate, tasks at the same time it is producing experimental data, he noted.
The portal was designed for the broader biology community, as well, and enables searching by protein sequence, keyword, or PDB identifier. Berman and her team decided the portal should just have one query window. “You paste in a sequence and can see all the models that can be generated, all the annotation, [see] if there are structures, [see] whatever structures exist, if there are protocols [and] the information comes to them from various sources,” she said.
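One way to picture a single query window that accepts a sequence, a keyword, or a PDB identifier is a small dispatcher that guesses the input type before routing the query. The Python sketch below is an illustration only, not the portal's actual code: the function name `classify_query`, the minimum sequence length, and the returned labels are all assumptions. It relies on the fact that classic PDB identifiers are four characters long and begin with a digit.

```python
import re

# Classic PDB identifiers are four characters: a digit followed by three alphanumerics.
PDB_ID = re.compile(r"^[0-9][A-Za-z0-9]{3}$")
# One-letter codes for the 20 standard amino acids, plus X for "unknown".
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYX")
# Assumed cutoff so that short keywords are less likely to be mistaken for sequences.
MIN_SEQUENCE_LENGTH = 20


def classify_query(text: str) -> str:
    """Guess whether a query is a PDB ID, a protein sequence, or a keyword."""
    query = "".join(text.split())  # drop spaces and line breaks from pasted input
    if PDB_ID.match(query):
        return "pdb_id"
    if len(query) >= MIN_SEQUENCE_LENGTH and set(query.upper()) <= AMINO_ACIDS:
        return "sequence"
    return "keyword"
```

With these assumptions, `classify_query("1abc")` returns `"pdb_id"` and `classify_query("kinase")` returns `"keyword"`. The heuristic is deliberately crude: a long keyword made up only of amino-acid letters would be treated as a sequence, so a real portal would likely need a fallback or a way for the user to override the guess.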
The group decided to use standard open source software as the foundation for the portal, arranged in a multi-tier architecture: Java/JSP for the presentation layer, Java/JDBC as middleware, and MySQL for the database. “We are largely integrating existing methods from PSI centers and existing tools for annotation,” Berman said. She and her team anticipate developing additional analysis and presentation tools in the future.
Berman manages the Knowledgebase architecture, but each PSI center has secure access to update information. The portal includes seven “modules” for target selection, experimental data tracking, a materials repository, models, annotation, metrics, and technology.
Berman and her team set up a network of 15 researchers to curate the modules. “We had to create a kind of sandbox where we put all of the information in a way so we could grab it when we needed it,” she said.
The Knowledgebase aggregates a number of disparate existing resources, such as PSI’s own PepCDB and TargetDB databases, several modeling sites, and more than 30 highly curated annotation databases such as the Structural Classification of Proteins, or SCOP, database; the Class, Architecture, Topology, and Homology, or CATH, protein structure-classification system; and the Pfam collection of protein families.
“We needed to work out exchange protocols with each site” in order to integrate these databases into the portal, she said.
The idea of a PSI web portal has been floated since the early stages of the PSI revealed the need for cross-mining this type of information. The National Institute of General Medical Sciences, which funds the PSI, originally issued a request for applications for the resource in 2005 [BioInform 11-10-06], but the official go-ahead to start the project didn’t come until this past summer.
In some cases, the development of the Knowledgebase required some difficult architectural decisions along with the development of data exchange protocols. For example, the PSI team decided not to press all protein structure information generated by computational methods into a new database, and instead opened the door to the modeling world by linking out to external resources.
“We provide a portal to models that the [PSI] groups have made using whatever algorithms they decided [to use],” said Berman. “There are also models produced by some servers,” she said, mentioning SwissModel in Switzerland and ModBase, the modeling server at the University of California, San Francisco.
New Models, New Methods
There are many modeling groups because no one model is perfect for every question, Rupp said, noting that researchers might still want to do their own modeling even if other models have already been generated for a protein. Even then, he said, the PSI portal could be useful: if a protein of interest happens to be a PSI target, the researcher can use the Knowledgebase to check on the target’s progress.
“If it is apparently already in the refinement stage, I may start and contact the people working on this to let me know when they have the coordinates,” said Rupp. “Then I don’t need a model if in a few weeks I can have the true crystallographic coordinates.”
Consulting a model after the structure is completed can be helpful, too. “If a structure is found that has been modeled before, it helps you to find the mistakes a model made and adjust the modeling methods,” said Rupp.
Lance Stewart of DeCode Biostructures, a wholly owned X-ray crystallography subsidiary of DeCode Genetics, said that the PSI Knowledgebase is a good way to track overall PSI progress and can help people identify solutions to bottlenecks in technology development.
“People might not know where to even find information such as this,” said Stewart, who is also principal investigator for one of the PSI’s special research centers, the Accelerated Technologies Center for Gene to 3D Structure.
Stewart noted that many structural genomics methods are still emerging and may not yet be described in the literature. “Hopefully this will allow people to understand what is available to them even before it has made it into press in the full publication,” he said.
For example, one of his bioinformatics tools began as synthetic gene-design software called Gene Composer. In the context of PSI the software is evolving to enable better construct design for expression systems. “This is an integrated package now that is more recently being used to send information to liquid handlers that know how to move the pieces of DNA that make the constructs,” Stewart said.
Now, he said he wants to disseminate the tool as quickly as possible. It is available for PSI users and academics through the PSI Knowledgebase, but it is also being commercialized by Emerald Biosystems, a DeCode sister company.
The PSI Knowledgebase can also help researchers find annotations for proteins of interest, a facet of particular importance because many structures have unknown function. “This is extremely tricky business,” said Rupp. “Even trying to find out if a protein is structurally related to another one is completely non-trivial.”
Putting the Pieces Together
The portal integrates several nascent and growing databases developed within the PSI, including TargetDB, a protein target registration database with status and tracking information, and PepCDB, which contains experimental information on protein expression, purification, and crystallization.
“In 2007 it was decided the Knowledgebase would be the integrating tool for all of these ideas,” Berman said. “All the pieces were in place and what needed to happen was to integrate it all to make a single site where you could get all this information without having to go to lots of different places.”
Rupp noted that the first phase of the PSI project, which ended in 2005, drew criticism because “there were no consistent metrics for data deposition, there was no large knowledgebase where all of this data together could be mined in a meaningful fashion.”
Today, some PSI results are converted to data in different ways, leading to inconsistent metrics. For example, Rupp said that the PSI groups use several different metrics to describe how well a protein crystallizes. “Some groups have a metric that goes from 0 to 10, a relatively fine scale; some divided it only into ‘do they crystallize’ or ‘do they never crystallize.’ So how do you consolidate data like this?”
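Rupp's question can be made concrete with a small sketch. One possible approach, shown here in Python purely as an illustration, is to map every center's score onto the coarsest scale that all centers can support, in this case a yes/no outcome. The record layout, the scale labels, the target names, and the threshold of 5 on the 0-to-10 scale are all hypothetical; nothing in the article says how the Knowledgebase actually reconciles these metrics.

```python
from dataclasses import dataclass


@dataclass
class CrystallizationReport:
    target_id: str  # hypothetical target identifier
    scale: str      # "0-10" or "binary" (assumed labels for the two reporting styles)
    value: float    # score on the reporting center's own scale


# Assumed cutoff: fine-scale scores at or above this count as "crystallizes".
FINE_SCALE_THRESHOLD = 5


def crystallizes(report: CrystallizationReport) -> bool:
    """Reduce a center-specific score to a shared yes/no outcome."""
    if report.scale == "0-10":
        if not 0 <= report.value <= 10:
            raise ValueError(f"score out of range for {report.target_id}: {report.value}")
        return report.value >= FINE_SCALE_THRESHOLD
    if report.scale == "binary":
        return bool(report.value)
    raise ValueError(f"unknown scale: {report.scale}")


reports = [
    CrystallizationReport("target-A", "0-10", 8),
    CrystallizationReport("target-B", "0-10", 2),
    CrystallizationReport("target-C", "binary", 1),
]
consolidated = {r.target_id: crystallizes(r) for r in reports}
```

Collapsing to the coarsest scale makes the centers comparable but discards the detail in the finer scores, and the choice of threshold is itself a judgment call. A gentler alternative would store the original score and scale alongside the consolidated value so that nothing is lost.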
Prior to the PSI Knowledgebase, a researcher would have had to burrow through diverse sources and look at each individual center’s database. The portal is designed to alleviate this problem by serving as a central point with common guidelines for people to present data in a consistent way for the whole community.
Rupp noted, however, that the trick to making the resource a success is to obtain consistency without coercion. “It is difficult to impose rigid rules on independent research centers,” he said.