Mohammed Zaki and Chris Bystroff, two researchers at Rensselaer Polytechnic Institute, are applying new data mining techniques to the protein structure prediction problem. Zaki, an assistant professor of computer science, and Bystroff, an assistant professor of biology, are collaborating to build a library of protein “contact maps,” two-dimensional renderings of unique three-dimensional tertiary protein structures.
The approach places a protein’s amino acid sequence along the x- and y-axes of a matrix. Interactions between amino acids are plotted on the matrix, resulting in a distinct pattern for each protein that can be manipulated and mined like any other 2D data set. Secondary structures such as alpha helices, beta sheets, and beta turns are revealed as clusters of contacts in the 2D map. Alpha helices, for example, appear as bands along the main diagonal, while beta sheets appear as thicker bands parallel or anti-parallel to the main diagonal. Zaki and Bystroff are compiling a library of contact map profiles based on known structures from the Protein Data Bank that they believe can serve as a useful new protein structure prediction resource.
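As a rough illustration of the idea (not the Rensselaer team’s own code), a contact map can be built from residue coordinates by marking every pair of residues that lies within some distance cutoff. The 8-angstrom cutoff, the function names, and the idealized helix geometry in this Python sketch are all assumptions chosen for the example.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Return a symmetric boolean matrix; entry [i][j] is True when
    residues i and j lie closer than `cutoff` angstroms (assumed value).
    Self-contacts on the diagonal are left as False."""
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap

def ideal_helix(n_residues, radius=2.3, rise=1.5, turn_deg=100.0):
    """Hypothetical helper: approximate C-alpha positions of an idealized
    alpha helix (about 100 degrees and 1.5 angstroms of rise per residue)."""
    coords = []
    for k in range(n_residues):
        angle = math.radians(turn_deg * k)
        coords.append((radius * math.cos(angle),
                       radius * math.sin(angle),
                       rise * k))
    return coords

# Build the map for a 12-residue toy helix and print it as a grid.
cmap = contact_map(ideal_helix(12))
for row in cmap:
    print("".join("#" if hit else "." for hit in row))
```

In the printed grid, the toy helix shows up as a narrow band of contacts hugging the main diagonal, which is the kind of pattern the article describes for alpha helices.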
The goal is to use contact map prediction as a first step toward 3D structure prediction. Bystroff’s HMMSTR structure prediction program, a hidden Markov model-based approach that he developed with David Baker, uses the same I-sites library of sequence-structure motifs that underpins Baker’s Rosetta algorithm. The Rensselaer team first uses HMMSTR to predict the local structural elements that make up the contact map, and then adds a data mining layer to capture non-local interactions between the amino acids and provide further insight into the tertiary structure of the protein.
The two are slowly working their way through the PDB in an effort to compile a representative set of “contact rules” for each protein family that can be used to improve the performance of their predictive methods. Just as the I-sites library has been a useful source of common motifs in short, contiguous residues, the new resource would serve as a similar record for non-local interaction patterns.
The library will eventually be made available to the public, but Zaki said the work is at too early a stage for release. All of Bystroff’s work is available, however, at: isites.bio.rpi.edu.
Other researchers are using protein contact maps to aid their structural proteomics work. For example, Gianluca Pollastri and Pierre Baldi at the University of California, Irvine, have developed a protein contact map predictor that is available at: promoter.ics.uci.edu/BRNN-PRED.
Zaki and Bystroff’s research, funded under a three-year, $333,928 DOE award, will appear in the IEEE journal, Transactions on Systems, Man and Cybernetics, in early 2003.
— BT