Rensselaer Data Mining Duo Puts Protein Structure Prediction on the Informatics Map


Mohammed Zaki and Chris Bystroff, two researchers at Rensselaer Polytechnic Institute, are applying new data mining techniques to the protein structure prediction problem. Zaki, an assistant professor of computer science, and Bystroff, an assistant professor of biology, are collaborating to build a library of protein “contact maps,” two-dimensional renderings of unique three-dimensional tertiary protein structures.

The approach places a protein’s amino acid sequence along the x- and y-axes of a matrix. Interactions between amino acids are plotted on the matrix, resulting in a distinct pattern for each protein that can be manipulated and mined like any other 2D data set. Secondary structures such as alpha helices, beta sheets, and beta turns are revealed as clusters of contacts in the 2D map. Alpha helices, for example, appear as bands along the main diagonal, while beta sheets appear as thicker bands parallel or anti-parallel to the main diagonal. Zaki and Bystroff are compiling a library of contact map profiles based on known structures from the Protein Data Bank that they believe can serve as a useful new protein structure prediction resource.
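As a rough illustration of how such a map can be built, the sketch below marks two residues as "in contact" when their alpha-carbon positions lie within a distance cutoff. The 8-angstrom cutoff, the function names, and the idealized helix geometry are assumptions made for this example; the article does not say how Zaki and Bystroff define a contact.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Return an n x n symmetric 0/1 matrix.

    Entry [i][j] is 1 when residues i and j are distinct and their
    coordinates lie within `cutoff` angstroms (an assumed threshold).
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = 1
    return cmap

def ideal_helix(n_residues):
    """Toy alpha-carbon trace of an idealized alpha helix:
    2.3 A radius, 100 degrees of rotation and 1.5 A of rise per residue."""
    return [
        (2.3 * math.cos(math.radians(100 * k)),
         2.3 * math.sin(math.radians(100 * k)),
         1.5 * k)
        for k in range(n_residues)
    ]

helix_map = contact_map(ideal_helix(12))

# Print the map as text: '#' marks a contact, '.' marks no contact.
for row in helix_map:
    print("".join("#" if cell else "." for cell in row))
```

On this toy helix, contacts occur only between residues up to four positions apart, so the printout shows a narrow band hugging the main diagonal, consistent with the helix signature described above.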

The goal is to use contact map prediction as a first step toward 3D structure prediction. Bystroff’s HMMSTR structure prediction program, a hidden Markov model-based approach that he developed with David Baker, uses the same I-sites library of sequence-structure motifs that underpins Baker’s Rosetta algorithm. The Rensselaer team first uses HMMSTR to predict the local structural elements that make up the contact map, and then adds a data mining layer to capture non-local interactions between the amino acids and provide further insight into the tertiary structure of the protein.

The two are slowly working their way through the PDB in an effort to compile a representative set of “contact rules” for each protein family that can be used to improve the performance of their predictive methods. Just as the I-sites library has been a useful source of common motifs in short, contiguous residues, the new resource would serve as a similar record for non-local interaction patterns.
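The distinction between local and non-local contacts can be made concrete with a small sketch that splits contact pairs by how far apart the two residues sit in the sequence. The `min_separation` threshold, the function name, and the toy hairpin data are hypothetical choices for illustration, not values or methods taken from the researchers' work.

```python
def split_by_separation(contacts, min_separation=4):
    """Partition (i, j) contact pairs into local and non-local lists.

    `min_separation` is a hypothetical threshold on |i - j|; pairs at or
    beyond it are treated as non-local.
    """
    local_pairs, nonlocal_pairs = [], []
    for i, j in sorted(contacts):
        if abs(i - j) >= min_separation:
            nonlocal_pairs.append((i, j))
        else:
            local_pairs.append((i, j))
    return local_pairs, nonlocal_pairs

# Toy 10-residue hairpin: chain neighbours plus three cross-strand pairs.
chain_neighbours = {(i, i + 1) for i in range(9)}
cross_strand = {(0, 9), (1, 8), (2, 7)}

local_pairs, nonlocal_pairs = split_by_separation(chain_neighbours | cross_strand)
print(nonlocal_pairs)  # [(0, 9), (1, 8), (2, 7)]
```

In this toy data the non-local pairs all satisfy i + j = 9, so on a contact map they would fall along a line perpendicular to the main diagonal, the kind of long-range pattern that a library of contact rules would need to record.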

The library will eventually be made available to the public, but Zaki said the work is at too early a stage to release. All of Bystroff’s work is available, however, at:

Other researchers are using protein contact maps to aid their structural proteomics work. For example, Gianluca Pollastri and Pierre Baldi at the University of California, Irvine, have developed a protein contact map predictor that is available at:

Zaki and Bystroff’s research, funded under a three-year, $333,928 DOE award, will appear in the IEEE journal, Transactions on Systems, Man and Cybernetics, in early 2003.

— BT
