
Rensselaer Data Mining Duo Puts Protein Structure Prediction on the Informatics Map


Mohammed Zaki and Chris Bystroff, two researchers at Rensselaer Polytechnic Institute, are applying new data mining techniques to the protein structure prediction problem. Zaki, an assistant professor of computer science, and Bystroff, an assistant professor of biology, are collaborating to build a library of protein “contact maps,” two-dimensional renderings of unique three-dimensional tertiary protein structures.

The approach places a protein’s amino acid sequence along the x- and y-axes of a matrix. Interactions between amino acids are plotted on the matrix, resulting in a distinct pattern for each protein that can be manipulated and mined like any other 2D data set. Secondary structures such as alpha helices, beta sheets, and beta turns are revealed as clusters of contacts in the 2D map. Alpha helices, for example, appear as bands along the main diagonal, while beta sheets appear as thicker bands parallel or anti-parallel to the main diagonal. Zaki and Bystroff are compiling a library of contact map profiles based on known structures from the Protein Data Bank that they believe can serve as a useful new protein structure prediction resource.
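As a rough illustration of the idea, and not the Rensselaer team's own code, a contact map can be sketched in a few lines of Python with NumPy. The article does not say how a contact is defined, so the choices here are assumptions: two residues count as "in contact" when their alpha-carbon atoms lie within 8 angstroms of each other, and the function names `contact_map` and `ideal_helix` are hypothetical. The toy helix uses textbook geometry (about 1.5 angstroms of rise and 100 degrees of rotation per residue) rather than coordinates from a real PDB entry.

```python
import numpy as np

def contact_map(coords, cutoff=8.0, min_separation=1):
    """Return a boolean N x N matrix marking residue pairs that are in contact.

    coords: N x 3 array, one 3D position per residue (assumed: alpha carbons).
    cutoff: distance threshold in angstroms (assumed value, not from the article).
    min_separation: smallest sequence separation counted as a contact;
        1 excludes only the trivial self-contacts on the main diagonal.
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise difference vectors via broadcasting, then Euclidean distances.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Sequence separation |i - j| for every pair of residues.
    idx = np.arange(len(coords))
    separation = np.abs(idx[:, None] - idx[None, :])
    return (dist < cutoff) & (separation >= min_separation)

def ideal_helix(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """Alpha-carbon positions for an idealized alpha helix of n residues."""
    angle = np.deg2rad(turn_deg) * np.arange(n)
    return np.column_stack([
        radius * np.cos(angle),
        radius * np.sin(angle),
        rise * np.arange(n),
    ])

helix = ideal_helix(12)
cmap = contact_map(helix)

# Print the map as text: '#' marks a contact, '.' marks no contact.
for row in cmap:
    print("".join("#" if c else "." for c in row))
```

Under these assumptions the printed matrix shows a narrow band hugging the main diagonal, which is the helix signature described above. Contacts far from the diagonal would indicate residues that are distant in sequence but close in space.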

The goal is to use contact map prediction as a first step toward 3D structure prediction. Bystroff’s HMMSTR structure prediction program, a hidden Markov model-based approach that he developed with David Baker, uses the same I-sites library of sequence-structure motifs that underpins Baker’s Rosetta algorithm. The Rensselaer team first uses HMMSTR to predict the local structural elements that make up the contact map, and then adds a data mining layer to capture non-local interactions between the amino acids and provide further insight into the tertiary structure of the protein.

The two are slowly working their way through the PDB in an effort to compile a representative set of “contact rules” for each protein family that can be used to improve the performance of their predictive methods. Just as the I-sites library has been a useful source of common motifs in short, contiguous residues, the new resource would serve as a similar record for non-local interaction patterns.

The library will eventually be made available to the public, but Zaki said the work is still at too early a stage for release. All of Bystroff’s work is available, however, at:

Other researchers are using protein contact maps to aid their structural proteomics work. For example, Gianluca Pollastri and Pierre Baldi at the University of California, Irvine, have developed a protein contact map predictor that is available at:

Zaki and Bystroff’s research, funded under a three-year, $333,928 DOE award, will appear in the IEEE journal, Transactions on Systems, Man and Cybernetics, in early 2003.

— BT
