NEW YORK (GenomeWeb) – Researchers at the University of Washington have developed a protein modeling approach incorporating metagenomic sequence data to improve structural predictions.
In a study published late last week in Science, the researchers demonstrated the usefulness of metagenomic data combined with contact-based structure matching and Rosetta structure calculations for protein modeling, and indicated that such sequencing could prove a valuable and rapidly growing source of data for powering protein structure analyses.
As the team noted, no structural information currently exists for roughly a third (5,211) of the 14,849 protein families. Recently, however, "the increase in the number of known amino acid sequences had enabled the accurate prediction of residue-residue contacts by using evolutionary data," they wrote, adding that these predictions "have been used for a wide variety of protein modeling efforts."
In one such effort, structural predictions using evolutionary data and Rosetta structural prediction software were made for 58 large protein families. Structures for six of these families were subsequently determined experimentally. Comparing these structures with the predictions revealed certain limitations, but also showed that "Rosetta modeling guided by co-evolutionary constraints generates accurate models," the authors wrote.
Models of this accuracy, they added, "would have broad utility for framing biological hypotheses about function and interpreting mutational data, as well as for guiding experimental structure determination."
However, in an analysis of how much sequence data is required to construct accurate Rosetta-based models, the UW team determined that for 92 percent of protein families of currently unknown structure, existing sequence data is not sufficient to enable accurate modeling.
Given this limitation, the researchers looked beyond conventional sequencing datasets to metagenome sequencing efforts, in which complex samples containing multiple different organisms are analyzed. As they noted, including metagenomic sequencing data increases the amount of sequence data available for some protein families by up to 100-fold, which could significantly increase the proportion of protein families of unknown structure amenable to co-evolution-based structural predictions.
Using metagenomic data, the team generated models for 614 protein families, providing predicted structures for roughly 12 percent of the protein families known to be without any structural information. They also noted that while the study manuscript was being prepared, crystal structures for proteins from five of the 614 families were published, and that these structures agreed closely with their predictions.
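The proportions cited in this article can be checked with a few lines of arithmetic. The sketch below uses only the family counts reported above; the variable and function names are illustrative and do not come from the study.

```python
# Family counts as reported in the article (names here are illustrative).
TOTAL_FAMILIES = 14_849
FAMILIES_WITHOUT_STRUCTURE = 5_211
FAMILIES_MODELED = 614


def percent(part: int, whole: int) -> float:
    """Return `part` expressed as a percentage of `whole`."""
    return 100.0 * part / whole


# Share of all protein families that lack structural information.
share_without_structure = percent(FAMILIES_WITHOUT_STRUCTURE, TOTAL_FAMILIES)

# Share of those structure-less families covered by the new models.
share_modeled = percent(FAMILIES_MODELED, FAMILIES_WITHOUT_STRUCTURE)

print(f"Without structure: {share_without_structure:.1f}% of all families")
print(f"Newly modeled: {share_modeled:.1f}% of families without structure")
```

The first figure comes out at about 35 percent, consistent with "roughly a third," and the second at just under 12 percent.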
Furthermore, they wrote that extrapolating from their data "suggests that in several years the majority of [protein] families will have sufficient number of sequences for accurate structure modeling."