Genome sequences are being churned out at an incredible rate these days, but functional annotations of genes and proteins lag behind. Likewise, structural genomics projects produce 3D protein structures en masse, but a structure does not always give a clue to the protein’s function. To fill this gap, researchers at the San Diego Supercomputer Center, the Keck Graduate Institute, and the Burnham Institute received a five-year, $5.4 million grant from the National Institute of General Medical Sciences this month to build a public resource for “systematic protein annotation and modeling,” christened with the requisite (and rather unfortunate) acronym of SPAM.
Key to the project’s goal of providing functional annotation will be “improved algorithms for sequence comparison, sequence-structure comparison and structure-structure comparison,” said project head Philip Bourne, director for integrative biosciences at SDSC and professor of pharmacology at UCSD. The result will be a core resource of databases containing annotated sequences and predicted structures for proteins from many genomes, plus software and visualization tools. No other public effort is currently creating a resource on this scale, Bourne added.
In contrast to other databases that provide annotations for proteins, like SwissProt or PIR, SPAM will largely contain putative annotations based on comparisons, not experimental data. “We already have pipelines [of methods] that take open reading frames and do putative annotation on that data…and we are putting these pipelines together,” said Bourne.
About 10 people will work full-time on the SPAM resource. Gregory Dewey and David Wild at the KGI will focus on new methods for alignment using a statistical mechanics approach; Wild will also develop new methods for protein-fold and remote homolog recognition using a Bayesian network model. Adam Godzik at the Burnham Institute will concentrate on improving homology modeling tools for models with varying degrees of sequence similarity to known structures. Bourne and his colleague Ilya Shindyalov will improve database, query, and visualization tools, as well as the combinatorial extension algorithm for pairwise and multiple structure alignments.
Bourne and his colleagues, in collaboration with Ceres, a Los Angeles-based plant genomics company, have already created an Arabidopsis thaliana protein database, which they made available last month at http://arabidopsis.sdsc.edu. Combining the results from Blast-Wu, Psi-Blast, 123D+, Coils, TmHMM, and SignalP, they modeled domain structures for more than 25,000 predicted Arabidopsis proteins. “The large-scale plan is to do that level of annotation and modeling on all known genomes,” Bourne said.
But Bourne’s long-term plans reach beyond SPAM, which is expected to come online within two months at http://spam.sdsc.edu. Bourne is currently writing grant applications to build a resource called “Encyclopedia of Life,” which would integrate SPAM with other forms of data.
— JK