One of the new features in the recent Ensembl 0.8.0 release is protein family clustering, which lets users explore proteins, in the human genome or in other genomes, that belong to the same family as a protein of interest. The feature is made possible by the Tribe clustering algorithm developed by Anton Enright and Christos Ouzounis of the computational genomics group at the European Bioinformatics Institute.
As part of his PhD project to detect and analyze protein families in complete genomes, Enright said he found that most clustering algorithms for biology were of little use for a protein set as large as the human genome. They were not fully automatic and required manual intervention to accurately predict families, he said.
Enright turned to the Markov Clustering algorithm (MCL) developed by Stijn Van Dongen at the National Research Institute for Mathematics and Computer Science in the Netherlands to detect clusters of related items in complex graphs. Enright found that if he represented biological sequence similarity in a graph, the MCL algorithm could cluster proteins into families.
The resulting Tribe-MCL package was developed in close collaboration with Van Dongen. It compares proteins against each other using a sequence similarity search tool such as Blast, and then represents the similarity scores from this analysis in a Markov matrix. The matrix holds the probabilities of transition from one protein to another: two highly similar proteins have a high transition probability, and two dissimilar proteins have none. The MCL algorithm then models ‘flow’ through this matrix to find clusters of related proteins.
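The idea of modeling flow through a Markov matrix can be illustrated with a minimal sketch. This is not the Tribe-MCL code. It is a toy Markov clustering loop that assumes the textbook alternation of expansion (squaring the matrix) and inflation (raising entries to a power and renormalizing). The function names, the inflation value of 2.0, the self-loop weighting, and the five-protein similarity matrix are all illustrative assumptions.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 so entries read as transition probabilities."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: square the matrix, which corresponds to two-step random walks."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, power):
    """Inflation: raise entries to a power and renormalize, favoring strong flow."""
    return normalize_columns([[x ** power for x in row] for row in m])


def markov_cluster(similarity, inflation=2.0, iterations=30, eps=1e-6):
    """Toy Markov clustering; returns clusters as sorted lists of indices."""
    n = len(similarity)
    m = [list(row) for row in similarity]
    # Assumption: give each protein a self-loop as heavy as its strongest edge.
    for j in range(n):
        m[j][j] = max(similarity[i][j] for i in range(n)) or 1.0
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)

    # Read out clusters: proteins still linked by surviving flow belong together.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > eps:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical similarity scores: proteins 0-2 resemble one another, as do 3-4,
# with one weak link between proteins 2 and 3.
toy = [
    [0.0, 0.9, 0.9, 0.0, 0.0],
    [0.9, 0.0, 0.9, 0.0, 0.0],
    [0.9, 0.9, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 0.0],
]
print(markov_cluster(toy))
```

In this toy case the weak link between the two groups carries little flow, so inflation starves it and the two groups separate. A production tool would use sparse matrices and a convergence check rather than a fixed iteration count.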
“Because of the way the algorithm models flow through the graph, it is not led astray by multi-domain proteins or by proteins containing promiscuous domains,” Enright said.
The algorithm turned out to be a good fit for Ensembl, which had been able to predict protein domains, but not protein families. A set of 100,000 proteins derived from predicted Ensembl peptides as well as known proteins from Swiss-Prot and SPTrembl was compared against itself using Blast and passed through the algorithm. The complete analysis generated 15 million sequence similarities and over 10,000 protein families, Enright said.
Enright next developed another algorithm to detect the longest common substrings between the annotations of known Swiss-Prot proteins. This provided an automatic method to transfer a consensus annotation from the known proteins to the unknown ones.
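A rough sketch of the longest-common-substring idea follows. The article does not describe how Enright's annotation algorithm scores or selects substrings, so this version simply returns the longest character string shared by every annotation in a family. The function name and the example descriptions are hypothetical.

```python
def consensus_annotation(annotations):
    """Return the longest substring present in every annotation, or "" if none."""
    if not annotations:
        return ""
    shortest = min(annotations, key=len)
    # Try candidate substrings of the shortest annotation, longest first.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in text for text in annotations):
                return candidate.strip()
    return ""


# Hypothetical descriptions for the known members of one family.
family = [
    "SERINE/THREONINE PROTEIN KINASE 1",
    "PROTEIN KINASE C",
    "PUTATIVE PROTEIN KINASE",
]
print(consensus_annotation(family))
```

Requiring the substring to appear in every known member is a strict choice. A single oddly worded description can shrink the consensus to a fragment, and a generic shared word can stand in for a real function.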
All the families and their annotations are now available through Ensembl. “The family annotations are very useful for quickly deciding a possible function for any given human gene,” said Enright.
Enright said that initial feedback on the feature has been positive, although he noted that the automatic annotation algorithm needs some improvement. “Some families have been ‘over-annotated’ and some ‘under-annotated’,” he said.
“To my knowledge this is the largest purely automatic protein family analysis of this type to be performed,” Enright said, “and I hope that users of Ensembl will find it useful.”
Enright, Van Dongen, and Ouzounis are currently preparing a paper describing the algorithm, which they expect to submit in the next few weeks.