A new homology prediction method developed by researchers at Carnegie Mellon University has raised questions about the applicability of commonly used sequence similarity tools such as Blast for analyzing the evolution of multidomain proteins.
The method, called Neighborhood Correlation, was developed specifically to deal with the challenge of multidomain proteins, which are proteins composed of multiple sequence segments. While these proteins represent around 40 percent of the proteome in metazoans, they present a hurdle for current homology analysis tools because it is difficult to determine whether a common sequence is the result of shared ancestry or of domain insertion.
Neighborhood Correlation, described in a recent PLoS Computational Biology paper, relies on a sequence similarity network that is weighted to give gene duplication and domain insertion very different “neighborhood structures,” which enables the method to distinguish true homologs from domain-only matches, according to the authors.
In the paper, Dannie Durand, a computational biologist at Carnegie Mellon, and colleagues demonstrated that the method outperformed sequence similarity methods like Blast and PSI-Blast against a curated benchmark data set of sequences known to share common ancestry.
“The paper really tackles a fundamental problem that people have been hoping to avoid,” said David Haussler, director of biomolecular engineering at the University of California at Santa Cruz and a Howard Hughes Medical Institute investigator. “No one has really faced it head-on before so I commend [Durand] for [that]. … She puts her method up against some others and demonstrates better performance.”
Haussler noted that current methods either filter out multidomain proteins or ignore them, which can lead to spurious results. “If the same domain is inserted into two different proteins, that doesn’t make them related — it makes the little domain part of the proteins related, but it doesn’t make the proteins as a whole related. And the aligner will blissfully align those two things and suggest that these two proteins might be related,” he said.
Neighborhood Correlation takes a geographical view, looking at the genomic neighborhood, Durand explained to BioInform. “Basically we make a network … in which every dot or node is a sequence … a line between two dots means there is a meaningful Blast score,” she said. “If the neighborhoods are similar, the genes are related, if they are not similar, we say the genes aren’t related.”
The method adds a step to a normal Blast alignment in order to look at genes in this geographic context. The scientists also developed a visualization tool for their method but have not published it yet as they are still working “the kinks out,” said Durand. She said that this visualization tool is built on Google Earth and that she plans to make it generally available for researchers to navigate their data.
The genomic network is organized such that it tells the evolutionary history of the genes in a given region, enabling scientists to ask other questions concerning homology, Durand said.
She added that Neighborhood Correlation expands the possibilities for the genomic analysis of modular gene domains by distinguishing multidomain homologs from unrelated sequence pairs that share a domain. It adds a statistical evaluation step to alignment with Blast or PSI-Blast and “gives pairwise scores … [offering] an additional parameter for triage.”
By looking geographically at this network, “we are able to separate pairs that are related and pairs that aren’t, where the more traditional methods haven’t been able to do that,” Durand said.
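The network idea Durand describes can be illustrated with a toy example. The Python sketch below is not the authors' implementation: it assumes, purely for illustration, that a pair's neighborhood score is the Pearson correlation between the two sequences' rows of Blast scores against every sequence in the database, with missing hits counted as zero. The sequence names, the bit scores, and the functions `pearson` and `neighborhood_score` are all hypothetical.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length score vectors (0.0 if either is constant)."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)

def neighborhood_score(hits, x, y, ids):
    """Correlate the score rows of x and y over every sequence in ids.

    hits[a][b] is an illustrative Blast bit score; a missing entry means
    no significant hit and is treated as 0.0 (an assumption of this sketch).
    """
    row_x = [hits.get(x, {}).get(i, 0.0) for i in ids]
    row_y = [hits.get(y, {}).get(i, 0.0) for i in ids]
    return pearson(row_x, row_y)

# Hypothetical database: four kinase-like sequences and three adhesion-like ones.
ids = ["kinA", "kinB", "kinC", "kinD", "adhE", "adhF", "adhG"]

# Invented scores, self-hits included. kinA hits kinB (120) and adhE (115)
# about equally well, so the direct scores alone cannot separate the two pairs.
hits = {
    "kinA": {"kinA": 300, "kinB": 120, "kinC": 110, "kinD": 100, "adhE": 115},
    "kinB": {"kinB": 300, "kinA": 120, "kinC": 105, "kinD": 95},
    "adhE": {"adhE": 300, "kinA": 115, "adhF": 130, "adhG": 125},
}

nc_homolog = neighborhood_score(hits, "kinA", "kinB", ids)      # shared neighbors
nc_domain_only = neighborhood_score(hits, "kinA", "adhE", ids)  # disjoint neighbors

print(f"kinA vs kinB: {nc_homolog:.2f}")
print(f"kinA vs adhE: {nc_domain_only:.2f}")
```

In this toy data the two candidate pairs have nearly identical direct scores, yet only the pair whose members hit the same set of other sequences receives the higher neighborhood score.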
“Users who will find this [method] most useful are people who have to do large-scale comparison, [who] can’t look at every pair individually, and [need to] make a judgment call,” she said. Scientists running Blast will find “the additional cost of running the neighborhood score is very low,” she said.
Building a Benchmark
To test the method, the Carnegie Mellon team created a hand-curated benchmark data set of 1,577 sequences from 20 families of known homology, covering a range of protein functional categories such as neural development, immune response, signal transduction, and enzymes. They created two sets of sequence pairs, constructing a test set of pairs with 853,465 known positive examples for homology and 40,459,204 negative ones.
The goal in building the benchmark was to have a trusted set of both known sequence pairs that are homologous and sequence pairs that do not share ancestry. The Carnegie Mellon team set out to create it so that it would test a range of homology detection problems, in single-domain as well as multidomain families.
The evidence for homology in the set was curated from the scientific literature, mainly by Nan Song, the study’s first author.
“Although comprehensive datasets are available for testing methods for predicting homology of individual domains, we are unaware of any other gold-standard dataset of known multidomain families with variable domain architectures,” the authors wrote in the paper.
“My prediction is that the lasting contribution of this paper will be from establishing this benchmark dataset, defining the problem rigorously, identifying it, and setting up a framework that others, I am sure, will follow, carefully testing their algorithms,” said UCSC’s Haussler.
In a comparison using the entire benchmark data set, the scientists state that Neighborhood Correlation “dramatically” outperformed Blast, PSI-Blast, and Domain Architecture Comparison, or DAC, a method that compares sequences based on their domains.
In a subset of the benchmark set, the kinase family, Neighborhood Correlation yielded better results than Blast and PSI-Blast and slightly worse results than DAC.
For individual protein families, Neighborhood Correlation outperformed all three methods, perfectly classifying 12 of the 20 families. Blast and DAC perfectly classified seven families, and PSI-Blast perfectly classified eight.
PSI-Blast did well with single-domain families but performed poorly on complex multidomain families and sequences with promiscuous domains, which are domains that move around a great deal, the authors noted.
The scientists believe Neighborhood Correlation delivers the most reliable and consistent performance on large heterogeneous datasets, and is “particularly well suited to automated genome-scale analyses which require that a single classification threshold be suitable for the vast majority of sequence pairs in a genomic dataset,” they wrote.
As an example of the limitations of sequence-similarity methods, the authors cite a homologous pair of proteins, PDGFRB and PRKG1B, and a pair that shares domains but does not share common ancestry, PDGFRB and NCAM2.
Based on pairwise alignment, both sets of proteins appear to have similar properties in terms of Blast scores, alignment length, and the number of shared domains, results that would leave such methods unable to distinguish the related pair from the unrelated pair. However, the neighborhoods of these pairs in the weighted sequence similarity network are “very different,” the authors note. For example, the shared neighborhood of the kinase homologs PDGFRB and PRKG1B is 779 sequences and the pair has a Neighborhood Correlation score of 0.65, while the shared neighborhood for PDGFRB and NCAM2 is 242 sequences and the pair has an NC score of 0.29.
“Unlike sequence comparison, this clear difference in neighborhood structure can be used to recognize multidomain homology,” the authors write.
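The single-threshold idea the authors mention can be sketched with the two NC scores reported above. The Python snippet below is only an illustration: the cutoff of 0.5 and the helper names `classify_pairs` and `single_threshold_exists` are assumptions made for this example, not values or functions from the paper.

```python
def classify_pairs(nc_scores, threshold):
    """Call a pair homologous when its score meets or exceeds the threshold."""
    return {pair: score >= threshold for pair, score in nc_scores.items()}

def single_threshold_exists(nc_scores, truth):
    """True if every known homologous pair outscores every known unrelated pair."""
    positives = [nc_scores[p] for p in nc_scores if truth[p]]
    negatives = [nc_scores[p] for p in nc_scores if not truth[p]]
    return min(positives) > max(negatives)

# NC scores as reported in the article for the two example pairs.
reported = {
    ("PDGFRB", "PRKG1B"): 0.65,  # kinase homologs
    ("PDGFRB", "NCAM2"): 0.29,   # share domains but not ancestry
}

# Known relationships for the same pairs, as described in the article.
truth = {
    ("PDGFRB", "PRKG1B"): True,
    ("PDGFRB", "NCAM2"): False,
}

THRESHOLD = 0.5  # hypothetical cutoff chosen only for this example

calls = classify_pairs(reported, THRESHOLD)
for pair, is_homolog in calls.items():
    print(pair, "homolog" if is_homolog else "not homolog")
```

Any cutoff between the two reported scores would give the same calls for this pair of examples. A genome-scale analysis would need one value that works across many families at once.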
“Maybe pairwise comparison is just never going to tell us [about homology] in a consistent way,” Durand said. “So let’s not just ask these pairs, let’s ask their friends, too.”
Not Always Family
Identifying homologous genes that encode proteins is a task with many pitfalls: similar sequences may be the result of convergent evolution, in which they have evolved independently to fulfill the same function. That doesn’t automatically mean relatedness. Or proteins might be related but have become so dissimilar that their common ancestry is almost undetectable — a problem that scientists address by looking at protein structure. “Even after the sequences no longer have any similarity, the [protein] structures do,” said Durand.
Multidomain proteins were long considered an oddity, but whole-genome analysis has revealed how prevalent they are — making up approximately 40 percent of the metazoan proteome.
Domains shuffle over the course of evolutionary history, and some move around extensively and are called promiscuous domains. “Those are the domains that are inserted [into genomes] a lot,” Durand said.
These promiscuous domains can lead to significant sequence similarity but carry little information about gene homology, the scientists point out in their paper. Even in protein families with conserved domain architectures, “promiscuity can confound reliable detection of homologs,” the team wrote.
Domain shuffling in general opens a can of worms when comparing sequence pairs in search of an ancestral genome. As UCSC’s Haussler explained, ideally one might like to work backwards in time to an ancestral protein. But not all of the parts of a protein share the same ancestral history. “You’re putting your head in the sand if you treat all homologies as being simple in the sense that they cover the entire extent of the protein,” he said.
While this is a “simple picture that most people have tried to adopt,” it “ignores the fact that pieces of it can have been inserted from other places independently in one lineage or the other,” he said.
Neighborhood Correlation approaches the homology challenge of multidomain proteins by ruling out pairs of unrelated genes that happen to have the same inserted domain, because such pairs don’t help build a comparative map, said Durand.
If genes look like homologs, then the genes in the region to the left and right might also be homologous, she said. “Basically, what you are saying when you do that, is that a gene is a witness for the history of the genomic region around it,” Durand said.
“I don’t think the neighborhood correlation is the only way to approach this, … but the importance of the paper to my mind is that Dannie Durand has really focused us on the critical problem with these protein homologies,” said Haussler.
Martijn Huynen, a researcher at the Center for Molecular and Biomolecular Informatics at the Nijmegen Centre for Molecular Life Sciences who studies the evolution of genomes and protein families, agreed that the Neighborhood Correlation approach could be useful. However, he is doubtful about trying “to push these proteins into classification schemes, [such as] ‘these are homologous genes and these are not homologous genes,’” he said.
His concern is that the evolutionary detail about the proteins may get lost. In some cases it may be necessary to not use the term “homology” but rather to state which gene domains are shared and which are not.
Haussler noted that while looking at network connectivity of proteins is a “reasonable way to resolve” the problems with multidomain protein homology that many people “try to shove under a rug,” the approach “won’t be the last word” in the field.
“There are a number of approaches and I am anticipating that the authors of these methods will come back with their own take, that this will be a lively area of investigation,” he said.
Homolog predictions based on the Neighborhood Correlation method and a web-based visualization tool are available online.