Homology: Genealogy for Genes


The great power of model systems in molecular biology has been apparent ever since early researchers used bacteria, yeast, worms, and flies to learn about the human body. More recently, the power of comparative genomics has been harnessing evolution to help identify the most obviously important parts of our DNA by linking a piece of one genome to a corresponding piece of another genome.

On a molecular level, both of these approaches can require that we link genes that are homologous, i.e., that share a common evolutionary origin. Even in a junior high school biology class, one can easily define homologous features (such as genes, proteins, or even structures) as those that arise from the same ancestral entity. Devising and applying an operational definition of homology that is practical and comprehensive, however, keeps a lot of biologists and bioinformaticians quite busy. Here we discuss some methods of assigning homology, along with some of the challenges that show why this problem can't be effectively addressed with basic sequence comparisons alone.

Homology has been described by David Wake as "the central concept for all of biology." As bioinformatics people, we're often asked to identify the homolog of a human disease-causing gene in another species, whether that be mouse, zebrafish, yeast, or another model system. Just about any molecular biologist can now use Blast to search a database of, for example, zebrafish proteins with a human protein and identify the most similar one. Is the top hit the homolog we're looking for? We can't be sure, and this gets at the crux of the definition: homology, or more specifically orthology (separated by speciation) and paralogy (separated by gene duplication), is a hypothesis that reflects a history of shared origin that can be supported but not unequivocally proven. We can quantify similarity between proteins or gene sequences using percent identity, length of alignment, or even domain structure, but we can't quantify homology; either features are homologous or they aren't.
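To make the distinction between similarity and homology concrete, here is a minimal sketch of one similarity measure, percent identity, computed from a pairwise alignment. The function name, the convention of skipping gapped columns, and the two toy sequences are our assumptions for illustration; alignment programs differ in how they count gaps.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up aligned fragments, not real proteins.
human = "MKT-AYIAKQR"
fish = "MKTLAYLAK-R"
print(round(percent_identity(human, fish), 1))
```

The result is a number on a continuous scale. Whether the two sequences are homologs remains a yes-or-no hypothesis that this number can support but not settle.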

Homology is commonly interpreted to mean present in the last common ancestor, so even if all proteins evolved from the same good bits of primordial soup, knowing this distant shared ancestry isn't so useful. To assign pairs of homologs A and B across species, genome-scale analyses often go at least one step further than our Blast search above, requiring that B is the most similar protein to A and vice versa. If we find an orthology pair like this, we're in good shape, but do we want to further restrict our measure of similarity? What if we can generate only a short local alignment? What if a similar analysis of gene sequences is inconsistent? What if different scoring matrices produce different results? We'll probably try to optimize the details of our homology search for the specific use of these data, but our homology assignments will still be open to debate. Also, unless this homology is well established, we'll want to make sure to explain our operational definition.
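The "most similar to each other in both directions" requirement described above is often called a reciprocal best hit. A minimal sketch follows, assuming the Blast results have already been parsed into (query, subject, bit score) tuples; the function names, gene identifiers, and scores are hypothetical.

```python
def best_hits(hits):
    """Return {query: best_subject} from (query, subject, bit_score) tuples."""
    best = {}
    for query, subject, score in hits:
        # Keep the highest-scoring subject; on a tie, the first one seen wins.
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy data: hsA/hsB are human proteins, drA/drB are zebrafish proteins.
human_vs_fish = [("hsA", "drA", 250.0), ("hsA", "drB", 180.0),
                 ("hsB", "drA", 200.0), ("hsB", "drB", 90.0)]
fish_vs_human = [("drA", "hsA", 248.0), ("drA", "hsB", 199.0),
                 ("drB", "hsA", 181.0), ("drB", "hsB", 91.0)]
print(reciprocal_best_hits(human_vs_fish, fish_vs_human))
```

In this toy case only hsA and drA pair up. hsB's best hit is drA, but drA prefers hsA, so hsB is left without a partner, which is exactly the kind of case that needs a closer look.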

Digging deeper

What if our operational definition of homology doesn't turn up any orthologs of our favorite human gene using a reciprocal Blast search? Has the missing gene just not been sequenced or annotated yet? Or is it actually missing from the genome of our favorite model organism? How about if the human gene is present by name in the other species? All of these possibilities may need investigating. It would be much easier for us if orthologs had the same names in different species, but even some genes that do have the same names don't appear to be orthologous, now that we have more complete gene catalogs. Some of these cases fit into the unfortunately named category of "functional homologs," which are proteins with similar functions but not of shared evolutionary origin (and therefore not actual homologs).

It would be very convenient for biologists and database administrators if all orthologs were clearly 1:1 where, for example, one human gene is orthologous to one mouse gene. If analysis of the mouse genome shows that virtually all human protein-coding genes have mouse orthologs, then why is it so hard to link every human gene to a mouse gene? Gene duplication and subsequent divergence, giving rise to paralogs, can make determination of orthology much trickier. If we discover two obvious human paralogs and two mouse paralogs, all of which appear to be homologs, how can we figure out which mouse gene is the ortholog of each human gene? If gene duplications occurred after speciation, then there may not be any 1:1 orthologs.

These 1:many or many:many homology relationships create extra challenges for comparing genome-scale datasets across species. On the other hand, if it appears that a gene duplication event occurred before speciation, we can try to resolve multiple homologs into 1:1 orthologs. All of this can be done better now than ever before, in part thanks to improved genome assemblies and gene sets. Better assemblies can also reduce strange observations, such as one we made recently when looking at a collaborator's favorite gene in a fish. The fish genome assembly and gene prediction pointed to this gene's presence in a set of a couple dozen highly similar paralogs, a degree of gene expansion that was absent in other species. Further investigation led to the much less interesting explanation that the expansion of this repeat-flanked gene was very recent, having just occurred in the most recent genome assembly.
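When comparing genome-scale datasets, it helps to know up front how many of each relationship type we are dealing with. The sketch below labels each ortholog pair as 1:1, 1:many, or many:many by counting how many partners each gene has in the other species. The function name and the gene identifiers are hypothetical, and the input is assumed to be a list of already-inferred ortholog pairs.

```python
from collections import defaultdict


def classify_pairs(pairs):
    """Label each (gene_a, gene_b) ortholog pair by its relationship type."""
    partners_of_a = defaultdict(set)
    partners_of_b = defaultdict(set)
    for a, b in pairs:
        partners_of_a[a].add(b)
        partners_of_b[b].add(a)

    labels = {}
    for a, b in pairs:
        a_has_many = len(partners_of_a[a]) > 1  # a pairs with several genes in species B
        b_has_many = len(partners_of_b[b]) > 1  # b pairs with several genes in species A
        if a_has_many and b_has_many:
            labels[(a, b)] = "many:many"
        elif a_has_many or b_has_many:
            labels[(a, b)] = "1:many"
        else:
            labels[(a, b)] = "1:1"
    return labels


# Toy human (H*) and mouse (M*) gene identifiers.
pairs = [("H1", "M1"),
         ("H2", "M2a"), ("H2", "M2b"),
         ("H3a", "M3a"), ("H3a", "M3b"), ("H3b", "M3a"), ("H3b", "M3b")]
for pair, label in classify_pairs(pairs).items():
    print(pair, label)
```

Only the pairs labeled 1:1 can be carried straight across species in a simple lookup table; the others need a decision about how to combine or split the data.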

Biomedical researchers who use mammalian model systems have a much easier time identifying homologs of human genes than others who experiment on worms, flies, and yeast. First, the genes themselves have had much less time to diverge, so the orthologs are much more similar. Second, the genomes have had much less time to diverge, so chromosomes have much longer conserved syntenic blocks. As a result, if mystery gene B is flanked by genes A and C in human, each of which has a clear ortholog in mouse (A' and C') and these lie close to each other on the same chromosome, we can look between these mouse genes to try to find B'. This conserved surrounding genome environment is stronger evidence, in addition to protein and/or gene similarity, that genes are really homologs and not just similar genes. On the other hand, alignment of genomes is not a solved problem, and alignment gaps do not always mean lack of homology.
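The flanking-gene idea can be sketched in a few lines. Assuming we have mouse gene annotations as a mapping from gene name to chromosome and start coordinate (the names, coordinates, and function below are all made up for illustration), we list the genes lying between the orthologs of A and C as candidates for B'.

```python
def genes_between(annotations, left_anchor, right_anchor):
    """List genes located between two anchor genes, ordered by position.

    annotations maps gene name -> (chromosome, start coordinate).
    Returns an empty list if the anchors sit on different chromosomes.
    """
    chrom_left, pos_left = annotations[left_anchor]
    chrom_right, pos_right = annotations[right_anchor]
    if chrom_left != chrom_right:
        return []  # no shared syntenic block to search
    low, high = sorted((pos_left, pos_right))
    inside = [(start, gene)
              for gene, (chrom, start) in annotations.items()
              if chrom == chrom_left and low < start < high]
    return [gene for _, gene in sorted(inside)]


# Toy mouse annotation; A_prime and C_prime stand for the orthologs of human A and C.
mouse = {
    "A_prime": ("chr5", 1000),
    "X1": ("chr5", 5000),
    "X2": ("chr5", 9000),
    "C_prime": ("chr5", 12000),
    "Y": ("chr7", 6000),
}
print(genes_between(mouse, "A_prime", "C_prime"))
```

The candidates returned here (X1 and X2) would then be compared with human gene B by sequence similarity. An empty list does not prove the gene is absent, since it may simply be missing from the assembly or annotation.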

The most powerful current methods use information from multiple species at once, and this orthology determination benefits from the ever-growing number of genome assemblies. The Ensembl project, for example, leverages the power of comparative genomics by using Blast to search with each gene against all other genes (species by species), clustering the similar sequences, building multiple sequence alignments, and then generating phylogenetic trees which can be compared to a species tree. The use of sequences at different phylogenetic distances helps resolve a lot of cases that would be difficult to figure out with only a pair of species at a fixed distance. This brings up the final way to determine homology: consult a reliable database that has already done the best possible large-scale homology analysis. These databases aren't foolproof, but they're a great place to start.
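To give a feel for how a gene tree separates orthologs from paralogs, here is a minimal sketch using the simple "species overlap" rule: if the two branches below a node share any species, the node is treated as a duplication, otherwise as a speciation. This is a lighter-weight stand-in for full reconciliation against a species tree, not the Ensembl pipeline itself. The nested-tuple tree format, the "species|gene" leaf naming, and the function names are our assumptions.

```python
def leaves(tree):
    """Flatten a nested-tuple binary tree into a list of leaf names."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def species_of(leaf):
    """Leaves are named 'species|gene'; return the species part."""
    return leaf.split("|")[0]


def label_pairs(tree, out=None):
    """Label every pair of leaves as 'orthologs' or 'paralogs'."""
    if out is None:
        out = {}
    if isinstance(tree, str):
        return out
    left, right = tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    shared = {species_of(x) for x in left_leaves} & {species_of(y) for y in right_leaves}
    # Shared species below both branches implies a duplication at this node.
    relation = "paralogs" if shared else "orthologs"
    for x in left_leaves:
        for y in right_leaves:
            out[tuple(sorted((x, y)))] = relation
    label_pairs(left, out)
    label_pairs(right, out)
    return out


# Duplication before speciation: two human and two mouse paralogs resolve into 1:1 orthologs.
before = (("human|G1", "mouse|G1"), ("human|G2", "mouse|G2"))
# Duplication after speciation, in the human lineage only: no 1:1 ortholog exists.
after = (("human|G1a", "human|G1b"), "mouse|G1")

for name, tree in (("before", before), ("after", after)):
    print(name, label_pairs(tree))
```

In the first tree each human gene gets exactly one mouse ortholog; in the second, both human copies are orthologs of the single mouse gene. Adding leaves from more species gives the rule more nodes to work with, which is one reason multi-species analyses resolve cases that a single pair of genomes cannot.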

Just as with other scientific statements, we shouldn't believe everything we read or hear about an inferred homology relationship. If a relationship is important to us, we probably need to investigate the genes further so we can hopefully convince ourselves and others: either they are homologous or they aren't.

Fran Lewitter, PhD, is director of bioinformatics and research computing at Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a senior bioinformatics scientist in Fran's group.
