NEW YORK (GenomeWeb) – Using an experimentally derived phylogeny, a team of Georgia Tech researchers has benchmarked several algorithms that reconstruct ancestral sequences.
Such ancestral sequence reconstructions (ASRs) can help researchers peer into past protein function or provide jumping-off points for protein engineering. But as ancestral sequences have been lost to time, judging how well the algorithms perform has largely relied on computer simulations.
To offer another way to validate these approaches, Georgia Tech's Eric Gaucher and his lab exposed a red fluorescent protein gene to rounds of random mutagenesis to develop a phylogeny with 19 operational taxonomic units that serve as 'modern' sequences. They then used five ASR approaches to infer the known 'ancient' sequences from the modern ones. As the team reported in Nature Communications today, each of those approaches did fairly well.
This strategy, Gaucher noted, drew upon work by the University of Texas's David Hillis and Jim Bull from the early 1990s in which they developed a phylogeny from a single virus to test phylogenetic algorithms.
"I wanted to see if I could generate an experimental phylogeny that could validate the algorithms used to infer ancient sequences," Gaucher told GenomeWeb.
He and his colleagues performed random mutagenesis PCR on a single red fluorescent protein gene, and after each round, one descendant was chosen for the next bout of mutagenesis. When the researchers wanted to create a bifurcation in the phylogeny, though, two descendants progressed onward. From this, they developed a phylogenetic tree with 17 splits and 19 'modern' descendants.
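The branching scheme described above can be illustrated with a small simulation. This is only a minimal sketch, not the lab's protocol: the names `mutate` and `evolve`, the substitution rate, the number of rounds per branch, and the random choice of which lineage to split are all assumptions made for illustration.

```python
import random

BASES = "ACGT"

def mutate(seq, rate, rng):
    """Return a copy of seq with each base substituted with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            # Pick a different base, so a mutation always changes the site.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def evolve(root, n_leaves, rounds_per_branch=3, rate=0.01, seed=0):
    """Grow a bifurcating tree from `root` until it has `n_leaves` tips.

    Returns (ancestors, leaves). Every ancestor is recorded at the moment
    it splits, so the 'ancient' sequences are known by construction.
    """
    rng = random.Random(seed)
    lineages = [root]
    ancestors = []
    while len(lineages) < n_leaves:
        # Hypothetical choice: split a randomly selected living lineage.
        parent = lineages.pop(rng.randrange(len(lineages)))
        ancestors.append(parent)
        for _ in range(2):  # two descendants progress onward at a split
            child = parent
            for _ in range(rounds_per_branch):  # one descendant kept per round
                child = mutate(child, rate, rng)
            lineages.append(child)
    return ancestors, lineages
```

Calling `evolve("ACGT" * 25, n_leaves=4)` returns the recorded ancestors alongside four tip sequences, which is the property that makes such a phylogeny usable as a benchmark.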
As the fluorescent protein gene sequence changed, so did the protein's color. The original gene encoded a red protein and its descendants include red, green, and blue proteins, among others. Gaucher noted that they intentionally chose a gene family that displayed a range of phenotypes, so various characteristics could evolve throughout their phylogeny.
The researchers then subjected those 19 'modern' sequences to ASR analyses to see whether they could infer the ancestral sequences. The analyses used PAML, FastML, and PhyloBayes, with or without rate variation modeled as a gamma distribution, as well as parsimony. Gaucher and his colleagues reported that all approaches largely recapitulated reality, though they were better at inferring more derived nodes than more basal nodes.
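To give a sense of how a parsimony-based inference works, the sketch below implements the bottom-up pass of Fitch's algorithm, a textbook parsimony method. The article does not say which parsimony implementation the team used, so this is an illustrative stand-in; the tree encoding (nested two-element tuples of leaf names) and the function names are assumptions.

```python
def fitch_sets(tree, states):
    """Bottom-up Fitch pass for one alignment column.

    tree   -- a leaf name (str) or a (left, right) tuple of subtrees
    states -- dict mapping each leaf name to its character at this column
    Returns the set of equally parsimonious states at the root of `tree`.
    """
    if isinstance(tree, str):
        return {states[tree]}
    left = fitch_sets(tree[0], states)
    right = fitch_sets(tree[1], states)
    common = left & right
    # Intersection if the children agree on something, otherwise union.
    return common if common else left | right

def root_candidates(tree, alignment):
    """Apply the Fitch pass column by column to aligned leaf sequences."""
    length = len(next(iter(alignment.values())))
    return [
        fitch_sets(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(length)
    ]

# Toy example with four 'modern' sequences of two sites each.
tree = (("a", "b"), ("c", "d"))
alignment = {"a": "AC", "b": "AC", "c": "AG", "d": "AT"}
candidates = root_candidates(tree, alignment)
```

In the toy example, the first site resolves to a single state, while the second remains ambiguous among three. A full reconstruction would add a top-down pass to choose one state per node.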
Total accuracy, they reported, ranged from 97.88 percent to 98.17 percent. Gaucher noted that their computer simulations had already suggested the algorithms performed fairly well, but that the experimental phylogeny has now validated that. "It provides much more confidence," he said.
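An accuracy figure of this kind can be computed by comparing each inferred ancestor with its known counterpart, site by site. The article does not define how "total accuracy" was calculated, so the sketch below assumes it is the percentage of correctly inferred sites pooled across all ancestral nodes; the function name and the node labels are hypothetical.

```python
def site_accuracy(true_seqs, inferred_seqs):
    """Percentage of sites inferred correctly, pooled over all ancestral nodes.

    Both arguments map a node label to its aligned sequence.
    """
    matches = 0
    total = 0
    for node, truth in true_seqs.items():
        guess = inferred_seqs[node]
        if len(guess) != len(truth):
            raise ValueError(f"length mismatch at node {node!r}")
        matches += sum(t == g for t, g in zip(truth, guess))
        total += len(truth)
    return 100.0 * matches / total

# Toy example: one wrong site out of eight.
known = {"n1": "ACGT", "n2": "ACGA"}
inferred = {"n1": "ACGT", "n2": "ACGT"}
accuracy = site_accuracy(known, inferred)  # 87.5
```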
However, the algorithms sometimes gave different answers at the phenotypic level, even though most of the underlying sequence was correct, Gaucher said. He and his team synthesized, expressed, and purified proteins based on inferred ancestral sequences to gauge the inferred phenotype. From this, they reported that the approaches didn't always properly deal with homoplasy, or convergent evolution.
Gaucher said that his team's findings provide greater confidence in ASR methods, but the results also point out areas where algorithms could be honed. "We think we'll be able to improve our algorithms," he said, and "hopefully it will convert naysayers into believers that it's an accurate and legitimate methodology."