A team of researchers has developed an approach for reconstructing phylogenetic trees from whole-genome next-generation sequence data that they claim avoids the errors and biases associated with existing approaches without compromising accuracy.
They've made the method available in an online tool called the Reference Sequence Alignment-based Phylogeny Builder (REALPHY). It's also described in a paper published this week in the journal Molecular Biology and Evolution. There, the researchers explain that their method builds trees by combining alignments from reads mapped to multiple reference sequences, and they describe how this approach addresses the limitations of existing methods used for the task.
The traditional approach to tree building, which is accurate but more time-consuming, involves first "assembling the reads into contigs, annotating open reading frames, identifying orthologous open reading frames across all genomes, aligning orthologous coding regions, and reconstructing a phylogenetic tree from these multiple alignments," the paper explains. The trees are built using maximum likelihood-based tools such as PhyML or Bayesian tools such as PhyloBayes.
Other programs use a less cumbersome approach. They build trees by mapping raw reads to a single reference genome. Then, "homologous sites from all taxa — and in some studies only those sites containing single nucleotide polymorphisms — are … concatenated into a multiple sequence alignment from which the phylogenetic tree is reconstructed," the paper explained. The problem with this method is that aligning all sequences to a single reference introduces errors in the results, the researchers wrote. "For example, reads with more SNPs are less likely to successfully and unambiguously align to the reference sequence, as is common in alignments of more distantly related taxa," resulting in a bias toward more closely related organisms that "may affect the inferred phylogeny," they said.
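The mapping bias the researchers describe can be illustrated with a toy model. The sketch below is not a real read mapper and is not taken from the paper: it assumes, purely for illustration, that a read "maps" only if it differs from the reference window at the same position by at most a fixed number of mismatches. The sequences, the function names, and the threshold are all hypothetical.

```python
def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mapped_fraction(reference: str, genome: str, read_len: int, max_mismatches: int) -> float:
    """Fraction of non-overlapping reads cut from `genome` that 'map' to `reference`.

    Toy rule (an assumption, not how Bowtie works): a read maps if it has at
    most `max_mismatches` differences from the reference window at the same offset.
    """
    starts = range(0, len(genome) - read_len + 1, read_len)
    mapped = sum(
        mismatches(genome[s:s + read_len], reference[s:s + read_len]) <= max_mismatches
        for s in starts
    )
    return mapped / len(starts)


reference = "ACGTACGTACGTACGTACGT"
close_taxon = "ACGTACGAACGTACGTACGT"    # 1 difference from the reference
distant_taxon = "ACTAACGTACGTTCCTAGGT"  # 5 differences from the reference

close_mapped = mapped_fraction(reference, close_taxon, read_len=5, max_mismatches=1)
distant_mapped = mapped_fraction(reference, distant_taxon, read_len=5, max_mismatches=1)
print(close_mapped, distant_mapped)
```

In this toy setup every read from the closely related taxon maps, while half of the reads from the more divergent taxon are lost, so the divergent regions are the ones that drop out of the alignment.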
Also, since "maximum likelihood methods explicitly estimate branch lengths, including only alignment columns that contain SNPs and excluding columns that are non-polymorphic may also affect the topology of the inferred phylogeny," they wrote. They include examples in the paper using simulated and real data that show how and under what circumstances errors occur when either of the aforementioned methods is used to build trees.
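A simple way to see why dropping non-polymorphic columns matters is to compare per-site distances before and after filtering. The sketch below uses a made-up three-taxon alignment and the plain proportion of differing sites (p-distance) as a stand-in. Maximum likelihood branch-length estimation is considerably more involved, so this is only an illustration of the direction of the effect, not the paper's analysis.

```python
def snp_columns(alignment: dict) -> list:
    """Indices of columns in which at least two taxa differ."""
    seqs = list(alignment.values())
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]


def p_distance(a: str, b: str) -> float:
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Hypothetical alignment: 10 columns, only 2 of them polymorphic.
alignment = {
    "taxonA": "ACGTACGTAC",
    "taxonB": "ACGTACGTAT",
    "taxonC": "ACGAACGTAT",
}

cols = snp_columns(alignment)
snp_only = {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}

full_ab = p_distance(alignment["taxonA"], alignment["taxonB"])
snp_ab = p_distance(snp_only["taxonA"], snp_only["taxonB"])
print(cols, full_ab, snp_ab)
```

With all ten columns, taxonA and taxonB differ at 10 percent of sites. With only the two SNP columns retained, they appear to differ at 50 percent of sites, because the invariant columns that anchor the per-site rate have been discarded.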
They found, for example, that on simulated data with no recombination events, PhyML's tree reconstructions were accurate, but "when a sufficient amount of recombination was incorporated, phylogeny reconstruction was no longer error-free." Other tests showed that methods that construct trees using just SNP position information are "unreliable," according to the paper. "We identified 131 different parameter settings [for example, tree shape and divergence] for which incorrect topologies were inferred for a fraction of the datasets, even in the absence of recombination," they wrote. Adding in recombination but still using just SNP positions produced trees that were even more unreliable, they said. Using only SNP positions as input also affected branch lengths, which were estimated less accurately for trees constructed in this fashion.
REALPHY addresses this problem, according to its developers, by "merging alignment columns from mappings to different references into a final non-redundant alignment [whilst] ensuring that each genomic position from each reference occurs in, at most, one column of the final alignment, and that conflicts between the mappings using different references are resolved."
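The non-redundancy constraint in that description can be sketched in a few lines. The code below is not REALPHY's implementation, and the article does not say how the tool resolves conflicts. It assumes, for illustration only, that each alignment column records which reference positions it covers, and that a column is dropped if any of those positions is already represented. The `Column` type, the `merge_columns` function, and the example data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    anchors: frozenset  # {(reference_name, position), ...} covered by this column
    bases: tuple        # one base per taxon, in a fixed taxon order


def merge_columns(per_reference_columns):
    """Merge columns from several reference mappings into one non-redundant list.

    Assumed conflict rule: first come, first served. A column is skipped if any
    of its reference positions already appears in an accepted column.
    """
    used = set()
    merged = []
    for columns in per_reference_columns:
        for col in columns:
            if col.anchors & used:
                continue  # position already represented elsewhere; skip it
            used |= col.anchors
            merged.append(col)
    return merged


# Taxon order for `bases`: (refA, refB, sample1). "-" marks a missing base.
via_ref_a = [
    Column(frozenset({("refA", 0), ("refB", 5)}), ("A", "A", "G")),
    Column(frozenset({("refA", 1)}), ("C", "-", "C")),
]
via_ref_b = [
    Column(frozenset({("refB", 5)}), ("A", "A", "G")),  # duplicates a column above
    Column(frozenset({("refB", 6)}), ("-", "T", "T")),  # only recovered via refB
]

merged = merge_columns([via_ref_a, via_ref_b])
print(len(merged))
```

Here the column at position 5 of refB is found through both mappings but kept only once, while the column at position 6 of refB, which the refA mapping missed, is added to the final alignment.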
Simply put, it uses Bowtie to map the reads to several reference genomes, supplied in either FASTA or GenBank format, Frederic Bertels, a postdoctoral fellow in the Center for Molecular Life Sciences at the University of Basel and one of the co-authors on the paper, told BioInform. The references are usually a selection of genomes from the major clades of the species for which a tree is being built, he said. According to the REALPHY website, the system can use PhyML to infer the phylogenetic tree from the various alignments, or it can merge the individual alignments into a single alignment prior to constructing the tree, which increases the quality of the final result but is both time- and RAM-intensive. When they tested the approach on simulated data using the parameters that proved problematic for approaches that rely on just SNP position information, they were able to correctly reconstruct all four trees they set out to generate, according to the paper.
They also demonstrated the method's efficacy on real bacterial sequences, using strains of species such as Escherichia coli and Pseudomonas syringae. Their results, which are described in detail in the paper, showed that REALPHY "performs at least as well as classical methods that are more complex and time consuming, and can even outperform these methods when it is using a larger number of sites."
The method is designed primarily for reconstructing phylogenies of microbes (bacteria and single-celled eukaryotes), although it could, with some adjustments, potentially work for larger organisms, Bertels said. But, "such applications have not been tested yet, and … the computational resources that are required increase with the size of the input genomic data and may become prohibitive for large eukaryotic genomes that contain many repetitive sequences," according to the paper.