University of Texas Team Automates Sequence Alignment, Phylogenetic Tree Generation

NEW YORK (GenomeWeb News) – Researchers from the University of Texas at Austin have developed a new method — dubbed simultaneous alignment and tree estimation, or SATé — for estimating DNA alignment as a phylogenetic tree is constructed.

Instead of aligning sequences and generating phylogenetic trees in two steps, the team used an approach based on maximum likelihood analysis to iteratively add sequence alignment data as phylogenetic trees are generated. Based on their assessment of real and simulated data sets, they concluded that the approach is a fast and accurate way to look at the evolutionary relationships between as many as 1,000 sequences. The work appeared online today in Science.

"It's fundamentally a change in the alignment process," senior author Tandy Warnow, a computer science researcher at the University of Texas at Austin, told GenomeWeb Daily News. "The tree part of it, in some sense, is standard."

Still, some are skeptical about the new method. In a Perspectives article appearing in the same issue of Science, European Bioinformatics Institute researchers Ari Löytynoja and Nick Goldman said that while the new method is faster and more accurate than others for assessing hundreds or even a thousand sequences, it does not seem to be better for smaller data sets.

"People should not start believing that they can suddenly get accurate alignments and trees for thousands of sequences," Löytynoja told GenomeWeb Daily News.

Most phylogeny estimates involve two steps: researchers painstakingly align sequences and then feed that alignment into a program for generating a phylogenetic tree. This process is tricky and time-consuming, Warnow explained. "That two-step process is really deeply ingrained in the community," she said, "but people are really frustrated with it."

Such analyses are particularly troublesome for very large data sets, the authors noted — or when the phylogenetic tree contains sequences with numerous insertions or deletions or from widely divergent species.

To address some of these problems and develop a fast, accurate method of their own for phylogenetic analyses, the researchers started by studying existing methods to understand how they work and where they fail, Warnow explained.

Based on these analyses, the researchers came up with SATé, which uses a divide-and-conquer approach for sequence alignments, tweaking the alignment with each round while generating the phylogenetic tree.

"Sequences in each subset are realigned, and the alignments are progressively merged using the current tree as a guide tree, into an alignment on the full data set," the authors wrote.
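For readers who think in code, one round of the divide-and-conquer idea described above might be sketched roughly as follows. This is a toy illustration only, not the SATé software: the tree representation (nested pairs of names), the function names, and the gap-padding "aligner" and "merger" are all hypothetical stand-ins for the real alignment and merging programs, and the tree re-estimation step that would follow each round is left out.

```python
# Toy sketch of one divide-and-conquer round. All names are hypothetical
# stand-ins; this is NOT the SATé implementation.

def leaves(tree):
    """Return the leaf names of a tree given as nested 2-tuples of strings."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

def decompose(tree, max_size):
    """Split the current tree into clades of at most max_size leaves."""
    names = leaves(tree)
    if len(names) <= max_size or isinstance(tree, str):
        return [names]
    left, right = tree
    return decompose(left, max_size) + decompose(right, max_size)

def align_subset(seqs):
    """Placeholder aligner: right-pad with gaps to a common length.
    A real pipeline would call an actual alignment program here."""
    width = max(len(s) for s in seqs.values())
    return {name: s.ljust(width, "-") for name, s in seqs.items()}

def merge(alignments):
    """Placeholder merger: pad every sub-alignment to one common width.
    A real pipeline would merge along the guide tree instead."""
    width = max(len(next(iter(a.values()))) for a in alignments)
    merged = {}
    for aln in alignments:
        for name, row in aln.items():
            merged[name] = row.ljust(width, "-")
    return merged

def one_round(tree, seqs, max_size):
    """Decompose by the current tree, realign each subset, then merge."""
    subsets = decompose(tree, max_size)
    sub_alns = [align_subset({n: seqs[n] for n in names}) for names in subsets]
    return merge(sub_alns)

if __name__ == "__main__":
    tree = (("A", "B"), ("C", ("D", "E")))
    seqs = {"A": "ACGT", "B": "ACG", "C": "ACGTT", "D": "AGT", "E": "ACGTTA"}
    for name, row in one_round(tree, seqs, max_size=2).items():
        print(name, row)
```

In a full iterative scheme, a new tree would be estimated from the merged alignment and the round repeated, keeping whichever alignment and tree pair scores best.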

They subsequently developed two SATé variants with distinct speed and accuracy characteristics.

Next, the researchers compared these SATé methods to two-phase models based on several other alignment methods using real and simulated data. While other methods tended to have more and more errors as the number of taxa or rate of evolution increased, the authors reported that "SATé24's error rates generally increased much less than those of the other methods."

Based on these comparisons, Warnow said, it appeared that phylogenetic trees generated by SATé were — on average — closer to the curated and/or reference trees than trees developed using other methods. "My guess is we will have some error — but less error," Warnow said.

In their Perspectives article, though, Löytynoja and Goldman suggest that error is unavoidable given the data size, speed of analysis, and methods used.

Although they credited the method as a "good reminder that alignment and phylogeny should not be considered in isolation," the pair said the new technique "raises some key questions and further challenges."

"What we criticize is that they try to make it so fast that they have to make significant shortcuts in the methods used," Löytynoja said. Because sequence analysis is computationally demanding, he added, trying to generate a phylogenetic tree based on more than a few tens of sequences will almost inevitably lead to approximations.

Overall, Löytynoja said, the work is a step in the right direction. But, he added, it's unclear whether this relative improvement moves the process from "horrible" to "very bad" or from "reasonable" to "rather good."

The pair also questioned the authors' use of the maximum likelihood score to evaluate the alignments, since traditional maximum likelihood methods treat gaps as missing data. The purpose of an alignment, they argue, is to place those gaps correctly.

"The way they did it was using fast methods that don't treat insertions and deletions as evolutionary events," Löytynoja said.

Study author Warnow also noted that the maximum likelihood method can't assess gap events. That means integrating information about insertions and deletions into phylogenetic trees can still only be done with small data sets. In the future, she said, researchers will need to find a way to include indel information in maximum likelihood models in a scalable manner.

Still, the researchers have continued improving SATé since writing today's paper. Warnow said it's already possible to scale up the approach to more than 1,000 sequences, though finding the space required to process these large data sets can be challenging.

The team is working to make SATé freely available to the research community through the Cyberinfrastructure for Phylogenetic Research portal. It can also be downloaded from the paper's supplementary online material, Warnow said, and will be available through the researchers' web site.