Alignment of multiple protein sequences is a traditional bioinformatics task that remains an effective way to investigate functionally important amino acids and phylogenetic relationships between proteins. If a given position or region of a protein has been conserved through evolution — while other parts of the protein have changed — it’s quite possible that this is for a good reason. As a result, a global multiple sequence alignment, such as between homologs, is often an early step in the characterization of a novel protein.
Two proteins (or gene sequences) can be optimally aligned on just about any computer, but each additional sequence makes an exact solution rapidly more expensive, until the problem is too hard for even today’s computers. Fortunately, we have a choice of several algorithms that are good at taking shortcuts to get a close-to-optimal alignment. Some of these are built into elegant applications that are easy to use and can produce publication-quality figures. Additionally, all of the ones described below are open-source and free.
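To see why the two-sequence case is so tractable, here is a minimal sketch of optimal global alignment scoring by dynamic programming (the Needleman-Wunsch approach, which this column does not name). The function name and the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions; real protein alignments would use a substitution matrix and tuned gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score for sequences a and b.

    Scoring values are illustrative assumptions, not a real
    substitution matrix.
    """
    # prev is the previous row of the table: aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diag = prev[j - 1] + pair   # align a[i-1] with b[j-1]
            up = prev[j] + gap          # gap in b
            left = curr[j - 1] + gap    # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

The table has one cell per pair of positions, so the work grows with the product of the two lengths. Doing the same exact calculation for many sequences at once needs one table dimension per sequence, which is why the heuristic methods discussed here exist.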
First Stop: ClustalX
The application of choice for many biologists for the past decade has been ClustalX, a graphical interface for the command-line ClustalW. These Clustal applications are based on global progressive alignment, in which the alignment starts with the two most similar sequences, followed by the next most similar sequence, and so on. Input can be a multiple-sequence FASTA file, but be sure to include your preferred labels as the first word of each sequence header. The interface is very intuitive, and the alignment can be saved as a text file or as a very attractive color PostScript image. The coloring scheme is based on both amino acid chemistry and percent identity for a column of aligned residues, so it’s easy to see the conserved region(s) of the set of proteins. ClustalX has color configuration files that can be modified by the user (to emphasize specific amino acids, for example), but in our experience it’s pretty tricky to get any serious modifications to work. To get graphics that work well in black and white, for example, we send our aligned sequences through BoxShade. Web versions of ClustalW (like the one at EBI) can produce color output, but it isn’t really good enough for publication.
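Because the first word of each header becomes the sequence label, it can be worth checking the labels before aligning. The following is a minimal sketch of such a check; the function name `read_fasta_labels` is hypothetical, and the parser handles only plain multi-FASTA text.

```python
def read_fasta_labels(text):
    """Parse multi-FASTA text into {label: sequence}.

    The label is the first word of each header line, and duplicate
    labels are rejected because they would be ambiguous downstream.
    """
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            if not words:
                raise ValueError("empty FASTA header")
            label = words[0]
            if label in records:
                raise ValueError(f"duplicate label: {label}")
            records[label] = []
        elif label is None:
            raise ValueError("sequence data before the first header")
        else:
            records[label].append(line)
    # Join the wrapped sequence lines of each record.
    return {name: "".join(parts) for name, parts in records.items()}
```

Printing the keys of the returned dictionary shows the labels that would appear in the alignment, which makes it easy to spot headers that begin with an unhelpful database identifier.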
With all alignment software, if the sequences are really similar, they align beautifully. If they’re quite different, however, the alignment can look like a complete mess, especially as sequence identity falls below the 30 percent “twilight zone” threshold for protein alignments. Then the question comes up: “Are the sequences even alignable, or are we just missing the correct alignment?” It may be that only some local regions of the sequences are alignable, and forcing a global alignment obscures this. But before limiting ourselves to a local alignment (as Blast does), it’s time to try some other global alignment methods.
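One way to find out whether a set has drifted into the twilight zone is to compute pairwise percent identity from an existing alignment. This is a minimal sketch under stated assumptions: the alignment is a dictionary of equal-length gapped rows, gaps are written as `-` or `.`, and identity is counted only over columns where both sequences have a residue (other definitions use a different denominator). The function names are hypothetical.

```python
from itertools import combinations

GAP_CHARS = "-."


def percent_identity(row_a, row_b):
    """Percent identity over columns where both aligned rows have a residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = identical = 0
    for x, y in zip(row_a.upper(), row_b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # skip columns with a gap in either row
        compared += 1
        identical += x == y
    return 100.0 * identical / compared if compared else 0.0


def pairs_below_threshold(alignment, threshold=30.0):
    """Return (label_a, label_b, identity) for every pair under the threshold."""
    flagged = []
    for (name_a, row_a), (name_b, row_b) in combinations(alignment.items(), 2):
        identity = percent_identity(row_a, row_b)
        if identity < threshold:
            flagged.append((name_a, name_b, identity))
    return flagged
```

A long list of flagged pairs suggests that the global alignment deserves extra scrutiny, or that only local regions may be alignable.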
If Clustal doesn’t do a very good job, we head to programs like Muscle, Mafft, ProbCons, and T-Coffee. The publication of every new alignment method seems to come with evidence that it’s better than all the rest, but in our experience all four of these tools generate alignments at least as accurate as those from Clustal. Some of them may be slower than Clustal, but that’s a small price to pay for better results. Besides the usual command-line operation, these applications are also hosted on Web servers, so their basic functionality is available to any biologist. Results can depend greatly on the input data; a certain algorithm may work much better than others on a specific protein set. We’ll concentrate on some practicalities; check out the publications for details of the algorithms.
Muscle (MUltiple Sequence Comparison by Log-Expectation) and Mafft (Multiple Alignment using Fast Fourier Transform) are two effective algorithms that are also efficient enough to work with big sequence sets of more than 100 proteins. ProbCons (PROBabilistic CONSistency) and T-Coffee (Tree-based Consistency Objective Function For alignmEnt Evaluation) can be even more accurate than Mafft and Muscle, although they’re slower (and won’t work with really big inputs). Some of these programs can be run with a range of iterations/speeds, so you can experiment with how this influences your alignment. T-Coffee can perform basic alignments, but it also offers a lot of functionality not found in other programs, like ways to compare and evaluate different alignments. Multiple alignments cannot be evaluated with the same rigorous statistics as pairwise alignments, so any quantitative measure of alignment quality is helpful. The T-Coffee tutorial explains all of this and includes lots of helpful program-independent theoretical considerations for multiple sequence alignment. For all four programs, after an alignment is saved as text, it can always be opened and visualized with a graphical interface like ClustalX or Jalview.
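As a rough illustration of what comparing two alignments can mean, the sketch below counts how many aligned residue pairs in a reference alignment also appear in a second alignment of the same sequences. This is a simple pair-agreement measure written for this example; it is not T-Coffee’s own evaluation method, and the function names are hypothetical.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (label, index in the ungapped sequence).
    """
    if not alignment:
        return set()
    labels = sorted(alignment)
    length = len(alignment[labels[0]])
    if any(len(alignment[label]) != length for label in labels):
        raise ValueError("aligned rows must have equal length")
    counters = {label: 0 for label in labels}
    pairs = set()
    for col in range(length):
        present = []
        for label in labels:
            if alignment[label][col] not in "-.":
                present.append((label, counters[label]))
                counters[label] += 1
        # Labels are sorted, so each pair is stored in a consistent order.
        pairs.update(combinations(present, 2))
    return pairs


def pair_agreement(test, reference):
    """Fraction of the reference's aligned pairs also found in test."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(aligned_pairs(test) & ref_pairs) / len(ref_pairs)
```

Running the same sequences through two programs and scoring one result against the other gives a quick sense of how much the methods disagree, and the regions where they differ are good candidates for closer inspection.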
Regardless of which tool we use for multiple sequence alignment, especially for sets of sequences with low identity, we often use additional biological information to verify and possibly manually edit the alignment. Several options are available, but our favorite is Jalview, a Java alignment editor first created by Michele Clamp and improved over the past several years. Alignments (using Clustal or Muscle) or secondary structure predictions can be performed via a Web service, but we use Jalview mainly for the ability to modify a previous alignment by sliding blocks of residues from one or more sequences. This works fine for a few changes, but with more than that we may be better off trying a different alignment method or algorithm. Often we end up deleting some sequences from our set just because they’re too similar (not informative) or too divergent (potentially messing up an otherwise good alignment), and sometimes we also want to select only specific aligned regions of the proteins. Jalview has a variety of coloring schemes, including those based on secondary structure and identity threshold. Like all of these tools, Jalview can export alignments in a variety of text formats, and it can also generate attractive PostScript images.
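The pruning and region selection described above can also be scripted. The following is a minimal sketch, assuming the alignment is a dictionary of equal-length rows with `-` for gaps. The function names and the 95 and 20 percent cutoffs are illustrative assumptions, not recommended values.

```python
def identity(row_a, row_b):
    """Percent identity over columns where neither row has a gap."""
    cols = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not cols:
        return 0.0
    return 100.0 * sum(x == y for x, y in cols) / len(cols)


def prune_alignment(alignment, too_similar=95.0, too_divergent=20.0):
    """Greedily drop near-duplicates and outliers (illustrative cutoffs).

    Sequences are visited in input order and the first is always kept,
    so the result depends on that order.
    """
    kept = {}
    for name, row in alignment.items():
        scores = [identity(row, other) for other in kept.values()]
        if scores and max(scores) >= too_similar:
            continue  # nearly identical to a kept sequence: not informative
        if scores and max(scores) < too_divergent:
            continue  # not close to anything kept so far: likely an outlier
        kept[name] = row
    return kept


def slice_columns(alignment, start, end):
    """Keep alignment columns start..end (0-based, end exclusive)."""
    return {name: row[start:end] for name, row in alignment.items()}
```

Because the pruning is greedy, putting the sequence of greatest interest first ensures it is retained. The pruned set can then be realigned or trimmed to the region of interest with `slice_columns`.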
We’d love to see a tool with an interface like ClustalX where one could apply any of the above alignment algorithms. Meanwhile, we need to jump around between several applications. Since these programs are all available on the command line and the Web, and they share several common alignment file formats, this isn’t too difficult. As long as we have ClustalX or Jalview to export alignments in PostScript, we can use any algorithm we want and end up with attractive, publication-quality color images.
Fran Lewitter, PhD, is director of bioinformatics and research computing at Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a bioinformatics scientist in Fran’s group.