Sifting Through Sequence


To analyze nucleic acid and protein sequences, biologists can choose from many good desktop software packages. These work well for the analysis of one or a few sequences, but when processing an entire genome, these solutions fail. To do these analyses on the command line or with scripting, computer-savvy biologists have fewer choices, but fortunately most are open source and free. This opens the possibility of building custom pipelines to analyze any number of sequences. Despite great progress in open-source software, however, every one of these packages has room for improvement.

Our favorite sequence analysis package is EMBOSS, the European Molecular Biology Open Software Suite. Another top choice for all-purpose analysis, although really a programming environment rather than a software suite, is BioPerl. Other packages devoted to more specific analysis but still of general interest include the ubiquitous NCBI Blast and Clustal.

EMBOSS and BioPerl

Development of EMBOSS began in 1997 in response to GCG’s decision to stop releasing its source code, and the first public version was released in 1999. The current release, 2.9, contains more than 150 applications, including most GCG applications and many others not found in GCG. We encourage our biologists and our programmers to use EMBOSS as an all-purpose sequence analysis package. It comes in Web versions as well as the traditional command-line version, and though it’s best known as a Unix package, it now installs relatively easily on Windows and Macintosh computers (although graphical outputs appear to be limited).

EMBOSS reads and writes all common sequence formats, so we are spared the creation of any more application-specific sequence formats. For new users of the command-line version, we’re very happy about the option to run programs in an interactive mode, where the user only needs to type the name of the application and is then prompted for file names as well as common options (with generally sensible defaults). For use within scripts, one can suppress the prompts with ‘-auto’, or have a program read from standard input (stdin) and write to standard output (stdout) with ‘-filter’. EMBOSS documentation states that it accepts multiple sequence files as input, yet this appears to apply to only some applications.
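For scripted use, a non-interactive EMBOSS call can be wrapped in a few lines. What follows is a minimal Python sketch, assuming EMBOSS is installed and ‘seqret’ is on the path. The helper names ‘emboss_command’ and ‘run_emboss’ and the file names are hypothetical, and the qualifiers shown (‘-sequence’, ‘-outseq’, ‘-auto’) should be checked against the documentation for the installed release.

```python
import subprocess


def emboss_command(program, qualifiers, auto=True):
    """Build an argument list for an EMBOSS program.

    `qualifiers` maps qualifier names (without the leading dash) to values.
    With auto=True, '-auto' is appended so the program does not prompt.
    """
    args = [program]
    for name, value in qualifiers.items():
        args.extend(["-" + name, str(value)])
    if auto:
        args.append("-auto")
    return args


def run_emboss(program, qualifiers):
    """Run the command; raises CalledProcessError if the program fails."""
    return subprocess.run(
        emboss_command(program, qualifiers),
        check=True,
        capture_output=True,
        text=True,
    )


# Hypothetical file names: reformat a sequence file with seqret.
cmd = emboss_command(
    "seqret", {"sequence": "input.fasta", "outseq": "output.fasta"}
)
print(" ".join(cmd))
```

Building the argument list separately from running it makes it easy to log or inspect a command before a long pipeline is launched.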

A complementary resource for sequence analysis is BioPerl, a Perl bioinformatics toolkit. It extends well beyond sequence analysis, but it includes many of the popular types of analyses found in EMBOSS. With many bioinformaticists already programming in Perl, it’s a great way to add biological functionality through use of the set of modules.

Like many modules, those in BioPerl (currently version 1.4) are written in an object-oriented style, which is an effective way to represent the data but quite challenging for new programmers to learn. Fortunately, the BioPerl tutorial and the growing collection of HowTos have lots of sample code. This partially makes up for module documentation that varies greatly in its helpfulness. BioPerl also has a module that provides an interface to EMBOSS, so one can access all of those tools from within Perl. Even more useful, however, are the output parsing capabilities of BioPerl, where one can manipulate output from a variety of other programs using Bio::Tools and other modules. When it works, BioPerl works beautifully, but when it fails, such as from errors in a script or in the input data, it fails far from gracefully.

Tracking and Alignment

A large part of sequence analysis is keeping track of all of one’s sequences. Although the most powerful solution is placing parsed annotation-rich files (such as GenBank format) into a relational database like MySQL, this is overkill for most of our projects. EMBOSS contains a system for indexing sequence “databases” (multiple sequence files), but we usually use Blast indexing and ‘fastacmd’, since we’ve already prepared the sequence sets for Blast. With sequence sets indexed for EMBOSS or NCBI Blast, one is generally limited to obtaining sequences by accession or GI, so at times we still need to use NCBI’s Entrez interface to query by specific fields.
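To make the idea of identifier-based indexing concrete, here is a small Python sketch that records where each record starts in a FASTA file and then retrieves a single sequence by its identifier. It illustrates the general technique only and is not how ‘formatdb’, ‘fastacmd’, or the EMBOSS indexing programs are implemented. The function names and the sample records are made up for the example.

```python
import io


def index_fasta(handle):
    """Map each record's identifier to the byte offset of its header line.

    The identifier is the first whitespace-delimited word after '>'.
    The handle must be opened in binary mode so offsets are byte positions.
    """
    index = {}
    offset = handle.tell()
    line = handle.readline()
    while line:
        if line.startswith(b">"):
            words = line[1:].split()
            if words:
                index[words[0].decode("ascii")] = offset
        offset = handle.tell()
        line = handle.readline()
    return index


def fetch(handle, index, identifier):
    """Return (header, sequence) for one identifier using the index."""
    handle.seek(index[identifier])
    header = handle.readline().decode("ascii").rstrip()[1:]
    chunks = []
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # reached the next record
        chunks.append(line.decode("ascii").strip())
    return header, "".join(chunks)


# Made-up records for illustration only.
data = io.BytesIO(
    b">seq1 first example\nACGT\nACGT\n>seq2 second example\nGGCC\n"
)
idx = index_fasta(data)
print(fetch(data, idx, "seq2"))
```

Because the index holds only identifiers and offsets, lookups by anything else (organism, keyword, feature) are not possible, which is the limitation noted above.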

In addition to indexing sets of sequences, the NCBI Blast suite offers a straightforward command-line interface for searching a sequence database. The suite works fine on desktop operating systems (as well as Unix, of course), as long as one knows where to find the command-line options. (By the way, the secret is to end ‘formatdb’ or ‘fastacmd’ with ‘ -’, that is, a space followed by a hyphen.) In addition to ‘blastall’, the NCBI Blast client (‘BLASTcl3’) has an identical interface but searches remote databases at NCBI, a way to avoid downloading huge sequence sets that are searched only infrequently.
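Once a search has run, its results usually need to be pulled into a script. The Python sketch below assumes ‘blastall’ was run with its tabular output option (‘-m 8’), which writes twelve tab-separated columns per hit. The field names, the ‘parse_tabular’ helper, and the sample lines are our own labels and made-up data, not anything defined by NCBI.

```python
from collections import namedtuple

# Field names are our own labels for the twelve tabular columns.
Hit = namedtuple(
    "Hit",
    "query subject identity length mismatches gap_opens "
    "q_start q_end s_start s_end evalue bit_score",
)


def parse_tabular(lines):
    """Parse tab-delimited Blast hits into Hit records."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        if len(f) != 12:
            raise ValueError("expected 12 columns, got %d" % len(f))
        hits.append(
            Hit(
                f[0], f[1], float(f[2]),
                int(f[3]), int(f[4]), int(f[5]),
                int(f[6]), int(f[7]), int(f[8]), int(f[9]),
                float(f[10]), float(f[11]),
            )
        )
    return hits


# Made-up hits for illustration only.
sample = [
    "q1\ts1\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-100\t350",
    "q1\ts2\t45.00\t80\t44\t2\t10\t89\t1\t80\t0.5\t30.1",
]
hits = parse_tabular(sample)
significant = [h.subject for h in hits if h.evalue < 1e-5]
print(significant)
```

Filtering on the E-value field, as in the last lines, is one common first step; the threshold used here is arbitrary.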

Performing alignments is another common task with many choices of algorithms. We like the ClustalW/ClustalX programs since we can perform the same actions using either command-line (ClustalW) or graphical (ClustalX) interfaces. We especially like ClustalX for its attractive color PostScript graphics, although adding a PDF option would be even better. But what we’d really like would be a ClustalX-like application that would let us plug in any desired alignment algorithm.

In addition to powerful all-purpose tools like EMBOSS and BioPerl, these other packages are just a few of the huge number of open-source tools designed for specific sequence analysis tasks. Our favorites do their job well but also provide effective interfaces for both the biologist and the experienced programmer. We particularly like those with output graphics suitable for publication. In the end, though, we can’t complain too much about the shortcomings of these programs, since — being open-source projects — we have the option of modifying them just how we want and sharing the results with our colleagues.

Fran Lewitter, PhD, is director of bioinformatics and research computing at Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a bioinformatics scientist in Fran’s group.
