A Holly-Jolly Season for Open Source

With so many goodies under the tree, bioinformaticists will be reveling in holiday cheer for months to come. We unwrap BioPerl and take it for a whirl.

Santa was very good to us bioinformatics kids this year. We asked for open source software and whoosh, he tossed down the chimney BioPerl, BioJava, Biopython, BioRuby, Biodas, BioMOBY, OBDA, EnsEMBL, EMBOSS, GMOD, MGED, and many more. It’s like asking for Winnie the Pooh and getting Eeyore, Rabbit, and Tigger too — with Tickle Me Elmo and Thomas the Tank Engine thrown in for good measure. It’ll take months just to open all the gifts!

But as parents and kids know too well, toys often look better on TV than in real life. You have to play with a toy to tell if it’s gonna be a constant companion or collect dust in the closet.

I decided to open BioPerl first. If you want, you can follow me while I see if BioPerl can really save the 100 Aker Wood, or if it’s just another stuffed animal.

On the Box

BioPerl is a collection of Perl modules that handle many common bioinformatics tasks. It is not a complete application (it doesn’t annotate genomes or find remote homologs) but rather a set of software building blocks that bioinformatics programmers can use to develop such applications. BioPerl is currently focused on sequence data but also has limited capabilities for maps, phylogenies, bibliographic information, and protein structure. There are plans to expand into more areas, including microarrays.

BioPerl is a community software effort. The current team of core developers consists of Ewan Birney and Heikki Lehvaslaiho of the European Bioinformatics Institute, Chris Dagdigian of BioTeam, Hilmar Lapp of the Novartis Research Foundation, Jason Stajich of Duke University, and Lincoln Stein of Cold Spring Harbor Laboratory. The BioPerl website lists an additional 41 major contributors.

The first public release, BioPerl 1.0, was in May 2002 and the next, BioPerl 1.2, is planned for this month or possibly early January 2003. (It has become customary in the open source community to use even release numbers, e.g., 1.0 and 1.2, for stable, public versions, and odd release numbers for less stable developer versions. Additional numbers are tacked on for bug fix releases, e.g., 1.0.2 is the second bug fix release for version 1.0.)
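The numbering convention is simple enough to state in a few lines of code. What follows is a minimal sketch in Python (not Perl, and not part of BioPerl); the function name classify_release is made up for illustration, and it assumes plain dotted numeric release numbers.

```python
def classify_release(version: str) -> str:
    """Classify a release number under the even/odd convention.

    An even second number means a stable public release, an odd one a
    developer release; a nonzero third number marks a bug fix release.
    """
    parts = [int(p) for p in version.split(".")]
    if len(parts) < 2:
        raise ValueError(f"expected at least major.minor, got {version!r}")
    kind = "stable" if parts[1] % 2 == 0 else "developer"
    if len(parts) >= 3 and parts[2] > 0:
        return f"{kind}, bug-fix release {parts[2]}"
    return kind


for v in ("1.0", "1.0.2", "1.1.1", "1.2"):
    print(v, "->", classify_release(v))
```

Under this reading, 1.0.2 comes out as a stable release with its second bug fix, and 1.1.1 as a developer release.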

BioPerl is a huge pile of software, at least by Perl standards. Version 1.0 contains about 450 Perl modules, version 1.1 has grown to almost 500 modules, and it’s likely that version 1.2 will be slightly larger. Version 1.1 totals about 120,000 lines of Perl program text.

The system has about 15 major parts, most of which deal with aspects of sequence manipulation. It can read and write sequences in a variety of formats, including GenBank, EMBL, FASTA, and others. It can retrieve sequences from indexed files stored locally as well as from databases accessible over the web. It can work with databases of sequence annotations expressed in General Feature Format; these databases can be implemented in AceDB or MySQL. (GFF is a text file format developed by Richard Durbin of the Sanger Institute and David Haussler of the University of California, Santa Cruz, for representing genome annotations and such.)
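For readers who haven’t met GFF, here is a rough sketch of what one line of it holds. It is written in Python rather than Perl and is not BioPerl’s GFF code. It assumes the usual layout of eight tab-separated columns (sequence name, source, feature type, start, end, score, strand, frame) plus an optional ninth attributes column; check the format description on the Sanger site before relying on the details. The names GffFeature and parse_gff_line, and the sample line, are invented for illustration.

```python
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the column holds "."
    strand: str
    frame: str
    attributes: str


def parse_gff_line(line: str) -> Optional[GffFeature]:
    """Parse one GFF line; return None for blank lines and comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) < 8:
        raise ValueError(f"expected at least 8 tab-separated columns, got {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame = cols[:8]
    attributes = cols[8] if len(cols) > 8 else ""
    return GffFeature(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),
        strand, frame, attributes,
    )


# A made-up annotation line, for illustration only.
sample = 'chr1\tcurated\texon\t1300\t1500\t.\t+\t.\tGene "hypothetical_gene_1"'
feat = parse_gff_line(sample)
print(feat.feature, feat.start, feat.end, feat.strand)
```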

BioPerl has building blocks for creating graphical displays of annotated sequences similar to those provided by the major genome browsers. It has wrappers for many sequence analysis programs, including BLAST, HMMER, Sim4, and all of EMBOSS. It can work with alignments generated by these programs or imported from other programs.

Toy Story

I dragged BioPerl across the playground with me for several days, in part to write this article, but also to evaluate it for a real project at work. I started with the public release (1.0.2) but moved to the developer release (1.1.1) because it had some new features I needed. I also chatted with Lincoln Stein, who provided some great tips, and I interacted by e-mail with Stein and a couple of other core developers who helped with particular issues.

I should tell you about my real project so you’ll understand the context. It involves curation of sequence data for a set of genes involved in a couple of diseases we’re studying. The first step is to collect all available sequences for these genes — genomic, full-length mRNA, reference sequences from NCBI, ESTs, and protein — for human, mouse, rat, and several lower organisms. We scrutinize all the available sequences and choose or construct one that we think is right.

This seems a perfect application for BioPerl. It’s mostly sequence hacking with some map work to check that human and mouse orthologs live in the expected syntenic regions. There’s a lot of database access to pull in the sequences of interest, and a lot of alignment work to see how the sequences relate. The dataset is not terribly large (a few hundred genes at most), so performance is not a big concern. Our methods are still in flux, so ease of use and flexibility are major issues.

As Seen on TV

BioPerl works. It’s not the glitziest toy in the store, but it does what it claims to do in a highly competent manner. That may not sound like a lot, but anyone who’s worked with new software will appreciate how refreshing this is.

It’s also easy to use (for programmers, of course) thanks to extreme Perl wizardry by the developers. For example, it takes just three lines of Perl to convert a GenBank file into FASTA (or any other supported format). It takes five lines to read two sequences in any supported format, align them, and print the result, and about 10 lines to read a file of annotated genomic contigs and extract the genes. And these programs aren’t just short — they are also quite natural.
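To give a feel for what the GenBank-to-FASTA conversion involves, here is a stand-in sketch in Python. It is not the three-line BioPerl program and uses none of BioPerl’s modules. It is a deliberately naive reader that assumes well-formed records with an ACCESSION line, an ORIGIN line, and a closing `//`. The function names and the sample record are hypothetical. Real GenBank entries are messier, which is exactly the work a toolkit saves you.

```python
import io
import re
from typing import Iterable, Iterator, TextIO, Tuple


def read_genbank(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (accession, sequence) pairs from a GenBank-style flat file."""
    accession, chunks, in_origin = "", [], False
    for line in handle:
        if line.startswith("ACCESSION"):
            fields = line.split()
            accession = fields[1] if len(fields) > 1 else ""
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # End of record: emit it and reset state for the next one.
            yield accession, "".join(chunks)
            accession, chunks, in_origin = "", [], False
        elif in_origin:
            # Sequence lines carry position numbers and spaces; keep letters only.
            chunks.append(re.sub(r"[^A-Za-z]", "", line))


def write_fasta(records: Iterable[Tuple[str, str]], out: TextIO, width: int = 60) -> None:
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for accession, seq in records:
        out.write(f">{accession}\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")


# A made-up record, for illustration only.
SAMPLE = """\
LOCUS       HYPO0001                  24 bp    DNA     linear   PRI 01-JAN-2003
DEFINITION  Hypothetical record for illustration.
ACCESSION   HYPO0001
ORIGIN
        1 acgtacgtac gtacgtacgt acgt
//
"""

out = io.StringIO()
write_fasta(read_genbank(io.StringIO(SAMPLE)), out)
fasta_text = out.getvalue()
print(fasta_text, end="")
```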

It’s pretty well written, too. Part of the fun of playing with open source software is that you can open the case to see how it works and learn new tricks. BioPerl is great for this.

Less fun, but extremely valuable, is the ability to examine the software while tracking down bugs. I found a fair number of bugs in the developer release — I’m not complaining, mind you, that’s what a developer release is for. None was serious and all were easy to fix or work around.

I know from sad experience that GenBank parsing is harder than it looks, because many GenBank entries break the rules in subtle ways. I did a quick test of BioPerl’s GenBank parser by feeding it the entire primate division of release 131.0. I compared BioPerl to an old handcrafted parser I wrote a few years ago, making sure each program found the same number of entries, and each got the same accession numbers and sequences. This is not a terribly rigorous test, but it’s something. The BioPerl parser spat out warnings for about four or five entries, but there was only one entry where its answers disagreed with my parser. Upon review, this turned out to be a bug in my code.
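The comparison itself is easy to sketch. The Python below is not the script used for the test; it only shows the three checks described (entry counts, accession numbers, sequences), assuming each parser’s output has been reduced to (accession, sequence) pairs. The name compare_parsers and the toy data are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str]  # (accession, sequence)


def compare_parsers(a: Iterable[Record], b: Iterable[Record]) -> Dict[str, object]:
    """Report where two parsers' outputs disagree."""
    list_a, list_b = list(a), list(b)
    seqs_a, seqs_b = dict(list_a), dict(list_b)
    shared = seqs_a.keys() & seqs_b.keys()
    return {
        "entry_counts": (len(list_a), len(list_b)),
        "only_in_a": sorted(seqs_a.keys() - seqs_b.keys()),
        "only_in_b": sorted(seqs_b.keys() - seqs_a.keys()),
        # Compare case-insensitively, since parsers may differ on letter case.
        "sequence_mismatch": sorted(
            acc for acc in shared if seqs_a[acc].lower() != seqs_b[acc].lower()
        ),
    }


# Toy outputs from two imaginary parsers.
parser_a = [("HYPO1", "acgt"), ("HYPO2", "ggcc")]
parser_b = [("HYPO1", "ACGT"), ("HYPO2", "ggca"), ("HYPO3", "tt")]
report = compare_parsers(parser_a, parser_b)
for key, value in report.items():
    print(key, value)
```

A disagreement in the report only tells you the two parsers differ, not which one is wrong; as noted above, the culprit may well be the parser you trusted.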

Batteries Not Included

Now for the bad news. Though BioPerl is easy to use, it’s hard to learn.

This is partly a documentation problem. Most of the documentation is automatically generated from inline commentary in the software itself, and is organized into a hierarchy that exactly reflects the structure of the software. So, to find something, you have to know which module to look in and where that module lives in the software hierarchy. This is a pretty big hill for a newbie to climb.

There is some additional documentation available in the form of tutorials and a short online course. This is good stuff, but it only gets you to the starting gate. You learn how to write the three-line file converter, but not how to use the software in detail.

The problem is compounded by a lack of coordination among the parts of the system, which results in different modules having overlapping capabilities. For example, several parts of the system deal with sequence features and related concepts. These include the SeqFeature family of modules (which is the official home for this capability) and its partner, the Location family. Some of Location’s basic capabilities are provided by a family of Range modules, and more advanced capabilities come from the Coordinate group of modules. The part of the system that deals with GFF databases also provides feature-like concepts, including some of the advanced capabilities in Coordinate. Gene-related features are supported by a collection of Gene modules that are subordinate to SeqFeature and a separate group that are part of the LiveSeq family.

This lack of coordination is a natural outcome of how BioPerl was developed. Different developers created the major module groups and endowed them with the capabilities needed for the problem at hand. Once the software is written, it’s a pain (and seemingly a waste of time) to remove redundant capabilities. But this creates a huge learning problem, since a new user has to understand many parts of the system to figure out which module to use for a given task.

A second major concern is performance. As everyone knows, Perl is not exactly fleet of foot, and BioPerl adds a lot of overhead to achieve flexibility and ease of use. In the GenBank parsing test mentioned above, BioPerl needed 16 minutes to parse the first file of GenBank’s primate division, compared to only two minutes for my handmade parser (also written in Perl). This was on a cheap 700 MHz Pentium III. This is not a real apples-to-apples comparison, because my parser does a lot less, but it gives a sense of the performance hit.
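A timing comparison like this is easy to repeat on your own data. Here is a rough sketch in Python; the helper names time_call and slowdown are invented for illustration, and the only numbers plugged in are the 16 and two minutes quoted above.

```python
import time
from typing import Any, Callable


def time_call(func: Callable[..., Any], *args: Any, repeats: int = 3) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def slowdown(slow_seconds: float, fast_seconds: float) -> float:
    """How many times longer the slow run took than the fast one."""
    return slow_seconds / fast_seconds


# The figures from the GenBank test: 16 minutes versus 2 minutes.
print(slowdown(16 * 60, 2 * 60))  # 8.0

# Timing an arbitrary callable; here a trivial stand-in for a parser.
elapsed = time_call(sum, range(100_000))
print(f"{elapsed:.6f} seconds")
```

To compare two parsers, pass each one to time_call with the same input file and feed the two results to slowdown.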

The Verdict

In the words of Thomas the Tank Engine, BioPerl is a really useful engine. It solves so many everyday bioinformatics chores — reading sequences, accessing databases, running tools — that it’s destined to be a constant companion.

It’s easy to use once you get used to it, but a bear to learn, so make sure you allocate enough startup time after you open the box. Performance is also a concern — I’d worry about using it to process all of GenBank — so do some performance testing before committing to a big project.

My biggest complaint is that BioPerl isn’t complete. It doesn’t do everything I want. Hmm… since it’s an open source community project, that sounds like a problem we can all help fix.

Keeping Track of Your Toys

I was really disappointed by BioPerl’s database capabilities — more specifically, its ability to create and work with local databases. It doesn’t provide a general persistence mechanism: you can’t load a complicated mess of data into BioPerl and save this to a database in a straightforward manner. BioPerl’s mindset is that databases come from the outside world, and its job is simply to access them.

The best-developed database capability seems to be the component that deals with sequence annotations expressed in GFF format. This component is used by Lincoln Stein’s generic genome browser, GBrowse, which is part of the Generic Model Organism Database project. You create the database by loading GFF-formatted text files. If you set up the files correctly, you can then use BioPerl to retrieve annotated sequences from the database.

A major strength of BioPerl is its capacity to deal with indexed flat-file databases. If you have a file of GenBank sequences, BioPerl can build an index for it, and thereafter you can quickly retrieve entries by accession number and other search terms.
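The idea behind a flat-file index is worth a quick sketch. The Python below is not BioPerl’s indexing code; it shows one simple way to do it, assuming records that carry an ACCESSION line and end with `//`. It records the byte offset where each record starts, then seeks straight to that offset on retrieval. The function names and the two sample records are hypothetical, and it indexes by accession number only.

```python
import io
from typing import BinaryIO, Dict


def build_index(handle: BinaryIO) -> Dict[str, int]:
    """Map each accession number to the byte offset where its record starts."""
    index: Dict[str, int] = {}
    record_start = handle.tell()
    accession = None
    while True:
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"ACCESSION"):
            fields = line.split()
            if len(fields) > 1:
                accession = fields[1].decode("ascii")
        elif line.startswith(b"//"):
            if accession is not None:
                index[accession] = record_start
            # The next record begins right after the terminator line.
            record_start = handle.tell()
            accession = None
    return index


def fetch_record(handle: BinaryIO, index: Dict[str, int], accession: str) -> str:
    """Seek to a record's offset and read it through its `//` terminator."""
    handle.seek(index[accession])
    lines = []
    for line in iter(handle.readline, b""):
        lines.append(line)
        if line.startswith(b"//"):
            break
    return b"".join(lines).decode("ascii")


# Two made-up records standing in for a GenBank flat file.
FLAT_FILE = (
    b"LOCUS       HYPO0001\n"
    b"ACCESSION   HYPO0001\n"
    b"ORIGIN\n"
    b"        1 acgtacgt\n"
    b"//\n"
    b"LOCUS       HYPO0002\n"
    b"ACCESSION   HYPO0002\n"
    b"ORIGIN\n"
    b"        1 ggccggcc\n"
    b"//\n"
)

handle = io.BytesIO(FLAT_FILE)
index = build_index(handle)
record = fetch_record(handle, index, "HYPO0002")
print(record, end="")
```

With a real file you would open it in binary mode and save the index to disk, so that the scan is paid only once.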

This leads to a compound database strategy in which you put data to be browsed into a GFF database, while leaving more structured entries in indexed flat files. This is a bit kluge-y, but workable for small projects.

Another database option is bioperl-db. This is not part of the official release, but is available for download on the BioPerl site. Yet another choice is the Open Bioinformatics Database Access standard under development by a number of the open-bio groups. And, of course, you can always roll your own. — NG

BRINGING JOY TO THE WORLD

| Organization | Description | URL |
| --- | --- | --- |
| BioPerl, BioJava, Biopython, BioRuby | Bioinformatics modules in various programming languages: Perl, Java, Python, and Ruby. The groups talk to each other, but make no effort to be “equivalent” | http://www.bioperl.org/ http://biojava.org/ http://biopython.org/ http://bioruby.org/ |
| BioCORBA | Tools for accessing biological services on the Internet using CORBA | http://biocorba.org/ |
| Biodas | Distributed Annotation System: tools for sharing and integrating sequence annotations | http://biodas.org/ |
| BioMOBY | Tools for accessing biological web services | http://biomoby.org/ |
| GFF | General Feature Format: text file format for representing genome annotations and such | http://www.sanger.ac.uk/Software/formats/GFF/ |
| EMBOSS | Large collection of open source sequence analysis programs, developed by many authors, but centered at the Wellcome Trust Genome Campus | http://www.emboss.org/ |
| EnsEMBL | Genome annotation system developed by the European Bioinformatics Institute | http://www.ensembl.org/ |
| GMOD | Generic Model Organism Database: a joint effort of WormBase, FlyBase, MGD, SGD, and TAIR to develop common software for community databases | http://www.gmod.org/ |
| MGED | Microarray Gene Expression Data Society: standards for microarray data | http://www.mged.org/ |
| OBDA | Open Bioinformatics Database Access: coordinated effort by several open-bio groups to create a common sequence database format. This will allow programs written in BioPerl, say, to access databases created in BioJava | http://obda.open-bio.org/ |
| Open Bioinformatics Foundation | Umbrella group for many open-bio efforts | http://open-bio.org |
| OpenInformatics.Org | Open source bioinformatics advocacy group | http://www.openinformatics.org/ |
| Bioinformatics.Org | Provides computing resources for open source bioinformatics projects | http://bioinformatics.org/ |
| Free Software Foundation | Pioneers of the free software movement; runs the GNU software development project, which develops a lot of commonly used software, including most of Linux except the kernel | http://www.fsf.org/ |
| Open Source Initiative | Coined the phrase “open source.” More pragmatic than FSF and actively courts for-profit companies | http://www.opensource.org/ |

References
Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JGR, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka ED, Wilkinson M, Birney E. The bioperl toolkit: perl modules for the life sciences. Genome Res 2002 Oct;12(10):1611-8. PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12368254&dopt=Abstract