We’ve all heard about the flood of biological data coming our way. What strategies do we have to wade through the data and use it effectively? Writing programs is a very powerful way to study genome-scale biology, but some big projects require another set of tools: relational databases. In contrast to the more generic use of the term “database,” a relational database refers to a set of related tables that have been loaded into a database management system. This system holds these related tables and lets us ask questions in a language (Structured Query Language, or SQL) designed specifically for this purpose.
Have you ever wondered: Which genes show the most consistent or inconsistent expression? What biological process is most associated with highly expressed genes? What molecular function is associated with genes having long UTRs? OK, so maybe you haven’t — but we’re sure you could think of other even more illuminating (or perhaps just crazy) questions about your carefully collected data. If you had your favorite expression data, for example, together with related information like gene descriptions and Gene Ontology annotations, all in one database, you could readily look for patterns that might otherwise be very tricky to identify. It takes a while to get the database built, but once you have it, it’s all set to go for lots of different types of analyses.
Why should a biologist care about this? If your team isn’t using them already, relational databases can help you in at least three ways:
- Ask more complex questions
- Write faster code
- Run faster analyses
These advantages will be less obvious in a small dataset (for example, one that fits onto a single spreadsheet) than a larger one (such as all the expressed sequence tags for our favorite species, along with gene annotations). Nevertheless, a single-table database isn’t a bad place to start, especially if we’re designing it ourselves. Good database design takes a while to learn, so many people start by querying someone else’s database. This is also a good way to get started with SQL, which requires a somewhat different way of thinking than the logic of programming.
Quite interesting questions can often be answered with a single SQL statement, rather than through the longer process of writing a series of commands (in Perl, for example), which can take a good amount of time to write and debug. Even though many of these questions could also be asked programmatically, every time we want to look at the data, we’d have to (step 1) read a potentially very big file that holds and links all our data and then (step 2) select the linked data we’re looking for. With a relational database, our data are already indexed (step 1), so we can skip right to step 2 and get our answer more quickly.
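As a sketch of that difference, here is a hypothetical two-table setup (the table and column names, and the data, are invented for illustration) using SQLite through Python’s standard `sqlite3` module rather than MySQL, so it runs anywhere without a server. A single JOIN statement answers a question that a file-parsing script would need a loop, a lookup table, and a sort to accomplish.

```python
import sqlite3

# Build a tiny in-memory database with two linked tables
# (gene annotations and expression values -- purely hypothetical data).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE genes (gene_id TEXT PRIMARY KEY, description TEXT);
    CREATE TABLE expression (gene_id TEXT, sample TEXT, level REAL);
""")
con.executemany("INSERT INTO genes VALUES (?, ?)",
                [("YFG1", "our favorite gene"), ("YFG2", "another gene")])
con.executemany("INSERT INTO expression VALUES (?, ?, ?)",
                [("YFG1", "heat_shock", 8.2), ("YFG1", "control", 1.1),
                 ("YFG2", "heat_shock", 0.9), ("YFG2", "control", 1.0)])

# One SQL statement links the two tables, filters, and sorts:
# "which annotated genes are highly expressed under heat shock?"
rows = con.execute("""
    SELECT g.gene_id, g.description, e.level
    FROM genes AS g JOIN expression AS e ON g.gene_id = e.gene_id
    WHERE e.sample = 'heat_shock' AND e.level > 2
    ORDER BY e.level DESC
""").fetchall()
for gene_id, description, level in rows:
    print(gene_id, description, level)
```

The same JOIN syntax works in MySQL; only the connection setup would differ.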
The popular choices for open source and commercial database systems all include some set of SQL functionality. MySQL has become the most popular database system for bioinformatics, and it’s free for desktop and Linux systems. In the past several years, MySQL has added such features as nested queries (in which the answer to one query determines the surrounding query) and stored procedures (which let us store custom procedures and functions for better performance). Despite some early complaints about open source database software lacking the robustness required for large systems, Ensembl, the UCSC Genome Browser, and Gene Ontology all use MySQL to run their popular Web portals. Genome browser software like GBrowse (used at SGD, Flybase, and WormBase) also works thanks to MySQL databases containing genome features and sequence. To query complex databases — or simply to get around the row limit of most spreadsheet applications — a graphical tool like the MySQL Query Browser can be invaluable. The Query Browser shows the structure of our database tables and also displays data output next to the query that generated it.
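A minimal illustration of a nested query, again using SQLite via Python’s `sqlite3` so it is self-contained (the subquery syntax shown is the same in MySQL; the table and data are invented): the inner SELECT computes an answer that the surrounding query then uses.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE expression (gene_id TEXT, level REAL)")
con.executemany("INSERT INTO expression VALUES (?, ?)",
                [("YFG1", 9.0), ("YFG2", 1.0), ("YFG3", 5.0)])

# Nested query: the inner SELECT computes the mean expression level,
# and the outer query uses that answer to pick above-average genes.
rows = con.execute("""
    SELECT gene_id, level
    FROM expression
    WHERE level > (SELECT AVG(level) FROM expression)
    ORDER BY gene_id
""").fetchall()
```

Here the average level is 5.0, so only the gene strictly above that mean comes back.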
As you’d expect, it’s common to access databases on other computers. To do this, we still need the MySQL software (at least the “client”) on our own computer, but all the data can live elsewhere. The folks at Ensembl and UCSC Bioinformatics have been good enough to let anyone access their MySQL databases, but request that people don’t go overboard with monster queries or write any programs that hit their servers a gazillion times. As a result, except for a few quick queries, we usually prefer creating a local database, where a funky query that takes way too long to execute just slows down our computers instead of someone else’s.
The Ensembl, UCSC, and GO sites provide data for download already organized into tables, along with the SQL files that describe these tables. This makes setting up a local database of selected tables very easy. Once we get some experience with design issues from these databases, the next step is creating a custom database using our own or public data, like tab-delimited text files from NCBI Entrez-Gene or HomoloGene. Lots of bioinformatics data, such as files in GenBank format, may take some initial programming to organize into a database-friendly format. We may also want to remove some redundant data by splitting one file into multiple tables in a process called normalization — a totally separate process from the quantitative normalization performed on microarray data. In the end we’ll hopefully have a design (or, in database talk, a “schema”) that effectively describes the data in each table and the relationships among them. Creating a good schema requires a solid understanding of the data, which can take some careful study for inherently complex biological data. Also, the first version of our database isn’t necessarily going to be fast; we may need to modify tables by indexing one or more columns, and this can have a huge effect on query speed.
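The two steps mentioned above — normalization and indexing — can be sketched in a few lines of SQL. This is a toy example in SQLite via Python’s `sqlite3` (all names and data are hypothetical): a flat table that repeats the gene description on every expression row is split into a genes table and an expression table, and the join column is then indexed.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# A flat, redundant table: the gene description is repeated on
# every expression measurement for that gene.
con.execute("CREATE TABLE flat "
            "(gene_id TEXT, description TEXT, sample TEXT, level REAL)")
con.executemany("INSERT INTO flat VALUES (?, ?, ?, ?)",
                [("YFG1", "our favorite gene", "control", 1.1),
                 ("YFG1", "our favorite gene", "heat_shock", 8.2)])

# Normalization: move the repeated gene information into its own
# table, keyed by gene_id, leaving only the measurements behind.
con.executescript("""
    CREATE TABLE genes AS SELECT DISTINCT gene_id, description FROM flat;
    CREATE TABLE expression AS SELECT gene_id, sample, level FROM flat;
    DROP TABLE flat;
""")

# Indexing the column used for joins; on large tables this is often
# the single biggest factor in query speed.
con.execute("CREATE INDEX idx_expression_gene ON expression (gene_id)")

# Each gene's description now appears exactly once.
genes = con.execute("SELECT * FROM genes").fetchall()
```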
Using a relational database alone is very powerful, but things get really interesting when we access our database from a programming environment like Perl (with the DBI module) or R (with the RMySQL package). By embedding database connections and SQL in our code, we can generate a custom query on the fly, query the database, and further process the results. Using the DBI and CGI modules, a Perl script gets input from a Web form, creates a custom query, queries the database, and produces output as any sort of Web page we can imagine — all at the click of a button. A large fraction of Web tools both for bioinformatics and for other uses are built with this combination of programming and databases. For a biologist, if learning programming made us feel like we could rule the world (of biology, at least), combining that skill with database expertise will surely enhance our omnipotence.
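The column describes this pattern with Perl’s DBI; the same idea can be sketched in Python with the standard `sqlite3` module (the function, table, and data below are invented for illustration): build a parameterized query on the fly, run it, and then post-process the results in ordinary code.

```python
import sqlite3

def genes_above(con, cutoff):
    """Generate a custom query on the fly (the cutoff could come from
    a Web form), run it, and further process the results."""
    # The ? placeholder keeps user input safely separate from the SQL.
    rows = con.execute(
        "SELECT gene_id, level FROM expression "
        "WHERE level >= ? ORDER BY level DESC",
        (cutoff,),
    ).fetchall()
    # Post-processing after the query: format a short tab-delimited report.
    return [f"{gene_id}\t{level:.1f}" for gene_id, level in rows]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE expression (gene_id TEXT, level REAL)")
con.executemany("INSERT INTO expression VALUES (?, ?)",
                [("YFG1", 8.2), ("YFG2", 0.9), ("YFG3", 4.5)])

report = genes_above(con, 4.0)
```

A CGI script doing the same thing would simply take `cutoff` from the form input and emit the report as HTML.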
Fran Lewitter, PhD, is director of bioinformatics and research computing at the Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a bioinformatics scientist in Fran’s group.