As biology experiments become more quantitative and larger in scale, the need for streamlined analysis increases. At the same time, the breadth of the data — often genome-scale, rather than gene-scale — requires more statistics to separate the biologically interesting signals from the noise. Technology developers warn us that while large-scale experiments are getting faster and cheaper, figuring out what the results tell us is becoming more time-intensive.
Fortunately, software developers have been expanding the repertoire of analysis solutions for the most common types of data and questions. Nevertheless, as bioinformaticians and programmers, we find that we are in demand now as much as ever. How can we bridge the gap between bioinformatics and traditional biology? By making biologists more comfortable with available computational methods, and by creating easier-to-use programming tools. We'd like, for selfish reasons, to have more time to think about biology ourselves and, for altruistic reasons, to make biologists more independent in their computational analyses. From our perspective and experience, a little more software engineering and training seem to be steps in the right direction.
At times, a biologist may need to do what a computer scientist would find trivial: viewing, parsing, filtering, and sorting big text files. How about taking a file of mapped reads from a high-throughput sequencing run, pulling out all those mapped to the sex chromosomes, sorting them by position, and producing a simple summary? Software or memory limitations can make even opening a file of several million rows impossible. Given the prevalence of "big biology," we are finding it as useful as ever to know how to access one's data on a shared storage system through a command-line interface. Once past the initial hurdle of creating an account and logging in for the first time, people can go pretty far with a simple Linux command vocabulary. Sharing our bag of tricks (grep, sort, cut, uniq, head, and tail) with biologists can all of a sudden make them feel quite powerful. Every researcher's time is valuable, so it makes no sense to do something by hand if automation can generate results of the same quality, or better.
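As a sketch of that sex-chromosome example, here is how the pieces might fit together, assuming a hypothetical tab-delimited reads file with the chromosome in column 1 and the mapped position in column 2 (real aligner output has more columns, but the idea is the same):

```shell
# Build a tiny stand-in for a file of mapped reads:
# column 1 = chromosome, column 2 = position (a made-up layout for illustration).
printf 'chr1\t100\nchrX\t500\nchrX\t200\nchrY\t50\n' > reads.txt

# Keep only reads on the sex chromosomes, then sort by chromosome and position.
grep -E '^chr[XY][[:blank:]]' reads.txt |
  sort -k1,1 -k2,2n > sex_reads.txt

# A simple summary: how many reads landed on each chromosome?
cut -f1 sex_reads.txt | uniq -c
```

On a file of millions of rows the same three commands work unchanged, which is exactly the point.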
Once someone is hooked on the power of the Linux command arsenal, many of the same commands can be used with Windows (after installing Cygwin) or Mac operating systems.
The more ambitious new users can find plenty of challenges (and rewards) in Linux one-liners that string together pipes or commands like awk and sed. These look quite complex at first, but they are manageable if one starts from cookbook-style cheat sheets. Keeping a project's set of commands in a file makes it very easy to document and tweak the pipeline, and to remember recently learned tricks. This expertise in basic Linux command-line use is often all that separates biologists from a ton of great, publicly available bioinformatics tools.
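A typical one-liner of the sort we mean, on a hypothetical tab-delimited file of gene names and expression values (both the file layout and the column meanings are invented for illustration):

```shell
# Stand-in data: gene name in column 1, an expression value in column 2.
printf 'geneA\t10\ngeneB\t5\ngeneA\t20\n' > expr.txt

# One-liner: total the expression values per gene, then rank genes
# from highest total to lowest.
awk -F'\t' '{ sum[$1] += $2 } END { for (g in sum) print g "\t" sum[g] }' expr.txt |
  sort -k2,2nr
```

Pasting a line like this from a cheat sheet, running it, and then changing one piece at a time is a low-risk way to learn what each part does.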
Power of spreadsheets
Spreadsheet applications are better than ever at handling big files, and sometimes they are enough to get the job done. Instead of using the Linux "split" command to break a big file into 65,536-row chunks, we can now load a million rows at once. If the data fit into a spreadsheet, we can automate lots of processing needs with clever use of numerical and text functions. A current favorite function among our users is VLOOKUP, which lets them link data across different datasets, much like joining tables with SQL. Saving any output as a text file then makes subsequent processing by other tools more flexible. On the other hand, we sometimes try to keep data away from spreadsheets, as analysis methods are harder to document and replicate, and errors like pesky gene-to-date conversions are hard to fix. Also, spreadsheets are only a very partial solution for statistics, as much of inferential statistics (to get confidence levels and p-values) is well beyond their built-in functions. If you are using Windows and want to do real statistics from within your spreadsheet, it may be worth checking out the free application RExcel, a package that turns the familiar spreadsheet into a front end for R.
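Once the data are saved as text, that VLOOKUP-style linking can also be done on the command line with join. A sketch, using two invented tab-delimited files that share a gene identifier in their first column:

```shell
# Two hypothetical files exported from spreadsheet tabs, both keyed
# on a gene identifier in column 1 and already sorted on that key
# (join requires both inputs sorted on the join field).
printf 'geneA\t2.5\ngeneB\t0.8\n' > expression.txt
printf 'geneA\tkinase\ngeneB\trepressor\n' > annotation.txt

# Pair each gene's expression value with its annotation,
# much as VLOOKUP would pull a value across from another sheet.
join -t"$(printf '\t')" expression.txt annotation.txt
```

Unlike a spreadsheet formula, the command itself documents exactly how the two datasets were linked.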
Easy to use
Lots of great bioinformatics software is available, but most of it is tailored toward computer-savvy people, making it somewhat daunting for others to get started or to use. Becoming familiar with a command-line interface is a perfect first step. Individual applications, though, vary widely in how helpful their interfaces are. We like simple ones like blastall (for NCBI Blast): just type the command and you get a list and descriptions of all possible choices. Trying to format a database for Blast (with the formatdb command) is at the opposite end of the spectrum; typing formatdb by itself produces an error message with no suggestions for correct syntax. The EMBOSS molecular biology tools take a different approach: after entering a command, you are asked a series of questions about exactly what you want to do. This is helpful for getting up and running quickly, and the more verbose help documentation details the choices needed to build a non-interactive command.
When it comes time to develop our own tools, usually in Perl (although Python would be a fine alternative), we try to use interface conventions from other applications we know. We create every tool so that just running the command provides a summary of what it does, any input or output options, and a description of any input file formats. We also do this with R statistics scripts, thanks to the commandArgs function. We are often convinced we have the perfect final version of a tool, until a biologist using the script requests a change that had never occurred to us. To run a Perl or Python script, one normally needs to have the language installed, but it is even possible (as with the Perl PAR module) to create an executable that runs all by itself when clicked.
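The convention described above can be sketched in a few lines. This hypothetical script (the name, arguments, and file layout are all invented; in practice we would write the equivalent in Perl, Python, or R) prints a summary of what it does whenever it is run without arguments:

```shell
# A minimal script following the "run it bare, get the usage" convention.
cat > summarize_reads.sh <<'EOF'
#!/bin/sh
# With no arguments, describe the tool and its expected input, then stop.
if [ $# -lt 1 ]; then
    echo "Usage: summarize_reads.sh <reads_file>"
    echo "  Counts mapped reads per chromosome in a tab-delimited file"
    echo "  (column 1: chromosome, column 2: position)."
    exit 1
fi
cut -f1 "$1" | sort | uniq -c
EOF
chmod +x summarize_reads.sh

./summarize_reads.sh || true   # run bare: prints the usage summary and exits
```

The few minutes this costs per script are repaid every time a colleague (or our future selves) picks the tool up cold.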
We use only a few bioinformatics tools with stand-alone graphical interfaces, but good interfaces make a tremendous difference. Behind the sequence alignment program ClustalX, for example, is an alignment algorithm that often performs worse than others, but the program remains popular, we think, because it has such an effective interface and generates very attractive figures. We do not generally develop our own production-grade stand-alone programs, but we know enough to appreciate elegant, easily maintainable code wrapped in an intuitive, bug-free interface. Instead, we mostly work in applied bioinformatics and find ourselves using scripting, generally in Perl, R, or SQL, to customize public software and databases to our needs. The first generation of these scripts is designed to run on the command line, but it is not too difficult to turn them into Web applications.
Web tools for everyone
For certain bioinformatics and general needs, Web tools are the perfect solution: anyone can use them and, since the software stays on the server, it is easy to update. Several groups have created Web interfaces for tools like EMBOSS, originally created as a command-line suite. It is also easy to create basic Web front-ends for relational databases. EBI's BioMart, for example, is an elegant way to query a very complex set of databases without having to know SQL or the structure of the database. Our own database-driven Web tools have a much simpler (and less flexible) interface, but they work fine for a relevant set of query types.
Web tools are usually easier to use than their command-line counterparts, although they often lose some flexibility because developers tend to include only the most common options. Despite being easy to use, Web tools do have some drawbacks: the program needs to complete its task before the browser gives up waiting, or the user risks a time-out. Web applications can also run very slowly on large files, since input and output files need to be transferred across the network. Lastly, since Web applications run on our own servers, we need to design them very efficiently if we expect heavy use. A quick look at the Web Server issue of Nucleic Acids Research shows the wide variety of bioinformatics tools that can be run from a Web browser. One of our recent favorites is Penn State's Galaxy, which (among many other tasks) can analyze large-scale sequence data in the context of genome features.
From our perspective, changes on either side that bring together bioinformatics and experimental people are to be encouraged. We have been getting warnings about the arriving deluges of data for the past 10 years and have not gotten wiped out yet, so either the waves have been more gradual than expected or more of us are learning to swim.
Fran Lewitter, PhD, is director of bioinformatics and research computing at Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a senior bioinformatics scientist in Fran's group.