How can a biologist take advantage of the awesome power of bioinformatics? If you’re working on genome-scale biology, you may already be using bioinformatics. But how about your colleagues working in more traditional areas of biology? Ask a half-dozen different people how to get started, and you will get as many different answers.
We’d like to give you some ideas of how we train biologists to use computational methods. Becoming a bioinformaticist takes as much work as being an expert in any other area, but many biologists learn enough to get a good return on investment. With desktop computers and state-of-the-art free software, this is a good time to learn how to use computational methods to accelerate your biological research.
Where to start? Bioinformatics began decades ago with sequence analysis, and that’s still a good starting point. Combining theoretical background from a textbook with an all-purpose sequence analysis tool is a useful way to get both theory and practical experience. With such a tool, it’s easy to experiment with how, for example, the choice of a scoring matrix influences a sequence alignment. For a textbook, we prefer David Mount’s Bioinformatics, which is more approachable for many biologists than the more mathematically based options. Other good sources of background theory are the NCBI and the European Bioinformatics Institute tutorials. For a sequence analysis tool, we like EMBOSS, which is available for desktop computers (along with Web-based versions).
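To make the point about scoring concrete, here is a minimal sketch in Python of a global (Needleman-Wunsch) alignment. It is not how EMBOSS or any other package implements alignment. A simple match/mismatch/gap scheme stands in for a real scoring matrix, and the function name `global_align`, its parameters, and the two short sequences are made up for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). The match/mismatch values are a
    simplified stand-in for a full scoring matrix (an assumption of this sketch).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal, up, or left.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


# Same two (made-up) sequences, two different scoring schemes.
for params in ({"mismatch": -1, "gap": -2}, {"mismatch": -3, "gap": -1}):
    total, top, bottom = global_align("ACGT", "AGCT", **params)
    print(params, "score:", total)
    print(" ", top)
    print(" ", bottom)
```

With cheap mismatches and expensive gaps, the best alignment of these two sequences is ungapped (score 0). With expensive mismatches and cheap gaps, the best alignment instead opens one gap in each sequence (score 1). The sequences are the same, but the scoring changes the answer.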
Why type commands when you could just click away? Many bioinformatics tools, like EMBOSS, are available on the Web, but they’re designed to be run on the command line, and that is where you can harness their full functionality. This can be a daunting introduction, however, for people who have only used a computer through a point-and-click graphical interface. Nevertheless, the command line is an important part of bioinformatics: it gives you the power to run an analysis with all of its options and to process data in much larger quantities than is possible through a Web or graphical interface.
With Mac OS X, we use the Terminal to enter commands. Windows has a “Command Prompt” that’s better than nothing, but we always use Cygwin, a free suite of Linux-like tools, so we can run commands and process files just as we would on a Linux computer. Even though Linux tools have nothing to do with biology, knowing how to take advantage of them saves us a huge amount of time during the slicing and dicing of data that’s often part of the real analysis we’re trying to do.
Where’s the data? Perhaps surprisingly, a large part of bioinformatics is knowing where to find the best and most up-to-date data. What’s the difference between NCBI’s RefSeq and UniGene? What if you want genome coordinates for a bunch of genes -- do you need to figure that out yourself, or can you benefit from someone else’s mapping? How about annotations for the untranslated regions of genes? We all have stories of spending a long time processing data, just to find that the final result was already available to download.
For gene- and genome-based data, some good places to start are NCBI and EBI. A look at the annual Nucleic Acids Research (NAR) database issue can uncover a wide range of publicly available data. With published microarray experiments, you may know the data has been made public, but finding it can still take a lot of searching through repositories, journal sites, and investigator sites, sometimes only to discover that it still needs a lot of processing to be usable.
What can you do with the data? The published articles you read should give you some ideas for current bioinformatics analysis tools. For Web-based tools, the annual NAR Web server issue has more than 100 tools to choose from; many are also available for command line use. The articles are short, but many reference more detailed ones describing the algorithms and statistics that run when you click the “Submit” or “Go” buttons. Almost all published software is free and available for download. Installing software can range from effortless to frustrating, but it’s certainly worth a try. We used to be scared away by any download that ended in “.tar.gz,” but we’ve found that often all it takes is following the directions in the README file.
NCBI Blast, an easy first software download, is a great way to discover how to install very useful software, create your own Blast-able databases, and search them using commands tailored to your exact specifications. With NCBI Blast, you can download an executable file that’s essentially the program ready to run on your computer. Other software installations may require compiling source code, which can be a lot trickier. When in doubt, follow the directions and cross your fingers.
How about programming? Being a programmer is not a prerequisite to doing bioinformatics, but as you get more involved, you’ll probably come up with new ideas that can only be implemented with some kind of programming or scripting. Both terms refer to any way in which you can combine a series of commands to execute together.
First, it may be worth exploring the commands in your favorite spreadsheet program, as we’ve found many useful ones for processing numerical and text data, including logical operations. Actual programming lets you do something that might not otherwise be possible, or automate something that would be really boring to do by hand. Regardless of your reasons for wanting to learn a programming language, a small investment in learning even a little can pay off within days.
So what language? We prefer Perl, primarily because writing short programs is fast, and there is already a lot of code (like BioPerl) for analyzing biological data, so we don’t have to start from scratch. Python is another popular option, whereas languages like Java and C++ are even more powerful but take longer to learn and write. Simply finding people who will volunteer to help you learn their favorite language could aid you in making that decision. In case you’re wondering, relational databases and SQL (the language to query them) can be very powerful but are generally less of an initial priority than programming.
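To give a sense of how short a useful program can be, here is a minimal sketch in Python, the other option mentioned above (the same idea could be written just as briefly in Perl). It reads FASTA-formatted text and reports the length, GC content, and reverse complement of each sequence. The function names and the two example sequences are invented for illustration, and the sketch assumes plain DNA using only the letters A, C, G and T.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Translation table for complementing bases; assumes only A, C, G, T.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


# Made-up example data, not real sequences.
EXAMPLE = """>seq1 made-up example
ATGCGC
GGTA
>seq2 made-up example
ttaaccgg
"""

for header, seq in parse_fasta(EXAMPLE):
    name = header.split()[0]
    print(name, len(seq), f"{gc_content(seq):.2f}", reverse_complement(seq))
```

Pointing the same loop at the contents of a file with thousands of sequences takes no extra effort, which is exactly the kind of by-hand tedium a short script removes.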
Interested in a certain specialty? Bioinformatics has developed into a very broad field, and even bioinformaticists generally don’t know something about everything. Becoming more expert in a specific area, such as protein structure analysis or microarray analysis, may be more relevant. These emphases can be largely tackled by learning the theory and practice behind a single, albeit quite complex, suite of tools, along with the prerequisite chemistry or statistics.
Fran Lewitter, PhD, is director of bioinformatics and research computing at Whitehead Institute for Biomedical Research. This column was written in collaboration with George Bell, PhD, a bioinformatics scientist in Fran’s group.