We all know that large-scale biology means large-scale data, but unless you’ve got a degree in computer science in your back pocket, making sense of that data can be downright intimidating. Just how much computer programming experience and skill does today’s biologist need, and what are the most useful tools to have?
“I think any one of us can see that biology itself is becoming increasingly technological,” says Dana-Farber’s John Quackenbush of the high-level ’omic experiments that are making their way into virtually all labs. “What’s really happened through genomics is that biological sciences is becoming an information science.” A physicist by training, Quackenbush has used his programming bent to help build MeV, a microarray data analysis software package.
The problem, he says, is that this influx of data does not necessarily bring with it an innate sense of how to handle it. While Excel might have sufficed in the past, it’s not the best tool for manipulating large, complex data sets effectively. For that, real programming knowledge is essential.
First, Quackenbush advises students, and indeed all biologists, to learn SQL or another basic database query language. That knowledge is especially useful, he says, when you need to compare data sets from separate experiments against one another. Another useful tool is a scripting language like Perl or Python, which lets researchers easily extract information from databases and reformat it to suit their needs. Quackenbush also recommends learning a statistical programming language such as R. And for serious computation, he adds, knowing C, C++, or Java is the way to go.
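To make that advice concrete, here is a minimal sketch of the workflow Quackenbush describes: a SQL join that lines up two experiments gene by gene, followed by a few lines of scripting that reformat the result. It is an illustration only, not code from his lab. The table names (`exp_a`, `exp_b`), the helper functions, and the expression values are all invented for the example, and Python's built-in `sqlite3` module stands in for whatever database a lab actually runs.

```python
import sqlite3


def shared_genes(experiment_a, experiment_b):
    """Return (gene, value_a, value_b) for genes measured in both experiments.

    Each argument is a list of (gene, log_ratio) pairs. The tables are
    hypothetical and live only in memory.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE exp_a (gene TEXT PRIMARY KEY, log_ratio REAL)")
        conn.execute("CREATE TABLE exp_b (gene TEXT PRIMARY KEY, log_ratio REAL)")
        conn.executemany("INSERT INTO exp_a VALUES (?, ?)", experiment_a)
        conn.executemany("INSERT INTO exp_b VALUES (?, ?)", experiment_b)
        # The inner join keeps only genes that appear in both tables.
        rows = conn.execute(
            "SELECT a.gene, a.log_ratio, b.log_ratio "
            "FROM exp_a AS a JOIN exp_b AS b ON a.gene = b.gene "
            "ORDER BY a.gene"
        ).fetchall()
    finally:
        conn.close()
    return rows


def to_tsv(rows):
    """Reformat joined rows as tab-separated text with a header line."""
    lines = ["gene\texp_a\texp_b"]
    for gene, value_a, value_b in rows:
        lines.append(f"{gene}\t{value_a:.2f}\t{value_b:.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    first = [("TP53", 1.8), ("BRCA1", -0.4), ("MYC", 2.1)]
    second = [("TP53", 1.5), ("MYC", 2.4), ("EGFR", 0.9)]
    print(to_tsv(shared_genes(first, second)))
```

Run as a script, this prints one line for each gene found in both invented data sets. The same join would be tedious to do by hand in a spreadsheet once the lists grow to thousands of genes.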
David Schwartz, director of the Laboratory for Molecular and Computational Genomics at the University of Wisconsin, Madison, also has experience in this area. His team has developed its own tools and algorithms for doing single-molecule measurements, and lab members regularly tap into freely available databases like the UCSC Genome Browser and the Human Structural Variation Database. When considering what biologists need to have in their toolbox, Schwartz says you can’t go wrong with “programming, data structures, statistics, and a hard-core course on bioinformatics. Basically, you need programming skills for manipulation and parsing of biologically relevant files, and their visualization. More importantly, one uses programming skills that allow you to perform statistical tests.”
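For readers wondering what "parsing of biologically relevant files" and a statistical test look like in practice, the sketch below reads sequences in the common FASTA format, computes a simple summary (GC content), and calculates Welch's t statistic for two groups of measurements. It is not taken from Schwartz's lab. The function names are hypothetical, the test statistic is one standard choice among many, and turning it into a p-value would need a statistics package such as R or SciPy.

```python
import io
import math
import statistics


def read_fasta(handle):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record name is the first word after the ">" marker.
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def gc_fraction(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances.

    Each sample needs at least two values, and the samples must not both
    have zero variance.
    """
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error


if __name__ == "__main__":
    # A tiny invented FASTA file, held in memory instead of on disk.
    demo = io.StringIO(">seq1 demo record\nATGC\nGGCC\n>seq2\nATAT\n")
    for record_name, sequence in read_fasta(demo).items():
        print(record_name, len(sequence), gc_fraction(sequence))
    print(welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

The parsing and the arithmetic are each only a few lines long. Most of the work in real analyses lies in choosing the right test and checking its assumptions, which is where the statistics coursework Schwartz recommends comes in.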
But biologists shouldn’t sweat it if programming and building software aren’t at the top of their to-do lists. In fact, Quackenbush advises biologists first and foremost to take advantage of the resources that already exist, which includes making time to discover what’s inside those databases. “I talk to people here at Dana-Farber all the time and discover that they don’t necessarily know what databases are available, what data analysis tools are available, and even what the best way to approach a large analytical problem is,” he says. For many experiments, he adds, Excel really is all you need. And no matter how much programming you’ve picked up along the way, your best bet is to find a good partner: someone whose data management and computational skills complement your scientific ones. “None of us can really do everything,” Quackenbush says.
Still afraid to crack a book or dial the nearest college to enroll in its Introduction to Perl class? Not to worry, says Quackenbush, a teacher who’s been at the game for many years. Learning a first programming language can be difficult, but once that’s out of the way, the semantics and how-to of coding become easier to apply to other languages. Plus, he adds, “I think all of these things are things people can learn if they’re presented properly.”