Do university programs equip new bioinformaticians for the real world? Nat Goodman investigates.
It’s back-to-school time. Let’s join the students, metaphorically speaking, and spend an imaginary year or two at Bioinformatics U. It’s not a party school, that’s for sure. You’re going to learn backpacks full of stuff from fields like biology, biotechnology, computer science, software engineering, mathematics, and statistics, as well as bioinformatics itself.
I’m going to review the curricula of several bioinformatics graduate programs and then ask how well these programs train students for the real world. As we’ll see, most programs lack a few key elements, and it may be necessary for industry to fill in the blanks.
A decent proposal
A good starting point for thinking about bioinformatics curricula is “A Curriculum for Bioinformatics: The Time Is Ripe,” published by Russ Altman in Bioinformatics in August, 1998. Altman, who directs Stanford’s Biomedical Informatics Training Program, proposed a roughly master’s-level curriculum including topics from biology, computer science, math and statistics, ethics, and what he called “core” bioinformatics.
From biology, he proposed that students be taught “the basic theoretical constructs … as well as a sense of how biological experimentation is done.” From computer science, he proposed teaching programming, data structures, algorithms, and databases. From math and statistics, he recommended basic probability theory, experimental design and analysis, stochastic processes, numerical and combinatorial optimization, classification, machine learning, and statistical inference. From ethics, he included study of technology’s effect on society and privacy.
The core bioinformatics content that Altman proposed was really a laundry list of what were then hot topics: sequence analysis and annotation, RNA and protein structure analysis (including structure prediction), evolution, databases, and laboratory informatics (LIMS and such). This list is obviously outdated today, a mere three years after it was proposed, as it omits SNPs, microarrays, proteomics, EST clustering, cross-genome comparisons, and other topics that are hot now. I believe that Altman intended the list to change as the field advanced.
Altman’s proposal really drives home the point that what we call “bioinformatics” is only a small portion of what a bioinformatician has to know. He suggested that only three of 18 courses (17 percent) be allocated to these topics. Much of this material comprises techniques that are important today, but may be forgotten tomorrow.
The rest of our intellectual heritage comes from disciplines outside the field. These disciplines are too vast for us to have any hope of covering even their most salient content. Instead, what we get from these fields is a foundation to help us think about the problems that arise in our own field.
A key issue is how to divide course time between specific bioinformatics content and foundational material from outside. If we spend too much time on current bioinformatics techniques, we run the risk of turning out students who are experts on obsolete methods. But if we spend too little time, our students won’t be able to do anything useful without yet more training.
I applaud Altman for including ethics in his proposal. Few real programs have taken this up.
Example curricula
I looked at several of the academic bioinformatics programs listed in the careers feature in this issue (see p. 52).
The University of Pennsylvania offers a master’s degree in biotechnology with a concentration in bioinformatics. Students take courses in biochemistry, molecular biology, biotechnology (lecture, laboratory, and seminar), computer science (two courses), statistics (two courses), bioinformatics (two courses), and one elective. The bioinformatics courses cover sequence analysis, evolution, databases, maps, gene expression analysis, and other topics. The computer science courses are drawn from programming languages, algorithms, databases, and advanced databases. In total, this program looks a lot like Altman’s proposal, but with considerably more emphasis on biology, fewer unconstrained electives, and no ethics requirement. Also, the total number of courses is more realistic for a master’s program.
The University of California at Santa Cruz is planning to offer a master’s degree in bioinformatics. Students are required to take courses in molecular biology, statistics (emphasizing stochastic methods), bioinformatics (two courses), and ethics. The bioinformatics courses cover sequence analysis, structure analysis, genomic databases, evolution, and gene prediction. Students also select four electives from biology, biotechnology, biochemistry, computer science, software engineering, and math and statistics.
Georgia Tech offers a master’s degree in bioinformatics. There are no required courses as such, but the recommended curriculum includes two courses in each of genetics and biochemistry, one in statistics, one in mathematical modeling, two in bioinformatics, and one in artificial intelligence. Students are also advised to choose one course from each of three pairs: biophysics or graph theory, databases or scientific visualization, and protein structure or drug discovery. The list of other recommended courses includes more biology, numerical and combinatorial optimization, algorithms, parallel computing, graphics, and legal issues.
The University of California at San Diego is starting a PhD program in bioinformatics this fall. Students are required to take three courses in bioinformatics covering biological data and analysis tools, sequence analysis, structure analysis, and pathways. Also required is a statistics course. Students must select five electives from the following eight areas: biochemistry; molecular genetics; cell biology; data structures; algorithms; information retrieval, databases and data mining; mathematics and statistics; physics and engineering.
George Mason University offers a PhD in computational sciences and informatics with a concentration in bioinformatics and computational biology. Students take four courses in computational science, covering numerical methods, foundations of computer science, scientific and statistical visualization, and scientific databases. (I’ve classified this in the table as two computer science and two math and statistics.) They also take three courses in bioinformatics covering sequence analysis, structure analysis, and genomics. Students are allowed five electives drawn from computer science, statistics, and biology. Of note: This program requires no biology courses, and lists just one biology course among the recommended electives.
The University of California at Los Angeles offers a bioinformatics track that can be added to PhD programs in a variety of departments. The student has to apply and be admitted to one of these departments and then fulfill the degree requirements of that department. The bioinformatics program adds on to these requirements. Students are required to take three courses covering biochemistry, genetics, molecular and cell biology, genomics, algorithms, databases, information theory, statistics, and bioinformatics. In addition, each student must take three courses that complement his host program. For example, a student whose host program is in biology would take additional courses in math or computer science, while a student from math or computer science would take more courses in biology.
The University of Washington and the Fred Hutchinson Cancer Research Center jointly offer a computational molecular biology track that can be added to PhD programs in a variety of departments. There is only one required course: a two-quarter course that covers sequence analysis, structure analysis, evolution, and gene finding, with a strong emphasis on the mathematical foundations of these topics. In addition, students are required to do a rotation that involves hands-on wet-laboratory work coupled with data analysis.
Pop quiz
Let’s take a pop quiz to see how well these curricula prepare students for the real world of bioinformatics. The core job of most practitioners is to develop software on behalf of biological researchers.
Imagine you’re a newly minted bioinformatician working in a biology research department at a pharma, and one of the scientists in your department makes the following request: “I’m generating a bunch of sequences. Tell me what they are.”
This sounds easy enough. Your training has covered sequence analysis in depth, and you know that BLAST is the standard program for analyzing short sequences. So, your first thought is to write a small program — a Perl script most likely — that takes the user’s sequences and runs them through BLAST. Of course, the really clever student will realize that many such programs already exist, including NCBI’s MegaBLAST program itself. Let’s ignore this shortcut for purposes of the quiz.
Sorry, but that answer is worth a C-. It doesn’t really solve the user’s problem.
Here’s some of what you need to cover to get an A+. After running BLAST, you have to store the results as files or in a database where you and the user can find them later. He said he’s generating sequences, so it’s a good bet that he’ll have new ones for you every day or so. Unless you’re planning to spend your life running BLAST for this guy, you’re going to need an automated system to process his new sequences as he generates them.
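To make the automation point concrete, here is a minimal sketch in Python of the bookkeeping such a system needs: it remembers which sequence files it has already handled and processes only the new ones. The directory layout, the `.fasta` extension, the log file, and the `analyze` callback (which would wrap the actual BLAST run) are all assumptions made for illustration, not part of any real pipeline.

```python
from pathlib import Path


def process_new_files(inbox, done_log, analyze):
    """Run `analyze` on each FASTA file in `inbox` not yet listed in `done_log`.

    `inbox` and `done_log` are hypothetical locations; `analyze` is any
    callable that takes a file path (for example, a wrapper that runs BLAST).
    Returns the names of the files processed on this pass.
    """
    inbox = Path(inbox)
    done_log = Path(done_log)

    # Names of files handled on earlier passes, one per line.
    seen = set(done_log.read_text().splitlines()) if done_log.exists() else set()

    processed = []
    for path in sorted(inbox.glob("*.fasta")):
        if path.name in seen:
            continue
        analyze(path)
        # Record success immediately so a crash does not cause reprocessing
        # of files that were already analyzed.
        with done_log.open("a") as log:
            log.write(path.name + "\n")
        processed.append(path.name)
    return processed
```

A scheduler (cron, say) could call this function once a night. A production system would also need to handle failed runs, and that is exactly the kind of software engineering the quiz is probing.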
You have to do something to present the results to the user, because he probably doesn’t want to paw through reams of unprocessed BLAST output. You’ll probably have to process the BLAST output to extract the description lines, the scores and p-values, and the alignments, and store this information in a searchable database. He’ll probably want to see the results sorted by score or p-value, and to do text searches on descriptions.
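As a rough illustration of that step, the Python sketch below parses tab-separated BLAST hits, loads them into a SQLite table, and answers the two queries just mentioned: best hits for a sequence, and a text search on descriptions. It assumes tabular output with the twelve standard columns followed by a description column; the column names, table layout, and sample rows are invented for the example.

```python
import sqlite3

# Made-up sample hits: 12 standard tabular fields plus a description.
SAMPLE = "\n".join([
    "seq1\tsubjA\t98.5\t200\t3\t0\t1\t200\t50\t249\t1e-100\t370.0\thypothetical kinase",
    "seq1\tsubjB\t85.0\t180\t27\t0\t5\t184\t10\t189\t2e-40\t160.0\thypothetical phosphatase",
    "seq2\tsubjC\t91.2\t150\t13\t0\t1\t150\t1\t150\t5e-60\t230.0\thypothetical kinase-like protein",
])


def parse_hits(text):
    """Convert tab-separated hit lines into tuples with numeric fields typed."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        if len(f) != 13:
            raise ValueError(f"expected 13 fields, got {len(f)}: {line!r}")
        hits.append((f[0], f[1], float(f[2]), *map(int, f[3:10]),
                     float(f[10]), float(f[11]), f[12]))
    return hits


def load_hits(conn, hits):
    """Create the (hypothetical) hits table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits ("
        "query TEXT, subject TEXT, pct_identity REAL, align_len INTEGER, "
        "mismatches INTEGER, gap_opens INTEGER, q_start INTEGER, q_end INTEGER, "
        "s_start INTEGER, s_end INTEGER, evalue REAL, bit_score REAL, "
        "description TEXT)"
    )
    conn.executemany("INSERT INTO hits VALUES (" + ",".join("?" * 13) + ")", hits)
    conn.commit()


def best_hits(conn, query_id, limit=5):
    """Return the strongest hits for one query sequence, lowest E-value first."""
    return conn.execute(
        "SELECT subject, evalue, bit_score, description FROM hits "
        "WHERE query = ? ORDER BY evalue ASC, bit_score DESC LIMIT ?",
        (query_id, limit),
    ).fetchall()


def search_descriptions(conn, term):
    """Return hits whose description contains `term`, highest score first."""
    return conn.execute(
        "SELECT query, subject, bit_score, description FROM hits "
        "WHERE description LIKE ? ORDER BY bit_score DESC",
        (f"%{term}%",),
    ).fetchall()
```

With the sample rows loaded into `sqlite3.connect(":memory:")`, `best_hits(conn, "seq1")` lists `subjA` ahead of `subjB`. Alignments are left out here; storing them would mean parsing a richer output format than the tabular one assumed above.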
For sequences that hit known genes, he may want you to pull information from RefSeq, while for sequences that hit known ESTs, he may want information from UniGene. Naturally, he’ll want to search and sort on this information, too.
So far we’ve ignored the question of whether BLAST is the right tool for the job. If the sequences are from human or another well-sequenced organism, you may want to align them to the genome, attempt to construct gene models, and analyze those models using Interpro or some other motif-finding system.
An A+ answer will be a large, complex system containing thousands of lines of software. Most of this has little to do with the stuff you learned in Bioinformatics U. The only “core” bioinformatics question was whether BLAST was the right tool, and we got pretty close to the end before that came up!
To make the grade
Most bioinformaticians are software developers. They need training in software engineering, a discipline closely allied with computer science that studies the practice of software development. They also need training in the pragmatics of computer technology, for example, to understand when to use Perl vs. Java or Linux vs. Solaris.
It’s unrealistic to expect academic programs to add these courses. For one thing, their curricula are already overstuffed. And few academics are terribly interested in the practicalities of software development.
A better solution is for industry to raise its hand. A consortium of companies could commission the necessary courses and teach them to graduates of Bioinformatics U. when they start working. This seems better than letting new grads learn it the hard way: by making mistakes on the job!
SELF-CONTAINED BIOINFORMATICS TRAINING PROGRAMS
Name | Degree | Biology Courses | Computer Science | Math & Statistics | Bioinformatics | Other & Unspecified Electives | Total
Altman proposal | | 3 (17%) | 4 (22%) | 3 (17%) | 3 (17%) | 5 (28%) | 18
Georgia Tech | MS | 5-6 (42-50%) | 1-2 (8-16%) | 2-4 (16-33%) | 2 (16%) | | 12
University of California, Santa Cruz | MS | 3 (30%) | | 1 (10%) | 2 (20%) | 4 (40%) | 10
University of Pennsylvania | MS | 5 (45%) | 2 (18%) | 2 (18%) | 2 (18%) | | 11
George Mason | PhD | | 2 (17%) | 2 (17%) | 3 (25%) | 5 (42%) | 12
University of California, San Diego | PhD | 0-3 (0-33%) | 0-3 (0-33%) | 1-2 (11-22%) | 3 (33%) | 0-1 (0-11%) | 9