AT A GLANCE: Holds a PhD in computer science from Harvard University.
Co-developer of the Glimmer microbial gene finding system and the Mummer system for whole-genome alignment.
Hobbies include playing tennis and golf. Greatest excitement comes from spending time with his wife and two daughters.
Q Where will bioinformatics be in two years? Five years?
A In two years we will be deep into the analysis of the human genome, trying to identify which of thousands of unconfirmed gene predictions are real. One very important method for doing that will be comparison with the mouse genome sequence. In five years I hope we’ll be seeing large amounts of DNA sequence data from other humans and other primates. This will open up a new world of detailed exploration of the specific genes that explain human variability and human disease.
Q What are the biggest challenges facing bioinformatics?
A Perhaps the biggest challenge is enabling the next generation of biological scientists to use all the data emerging from sequencing projects. Biology has become an information science. The laboratory skills are still key, but computational sophistication is rapidly becoming an essential skill. Advanced training in biological sciences should start emphasizing – even requiring – at least some training in computer science.
Q What are the bioinformatics challenges for TIGR?
A We are developing new methods for annotating eukaryotic genomes, which in many ways are more difficult to annotate than prokaryotic genomes because the genes have complex exon-intron structures and are much more spread out than in bacterial genomes. Repetitive sequence is a much greater problem, too. Another big challenge is the growing demand for annotation of incomplete genome sequences. Finding genes in such sequences is very tricky, because many if not most genes will be split across multiple contigs. Attaching functional information to those genes is even harder. One of our first efforts is our recently released preliminary annotation for chromosomes 10, 11, and 14 of the malaria parasite. We are also moving into microarray analysis, with multiple experiments ongoing in bacteria, mammals, and plants.
Q Where does your funding come from?
A Nearly all our funding comes from US government grants, primarily from the National Institutes of Health, the National Science Foundation, the Department of Energy, the Department of Defense, and the US Department of Agriculture. We have some 30 genome sequencing projects all proceeding in parallel in our sequencing labs.
Q What hardware do you use?
A We use Compaq Alphas for our largest computing needs, in particular for sequence assembly. Compaq has been generous enough to lend us a 16-gigabyte Alpha ES40, a superb machine. We also have four Alpha 4100s with 8 GB each. For general-purpose computing, Web services, and database servers, we have 10 different Sun UltraSparcs. For large-scale Blast searches and Hidden Markov Model searches, we have slightly fewer than 100 Linux boxes configured as a cluster and controlled by Oak Ridge National Laboratory’s Parallel Virtual Machine software and the University of Wisconsin’s Condor system. Most of our desktop machines in the bioinformatics department are now Linux boxes.
Q How would you compare the quality of publicly available and commercially available bioinformatics products?
A The commercial software we use tends to be very good, but for most of our specialized bioinformatics applications we have to “roll our own.” Obviously we use Blast (WU-Blast, actually), and for standard things like network, database, and Web services, we go with commercial products. Our more than 350 databases are all in Sybase.
Q How is the bioinformatics unit organized within the framework of the organization?
A Bioinformatics is one of TIGR’s largest departments, with over 60 staff members. The IT department sits inside bioinformatics and works closely with many of the bioinformatics analysts and engineers. In addition to IT, we have groups within bioinformatics working on annotation, sequence assembly and closure, microarray analysis, gene indices, and basic research.