AT A GLANCE: Holds Ph.D. in genetics from Cambridge University, undergraduate biochemistry degree from Oxford University. Began working for EBI in January, running the Ensembl project. Enjoys running and cooking.
Q: Where will bioinformatics be in five years? Ten years?
A: I suspect part of bioinformatics will just disappear into general research biology, much as molecular biology has. Gathering datasets, manipulating and interpreting them will become part of everyday life. But on top of that I see the field still growing and probably merging with other “informatics” fields, such as econometrics or aspects of social studies. Bioinformatics is the best example in my mind of “applied computer science” and I think we will be leading this field for a while.
Q: What are the biggest challenges bioinformatics must overcome?
A: There are many: Large data sets, complex data sets, heterogeneous data sets. Different parts of bioinformatics stretch different aspects. Storage, compute resource, algorithmic ability or simple, straightforward software engineering are all problems for some areas. Undoubtedly the biggest problem is getting skilled people into the field. It is not about bringing in biologists or computer scientists any more; it is about training real bioinformaticists, regardless of their background.
Q: What hardware do you use?
A: At the Hinxton campus (where both the Sanger Center and European Bioinformatics Institute are) there is a mixture of Compaq Alpha, SGI, Sun Microsystems, and Linux boxes. The largest compute and storage systems are built from Compaq Alphas, but we have a number of cost-effective Linux farms as well.
Q: Which databases do you use? Public, proprietary or third party?
A: Being one of the main international sites for databases, the EBI hosts a large number of public domain content databases. These are managed in a variety of implementations: Oracle for the large, primary archive databases, such as the European Molecular Biology Laboratory data library, which has been stably managed at the EBI now for over a decade. SRS plays a role in managing smaller databases. Inside Ensembl we use the open source relational database MySQL heavily. MySQL for us handles the throughput well, is easy to administer and can run on laptops, giving everyone a development environment they can take home. I wouldn’t use MySQL for everything, however; its lack of transactions and foreign key constraints would scare someone concerned about watertight data integrity. I expect MySQL to improve steadily over the next couple of years in this data integrity area with the announcement that MySQL is being re-licensed under the GNU General Public License.
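For readers less familiar with the two integrity features mentioned in this answer, here is a minimal sketch of what they guard against. It is an illustration only, not Ensembl's schema or code: the `gene` and `transcript` tables are hypothetical, and Python's built-in `sqlite3` module stands in for a database server so the example runs anywhere.

```python
import sqlite3


def demo():
    """Show a foreign key constraint and a transaction rollback.

    The gene/transcript tables are hypothetical, chosen only for illustration.
    """
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE transcript ("
        " transcript_id INTEGER PRIMARY KEY,"
        " gene_id INTEGER NOT NULL REFERENCES gene(gene_id))"
    )
    conn.commit()

    # Foreign key constraint: a transcript pointing at a gene that does
    # not exist is rejected instead of being stored as an orphan row.
    try:
        conn.execute("INSERT INTO transcript VALUES (1, 999)")
        orphan_rejected = False
    except sqlite3.IntegrityError:
        orphan_rejected = True
    conn.rollback()

    # Transaction: the two inserts succeed together or not at all.
    # The second insert fails, so the first is rolled back as well.
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute("INSERT INTO gene VALUES (1, 'example_gene')")
            conn.execute("INSERT INTO transcript VALUES (10, 2)")  # no gene 2
    except sqlite3.IntegrityError:
        pass

    gene_count = conn.execute("SELECT COUNT(*) FROM gene").fetchone()[0]
    conn.close()
    return orphan_rejected, gene_count


if __name__ == "__main__":
    print(demo())
```

Without these two features, the orphan transcript would be stored silently and the half-finished pair of inserts would leave a gene behind with no matching transcript; those are the kinds of gaps someone worried about watertight data integrity would have to close in application code instead.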
Q: What bioinformatics software do you use? Do you use in-house developed or third party software?
A: I use a lot of open source bioinformatics software. Open source software is a great fit for bioinformatics as it is hard to provide “one size fits all” software for bioinformatics, and in any case, the real value is in the data, not the software. Ensembl is a big Perl system at the moment, and sits on top of Bioperl as a base-level bioinformatics library. We are planning to transition over to Java, again using the open source BioJava project as our base-level library. Ensembl itself uses many pieces of academic software, such as Genscan, Est2Genome and GeneWise.
Q: How do you integrate your data?
A: With hard work and good algorithms! There is no magic bullet for data integration, just understanding and sweat.
Q: How large is your bioinformatics staff? Is the organization hiring additional bioinformatics staff?
A: I think the Hinxton campus has the largest number of bioinformatics staff anywhere in the world, with upwards of 200 people doing bioinformatics as a central part of their work. Ensembl is growing as well.