Head of the informatics team, human genome-analysis group, and Ensembl project
Wellcome Trust Sanger Institute
Tim Hubbard leads the informatics team and is head of the human genome-analysis group at the Wellcome Trust Sanger Institute in Hinxton, UK. He is also team leader of the Ensembl genome browser project at the Sanger Institute, complementing the work of Ewan Birney at the European Bioinformatics Institute.
Hubbard spearheads a host of other projects at Sanger, including development of the Distributed Annotation System in collaboration with Lincoln Stein’s group at Cold Spring Harbor Laboratory, the Encyclopedia of DNA Elements project, and a research effort to develop motif-prediction algorithms for annotating vertebrate genomes.
BioInform caught up with Hubbard by phone to discuss a number of projects he’s working on. What follows is an edited version of the interview.
What should we know about Ensembl these days?
Ensembl’s been very successful as a joint project between Sanger and EBI, and continues as ever. [With] every release we add more functionality and add more data. But the broader thing [is that] Ewan [Birney] now has all of the sequence databases at EBI to deal with, and if you look at the EMBL archives, and things like EMBL, GenBank, DDBJ, there isn’t a genome browser for a lot of that data. And so basically the plan now is to expand the Ensembl brand and create this overarching [system]. So we [will] keep Ensembl just as it is, but we then have a collective thing called Ensembl Genomes.
And we will have Ensembl Bacteria, which will be an Ensembl for bacteria; Ensembl Plant; Ensembl Metazoan; and basically … Ensembl, the standard Ensembl, which is vertebrate, right down to chordate, will stay exactly the same where we deliver the full genebuild, everything. That’s our core mission.
The idea is there’ll be collaborations between the kind of engineering stuff that’s been developed in Ensembl and groups that have expertise in the annotation who will be outside this campus.
The advantage of this is that it provides a front end for all of that sequence, across basically the whole of biology, eventually. So in terms of having browsers, [the plan is] to have them at least on the organisms where there's a community. There is a problem that with the new [sequencing] technology it's going to become cost effective to sequence anything in biology, but not necessarily cost effective to annotate it. ... I think everything will get sequenced. But [for] a lot of things, there won’t be a community doing experiments, and we just can’t invest in doing annotation on all of these things.
So there will be a leveraging of the Ensembl infrastructure, which handles comparative genomics. Ewan’s working on a graph structure for just handling all that sequence and relating it so that you’ll be able to see all these other organisms [that] aren’t annotated. You’ll be able to see how that comparatively relates to the organisms [that] are annotated.
What’s your timeframe for this? What’s your deadline?
This is a big structural change. I think they plan to have something in at least bacteria sometime next year. But, it will take a while. It partly involves getting external funding because [in] some of these areas there just isn’t enough funding right now to support creating browsers.
So we have the framework. We have the commitment to try and organize things in this way. EBI already has something called Genome Reviews, which is sort of a precursor to this, which is what they were trying before using the Ensembl framework to just layer on top of GenBank. But this is going a bit further than that.
Anyway, that’s Ewan’s stuff really. That’s kind of centered on the EBI.
How does that arrangement work? Do you have weekly meetings?
We now have this structure this year where basically me, Rolf [Apweiler, head of the protein and nucleotide database group at EBI], and Ewan jointly run meetings which span across campus [and that include] about 180 people. Those are a series of seminars for all the various sub-groups, but also coordinators’ meetings of all the different project managers within that collection of teams.
The idea is to make sure there’s communication and progressive integration of all these resources. We have all the Ensembl views, but then there are views that are similar within UniProt and some of the other resources at EBI. There are things we have at Sanger like Pfam, the existing bacterial resources. There [are] lots of things where we can improve the integration between these different things and make it just easier to be able to use this resource.
What have you been up to in terms of database design?
There’s obviously the whole Ensembl project. That’s a progressive evolution, in terms of the core parts of that. …That’s all pretty static now in terms of the actual underlying SQL tables, structure, and things.
The actual evolution is happening on the comparative side and on the variation side and that’s all coupled with the huge amount of extra data that’s starting to be generated. We’ve made a major investment in [next-generation] sequencing machines. We built an IT infrastructure; we put in 340 terabytes of disk cache just to handle those machines, just the temporary processing area.
Which next-gen sequencing machines do you have?
We have close to 30 Illumina [Genome Analyzer platforms], we’ve got AB’s [Applied Biosystems SOLiD systems], and 454’s [instruments] as well. So we’ve made a real major commitment in this [area of next-gen sequencing].
Do you think the [bioinformatics field] has enough software developers?
There’s a shortage. You must have noticed at [the Genome Informatics] meeting that there were a lot of people in Europe advertising [for help].
There was a big problem a few years ago in recruitment in this area, and then there were a number of calls, at least in Europe, of quite a lot of universities starting MSc courses in bioinformatics. And that kind of provided a flow of people with some of the skills, or at least a grounding in the skills relevant for us to recruit them and use them in developer roles. But it has become more difficult recently, and we think some of that is probably the Web 2.0 business because there [are] so [many] attempts by people to re-engineer their websites, [which] are sucking up people. People who’ve been trained in bioinformatics or [who] we would recruit tend to have those kinds of transferable skills as well.
So there is a bit of a shortage at the moment. It may be very temporary. And I think there’s just a growth … of data in this area. You have all these new high-technology machines and they are all producing data and that [has] to be dealt with, and that takes people.
In terms of handling this new technology, I think we [at the Sanger Institute] are in a good position. We already have a pipeline set up so we can handle that many machines continuously and we’ve put the necessary IT in place for that as well — not just disk space but also a compute farm dedicated to that [for] processing the output from most machines.
But I think a lot of people have been struggling with the new technologies. Particularly small groups [have] been buying the new machines and then suddenly realizing that every three days it generates a terabyte [of data] that you have to do something with.
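[Editor’s illustration: the data-rate figures mentioned in this interview can be put into a rough back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each machine produces the quoted one terabyte every three days and that this rate applies uniformly to 30 machines writing into the 340-terabyte cache; the function names are hypothetical and the real throughput will differ by platform and run type.]

```python
# Rough sizing sketch. All figures are taken from the interview, but applying
# the per-machine rate uniformly to 30 machines is an assumption.
TB_PER_RUN = 1.0      # "every three days it generates a terabyte"
DAYS_PER_RUN = 3.0
CACHE_TB = 340.0      # "340 terabytes of disk cache"


def daily_output_tb(machines: int) -> float:
    """Terabytes produced per day by `machines` identical sequencers."""
    return machines * TB_PER_RUN / DAYS_PER_RUN


def days_to_fill(cache_tb: float, machines: int) -> float:
    """Days until the cache is full if nothing is ever cleared from it."""
    return cache_tb / daily_output_tb(machines)


if __name__ == "__main__":
    for n in (1, 30):
        print(f"{n} machine(s): {daily_output_tb(n):.2f} TB/day, "
              f"cache full after {days_to_fill(CACHE_TB, n):.0f} days")
```

[Under these assumptions a single machine would take years to fill such a cache, while 30 machines would fill it in about a month, which is why a temporary processing area of this size still needs data to be moved on continuously.]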
Can you provide an update on your work in protein structure prediction?
I don’t do that anymore. I [was] a co-organizer of the [Critical Assessment of Structure Prediction] competition until last year when I stepped down because of becoming head of informatics [at Sanger]. So I’m actively connected with the field and I am still involved in the Structural Classification of Proteins database.
So I’m kind of connected with that structure-classification area and the CASP involvement, but prediction? I am leaving that to people who are meeting with CASP.
Can you explain DAS and your work with Lincoln Stein’s group?
You asked what I was doing with databases. I would say one of the things I’m doing is pushing federation technologies in lots of different arenas. … [For example,] we were having a meeting yesterday about what are we going to do collectively onsite at Sanger about human genetics data that is going to be delivered as a result of all this high-throughput sequencing and genotyping [projects]. We need databases for that, and I think the underlying message for all of this is we can’t put all this stuff in one place. What we can do is try and at least make these databases interoperable dynamically with application programming interfaces [APIs].
One of the big successes in Ensembl has been [that] people can program against it. We have this layered mechanism of access. You can look on the Web; you can data mine using BioMart, and if that doesn’t work you can go and program against it directly using an application programming interface, and a lot of people are doing that. It’s becoming very popular. … Up [until] now, people who’ve been making data available have either [provided] flat files, which are quite frequently huge [and] which you can download, or they’ve [provided] a web interface.
A web interface is fine for looking at individual things. It’s useless, though, if you want to do a general analysis. And flat files are fine, but then it is an effort to download them and then work out [what] the format is, et cetera, et cetera, and then deal with that.
This API thing, then, is an intermediate solution; it’s web services to be combined. The idea about DAS is it’s a standardized web service. When people do make web services available, frequently every web service is different, so every time you want to connect [to] somebody else’s service you have to discover what that service is about, the structure, et cetera. The advantage of DAS is if it’s a DAS service you know that you can just combine it. If you are set up to deal with DAS services then they’re all the same. It’s a cheap way of integrating.
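[Editor’s illustration: the point about DAS services being “all the same” can be sketched in a few lines of Python. The example assumes the DAS 1.x convention of a “features” request for a segment and a DASGFF XML response containing SEGMENT and FEATURE elements; the server address, source names, and sample responses below are hypothetical, and only a few fields are read. It is a minimal sketch, not a full DAS client.]

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def das_features_url(base, source, segment, start, stop):
    """Build a DAS 'features' request URL for one region of one source."""
    query = urlencode({"segment": f"{segment}:{start},{stop}"}, safe=":,")
    return f"{base.rstrip('/')}/{source}/features?{query}"


def parse_das_features(xml_text):
    """Extract a few fields from a DASGFF response; works for any DAS source."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            features.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "type": (feature.findtext("TYPE") or "").strip(),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
            })
    return features


def combine(responses):
    """Merge features from several sources, tagging each with its origin."""
    merged = []
    for name, xml_text in responses.items():
        for feature in parse_das_features(xml_text):
            merged.append({**feature, "source": name})
    return sorted(merged, key=lambda f: (f["segment"], f["start"]))


# Hypothetical responses from two unrelated groups; same format, same parser.
GENES_XML = """<DASGFF><GFF><SEGMENT id="1" start="1000" stop="2000">
  <FEATURE id="gene1"><TYPE id="transcript">transcript</TYPE>
    <START>1200</START><END>1800</END></FEATURE>
</SEGMENT></GFF></DASGFF>"""

MOTIFS_XML = """<DASGFF><GFF><SEGMENT id="1" start="1000" stop="2000">
  <FEATURE id="motif7"><TYPE id="motif">motif</TYPE>
    <START>1100</START><END>1110</END></FEATURE>
  <FEATURE id="motif9"><TYPE id="motif">motif</TYPE>
    <START>1500</START><END>1512</END></FEATURE>
</SEGMENT></GFF></DASGFF>"""

if __name__ == "__main__":
    print(das_features_url("http://das.example.org/das", "genes", "1", 1000, 2000))
    for f in combine({"genes": GENES_XML, "motifs": MOTIFS_XML}):
        print(f["source"], f["id"], f["type"], f["start"], f["end"])
```

[Because both responses share one format, adding a third source needs no new parsing code, only another entry in the dictionary passed to `combine`.]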
So that’s addressed the individual feature type integration, the federation at the sort of data-mining level. So we’re trying to push this everywhere to link databases in these sorts of ways. When you do those sorts of things, a lot of the problem is discovery. People may set up a service but nobody knows it’s there, so one of the things I did specifically in my group was set up a DAS registry, which now has more than 300 things registered in it. It’s dasregistry.org and it’s kind of the only global registry [for this purpose].
There are quite a few [other] projects we have: there’s the Biosapiens project, which is a European Union-funded project, and that’s around human genome annotation, but it’s at the sort of protein sequence and structure level. There are a whole lot of labs across Europe [and] they’re all basically providing annotation on protein sequence and structure. They’ve been using DAS as the integration technology for that. So all those sources are in this registry.
I will [also] be using that for the [Encyclopedia of DNA Elements] scale-up project as well to integrate all these different groups that are involved in my consortia.
How is that going?
We’re just at the recruitment stage right now, but we’re scaling up what we did in the GENCODE project [to identify protein-coding genes in ENCODE regions, which was] led by Roderic [Guigó of the Centre de Regulació Genòmica in Barcelona, Spain]. We have all these automatic systems here, we’re very strong with the Ensembl automatic genebuild, but we can do better with the manual curation and we can do better still with the option to experimentally verify, and that was all evaluated in the EGASP experiment, which was published in [a special issue of] Genome Biology [Guigó R, Flicek P, Abril JF, et al. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biology 2006, 7(Suppl 1):S2].
That was [like] a CASP competition basically, and this time just for gene structures, and it was a good marker to evaluate [which] experiments are useful to do, [and] how good [the] computation [is]. It’s not as good as manual [annotation]. That’s the conclusion from that.
So basically we’ve just taken what we learned from the pilot and scaled it up [from 1 percent to 100 percent]. But it involves a few new computational groups in the US who have developed some novel methods that should help identify some things, some things that have just been missed before.
What else are you involved in?
While my primary activity is vertebrate [genome] annotation, I also have a research group of mainly PhDs [that] works on pure machine-learning methods for [motif] prediction. In particular, we have a motif-prediction algorithm [that] can be used to discover [more than 100] motifs in a single step. There is a paper published on the pilot usage in flies [Down TA, Bergman CM, Su J, Hubbard TJ (2007) Large-Scale Discovery of Promoter Motifs in Drosophila melanogaster. PLoS Comput Biol 3(1): e7], and a database associated with this called Tiffin.
Philosophically, although we will learn a lot about biology through data integration, we are never going to be able to predict consequences of unique new mutations … unless we can do ab initio prediction of gene structures, regulatory features, et cetera. So just like protein structure prediction — which is making progress [that is] measurable through CASP — I believe we have to keep working on pure ab initio methods until we can really predict directly. Given our success in motif prediction, which does not rely on comparative genomics — unlike many other methods — I'm optimistic.
The challenge now is to add value to the literature by building tools on top of the text, linking to databases and generally helping the scientific process by integrating literature with data.
Also, the UK has started to coordinate activities in [its National Health Service], which is building electronic patient records systems, with the [Medical Research Council] and [the] Wellcome Trust. In the era of large-scale sequencing and genotyping, patient collections such as UK Biobank, and personal genomics, there's lots of potential here to help research by better database integration. I'm on the new government oversight committee for this, [which is] part of [the MRC’s Office for Strategic Co-ordination of Health Research].
Overall, it’s a hugely exciting period.