Head of Informatics and Human Genome Analysis
Wellcome Trust Sanger Institute
Name: Tim Hubbard
Position: Head of informatics (since 2007) and head of human genome analysis (since 1997), Wellcome Trust Sanger Institute
Experience and Education:
– Research scientist, Medical Research Council Centre for Protein Engineering, Cambridge, 1990-1997
– Postdoctoral fellow, Protein Engineering Research Institute, Japan, 1989-1990
– PhD, department of crystallography, Birkbeck College, University of London, 1988
– BA in natural sciences (biochemistry), University of Cambridge, 1985
As head of informatics at the Wellcome Trust Sanger Institute, Tim Hubbard is involved in managing the flow of sequence data from the Sanger’s new sequencing platforms into data repositories.
He is also focused on genome annotation, for example through the Ensembl project and the ENCyclopedia Of DNA Elements project, and has roles in the Genome Reference Consortium and the International Cancer Genome Consortium.
In Sequence interviewed Hubbard about these projects at his office at the Wellcome Trust Genome Campus in Hinxton last month.
What is your role at the Sanger Institute?
I am the head of informatics. Part of that is IT – I chair a committee that handles the strategy for procurement. I am also a member of the board of management with responsibility for institute informatics.
I also work closely with [the European Bioinformatics Institute]. [For example], there is a new short-read archive, [the European Trace Archive], which is going to be at EBI. This will mean that a lot of data has to flow between Sanger and EBI. The data flow rate is just so large that we need to facilitate that.
EBI and [the US National Center for Biotechnology Information], we are all thinking about how to deal with these very large repositories. Sanger is obviously generating that data, so it’s worried about that, too.
The [current] long read [trace repository contains] more than 100 terabytes [already]. That’s a big database, by anybody’s standards. But the new one is going to be possibly a petabyte in a year. These are just very large numbers. The storage capacity is increasing, but things like tape are certainly not keeping pace. You worry about what happens if something goes wrong and you have to restore from tape; it would take months. Some of this data cannot be regenerated, or certainly the cost of regeneration is still much greater than the cost of storage.
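The "it would take months" worry is easy to sanity-check with back-of-envelope arithmetic. The tape throughput figure below is an illustrative assumption, not a number from the interview:

```python
# Back-of-envelope restore time for a petabyte-scale archive.
# Assumes ~100 MB/s sustained read from a single tape drive
# (an illustrative figure; real throughput varies by drive and library).
archive_bytes = 1e15           # 1 petabyte
tape_rate = 100e6              # 100 MB/s, single drive
seconds = archive_bytes / tape_rate
days = seconds / 86400
print(f"Single-drive restore: {days:.0f} days")  # ~116 days, i.e. months
```

Even with several drives streaming in parallel, a full restore sits in the weeks-to-months range, which is why losing the primary copy of unregenerable sequence data is such a concern.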
How do you transfer data between EBI and NCBI?
That fluctuates. You can do that over the network [but] maybe it is simpler to send disks via DHL. Right now, the big pressure on that is the 1000 Genomes project. They use a specialized piece of software to try to accelerate [data transfer]. It’s definitely difficult. I was at NCBI last week, and they also say very similar things about the problems, just the bandwidth required for these [kinds of] projects.
Tell me about your involvement in genome annotation.
Before I set up informatics, my main activity here was annotation, the large annotation projects of vertebrate genomes.
I am [co-head of the] Ensembl [project, which is] split between EBI and Sanger. One big part is the automatic gene annotation across all these genomes. The other big piece is the web team that builds the website, the Ensembl genome browser.
How are you going to handle thousands of genomes in such a browser?
If you look at the browser right now, it can already handle several strains of mice, and it already has multiple humans in there. Basically, the underlying engineering has been done.
If you think about the way Ensembl works, we evolve our way of storing data continuously. There is a new release every two months that has new data, but nearly every two months, the schema has changed slightly.
Ensembl was built on a database structure with application programming interfaces on top of that. Some users look at the website; some users program against the API. The API may not change very much, so people elsewhere in the world can write Perl scripts and do calculations on data stored in Ensembl by downloading the core software library that allows them to access the data stored here.
The whole idea behind that is to provide a layer of robustness. The way of asking for a list of genes is always the same, even if we have changed the way we store that list of genes. That allows us to reengineer the way we store things continuously. And that’s happened — we are on version 49 right now.
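That robustness layer can be sketched in miniature. This is not the real Ensembl Perl API; it is a minimal Python illustration, and the class names, stable IDs, and record layouts are invented. The point is that the caller's question ("give me the genes") is insulated from the storage schema:

```python
# Sketch of a stable accessor layer over a changing storage schema.
# Callers ask for genes the same way whether the data sits in the old
# or the reengineered schema; identifiers and layouts are invented.

class GeneStoreV1:
    """Old schema: genes kept as flat (stable_id, name) rows."""
    def __init__(self):
        self._rows = [("GENE0001", "BRCA2"), ("GENE0002", "TP53")]

    def fetch_all_genes(self):
        return [{"stable_id": sid, "name": name} for sid, name in self._rows]

class GeneStoreV2:
    """New schema: genes keyed by stable id, with extra fields."""
    def __init__(self):
        self._table = {
            "GENE0001": {"name": "BRCA2", "biotype": "protein_coding"},
            "GENE0002": {"name": "TP53", "biotype": "protein_coding"},
        }

    def fetch_all_genes(self):
        return [{"stable_id": sid, "name": rec["name"]}
                for sid, rec in self._table.items()]

def gene_names(store):
    # Client code is identical for every schema version.
    return sorted(g["name"] for g in store.fetch_all_genes())

print(gene_names(GeneStoreV1()))  # ['BRCA2', 'TP53']
print(gene_names(GeneStoreV2()))  # ['BRCA2', 'TP53']
```

Because both backends answer the same interface, the storage can be reengineered release after release without breaking scripts written against the API.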
So the way of implementing how you store variation, how you store alternate humans, alternate strains in mice is progressively being reengineered to support larger sizes, larger numbers. There is the data storage, there is the layer of programming interfaces, and there is a way of visualizing it.
There are lots of potential ways you can visualize [the data], depending on what you want. But a lot of the problem is working out the best way to store it. Most of that activity and development is going on at EBI.
What about manual curation?
The manual curators [in the Sanger’s HAVANA, or Human and Vertebrate Analysis and Annotation group] only work on finished genomes. If you look at the human and mouse genomes, they are high-quality genomes, [but] they still get refined. That is kind of the end-game, and we have been manually curating those gene structures with a group of 20 or so annotators for years. There, the automated pipeline and the manual [annotation] kind of converge towards the complete set of genes. That’s now supported through the [ENCyclopedia Of DNA Elements] project. I am [a principal investigator] of this consortium, which [aims] to get this complete gene list.
The two subgroups, the Ensembl gene builders and the manual curators, [are working] towards a single reference gene set. If you go and look at Ensembl now, when you go and look at human, it says “Ensembl/HAVANA,” because the gene set there is a merge between the two.
We [also] have something called the [Consensus Coding Sequences] consortium, which includes NCBI and [University of California] Santa Cruz [in addition to Ensembl and HAVANA]. If we agree on the protein coding, we have exactly the same sequence, then we mark that as CCDS. So there is a subset of human annotation where everyone agrees, and we basically think we have converged.
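The CCDS agreement test can be illustrated with a toy comparison. The gene labels and sequences below are invented; the real consortium compares complete annotated coding sequences across the groups' gene sets:

```python
# Toy version of the CCDS consensus test: a coding sequence is marked
# as consensus only when independent annotation sets produce exactly
# the same sequence. Labels and sequences are invented for illustration.
ensembl_havana = {"geneA": "ATGGCCAAGTAA", "geneB": "ATGTTTGGATGA"}
ncbi_ucsc      = {"geneA": "ATGGCCAAGTAA", "geneB": "ATGTTCGGATGA"}

ccds = {gene for gene, cds in ensembl_havana.items()
        if ncbi_ucsc.get(gene) == cds}
print(ccds)  # {'geneA'}: only exact matches enter the consensus set
```

Anything that disagrees, like geneB's single-base difference here, stays outside the consensus subset until the groups converge.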
How much of the human genome is that?
It’s 20,000 transcripts, or more than 17,000 genes. It’s not everything, by any means, and you could say we have done the easy ones. And it’s also only the coding regions, it does not include UTRs [where] it’s much harder to get agreement. [It is also] very few transcripts per gene, whereas we know the manual curators are annotating far more, maybe five or six per gene. So there is probably a lot of depth which is missing. But it’s the most reliable subset.
You are also involved in the Genome Reference Consortium, right?
That is something that has been gradually planned. If you think back to 2003, there was the publication [of] the majority of the human genome at the finished level. And [following that] there were publications of chromosome papers for a long period. After all of those papers, there are [still] a certain number of gaps [and] some ambiguous regions.
The question was, what to do about looking after this in the long term? There were a number of discussions of the International Human Genome Consortium, mainly held at the Cold Spring Harbor Biology of Genomes meeting, about how to take this forward. Now, we basically got agreements that Sanger and [Washington University in St. Louis] will take care of the experimental work — take over, basically, all the chromosomes — and look after them indefinitely, and I think everybody is happy with that.
Will the 1000 Genomes Project fill some of the gaps?
It’s not specifically for that, but the data is there, and it will get integrated. We have allocated resources, both WashU and Sanger, to do a certain amount of experimental investigation.
Also, there is informatics here, and it will be linked to EBI, because they are part of the [Database of Genomic Variants] consortium. So there will be data coming in from that, and then NCBI is tracking these places where people have asked for investigations. It’s like a bug-tracking system for a piece of software, so you can publicly see what the state of that is.
[The GRC will keep the human genome updated] in the light of the evidence that comes out of the increased amount of sequencing. That, obviously, will also feed into the annotation, and vice versa. Sometimes you get an mRNA which doesn’t align to the genome sequence, so there is a question: is the genome sequence wrong, or is the mRNA just from a different individual? So these things get flagged and investigated.
What about the International Cancer Genome Consortium?
I am on one of the committees for that, too. You can think of it alongside the 1000 Genomes project — in terms of data, there are very similar issues, really.
We have the [Catalogue of Somatic Mutations in Cancer] database here [at the Sanger Institute], a repository for somatic variation [in cancer]. That’s providing some components.
But obviously, the raw sequencing data doesn’t go into a database like that, it goes into the sequencing repositories. So we see federated data structures, linking all these things together. If you look at the way that the 1000 Genomes is constructed, and the ENCODE project is constructed, we have what’s called data coordination centers.
Because of the large data volumes, I think the landscape of all this becomes federation, where you link databases together dynamically. Some specific federation technologies that we use extensively are [the Distributed Annotation System] and Mart.
One of my big points here is that just because it is cancer, it is still sequence data, and it is still array data. This is one of the things we have to deal with in bioinformatics, we do not want to have too many databases, because it just makes it harder to integrate data. So if you have one data type, you should have one repository and not set up a completely new one, just because it has got a different label.
At Sanger, for example, we have to have a staging area for the data that comes off the sequencing machines, and we do the initial pre-processing here, too, but we don’t want to store that forever. We want to give it away to a database that is going to look after it forever. So all the producers in these sorts of projects are going to have to have their own informatics; they are going to have to handle the pipeline that ultimately allows that data to go into repositories.
What can we learn from the 1000 Genomes project?
We are obviously going to learn a lot from things like the 1000 Genomes project about just the structure of the human population, which is going to provide a huge chunk of background information to help understand genetic studies, such as the Wellcome Trust Case Control Consortium. When you do these studies, you can get confounded by the ancestry of the individuals. Improving the reference and having a complete set of annotations will help enable just basic research.
At the same time, you have the medical side. In the UK, of course, we have the [National Health Service], and there is quite a lot of effort going on to link these things together. I am on the E-Health committee, which sits between the researchers, represented by the Medical Research Council, and the NHS, which has built a huge patient record system for the whole country. We want to research-enable that, to make it easier to set up collections for, basically, statistics in the population, which will help influence healthcare.
I think the direction is going to be that genotyping and sequencing costs will keep dropping. For the individual, these relative risks are not really helpful. But if you stratify the population – if you think of the NHS deciding how it is going to screen patients to get the best healthcare outcomes across the country – then there probably are things that can be done. For example, there are a certain number of diseases, and probably we are all in the one-percent risk category for one of those. It doesn’t help the individual very much because the risk is really meaningless. But if you aggregate, then you can probably target people that should be screened for certain things and help the overall process.
This is thinking several years ahead, maybe a decade ahead, but bits of this are beginning to be constructed. There will be database infrastructure for keeping a record of variations and what we have deduced about phenotype, and there will be research projects contributing to that to improve these risk calculations. The [next level of annotation] is going to be how the variations are affecting genes and functions.