Head of pathogen genomics and
Director of sequencing
Wellcome Trust Sanger Institute
Name: Julian Parkhill
— Head of pathogen genomics, Wellcome Trust Sanger Institute, since 2006
— Director of sequencing, Wellcome Trust Sanger Institute, since 2008
Experience and Education:
— Senior investigator, pathogen sequencing unit, Sanger Institute, 2003-2005
— Senior group leader, pathogen sequencing unit, Sanger Institute, 2001-2002
— Project manager, pathogen sequencing unit, Sanger Institute, 1999-2001
— Senior computational biologist, pathogen sequencing unit, Sanger Institute, 1997-1999
— CRC research fellow, Institute for Cancer Studies, University of Birmingham, 1992-1997
— Research fellow, School of Biological Sciences, University of Birmingham, 1991-1992
— PhD in bacterial molecular genetics, University of Bristol, 1991
— BSc in biological sciences (genetics), University of Birmingham, 1986
In his dual role as head of pathogen genomics and director of sequencing at the Wellcome Trust Sanger Institute, Julian Parkhill both helps organize the institute’s sequencing platform and uses it for his own research.
During a visit to his office in June, In Sequence talked to Parkhill about how the institute makes new sequencing technologies available to its faculty, and how those technologies are used in a variety of pathogen sequencing projects.
What is your role at the Sanger Institute?
My role as director of sequencing is a very recent one, and it is really a strategic role rather than an operational one. Carol Churcher is head of sequencing operations, Harold Swerdlow is head of sequencing technology, and Tony Cox is head of sequencing informatics. Because sequencing is so large and complex, we decided it is best to have an operational management team rather than a single head of sequencing. My role as director of sequencing is really to give the group some strategic direction, and primarily to act as a link between the sequencing group, as a service group, and the faculty.
The model that we are trying to operate is that the projects that drive sequencing come from the faculty. You have seen that with the 1000 Genomes Project. The pathogen projects, which I’ll talk about later, the [copy number variation] projects, and the cancer projects, all of these drive sequencing, and sequencing has now been built as a service group that serves all those.
How does this work in practice?
The institute has built an infrastructure for sequencing, and underwrites that capacity as a core function of the institute. And that then gives the faculty the freedom to propose projects to fit within that. Some of the larger sequencing groups have defined capacity within that, for example the human variation, the cancer sequencing, the pathogen sequencing [groups]. But all of the faculty are able to come to the sequencing committee and propose projects that can be done within the capacity. So it’s not a question of having to justify each part of capacity. The infrastructure is there, and we are after the best projects to use it on.
Have you made any changes since you assumed this new role?
The main operational changes since we took this on have been building sequencing into a service group that is responsive to the whole faculty and, of course, bringing the new technology to full production while moderating the use of the older technologies.
Tell me about your other role, that of head of pathogen genomics.
We [used to be a pathogen sequencing unit, but] we have become pathogen genomics, because we are much broader than simply sequencing now. Our interests are very broad; they range from bacteria, bacteriophages, and viruses through eukaryotic single-cell parasites to large parasites such as the helminth worms. My particular interest has always been bacterial pathogens, and we work on a large number of them, including Salmonella, Streptococcus, Staphylococcus, Mycobacterium, and a range of others.
On the parasites, malaria is a strong focus of the group, but also Leishmania and Trypanosoma and others. And we are now starting up a large helminth program, in collaboration with a number of groups, and also with [Washington University], where they have a helminth program as well. Helminths are interesting because they have large and complex genomes. They are very understudied; most of the class [causes] neglected diseases. They cause a lot of morbidity but not a very great deal of mortality, relatively speaking.
What diseases do they cause?
‘Helminth’ is a broad term that covers a lot of phylogenetic diversity, but we are looking at Schistosoma, which causes schistosomiasis; tapeworms, Echinococcus; threadworms, Strongyloides; also roundworms like Haemonchus, Ascaris, and others; Trichuris; Onchocerca, which is the worm that causes river blindness; etcetera. These cause a lot of morbidity, but they are fairly understudied. Actually, the research base has been declining, and that’s partly because they are very difficult to work with. They are not easy to do genetics with, and we feel that providing a genomic platform will enhance research on these organisms. What we are trying to do now is build reference genomes of some of the most important human pathogens, but also of related model organisms where genetics can be done.
The depth with which we work on [these species] depends on how long we will be working on them, and on the organisms themselves. With the bacteria, we started off 10 years ago doing reference sequences. That includes things like Mycobacterium tuberculosis, Salmonella typhi, Yersinia pestis; a lot of these key references. And then, over the years, we have moved through comparative genomics, [comparing] two or three or four [species]. And now with the new technologies, we are moving into very large-scale variation detection. We just published a paper [in Nature Genetics] on 19 Salmonella typhi strains [using] 454 and Solexa sequencing, and now, for a number of Salmonellae and Staphylococci and others, we are putting hundreds of strains through the new [sequencing] technologies, doing variation detection.
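At its core, the large-scale variation detection described above means comparing each strain against a reference, position by position. A minimal sketch of that idea in Python (the sequences and function names are invented for illustration; real pipelines map millions of short reads and model base quality, indels, and coverage):

```python
# Toy SNP detection: compare an aligned strain sequence against a
# reference and report every mismatched position. Purely illustrative;
# not the pipeline described in the interview.

def call_snps(reference, strain):
    """Return [(position, ref_base, strain_base)] for each mismatch."""
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, strain))
        if r != s and s != "-"  # ignore alignment gaps in the strain
    ]

ref    = "ATGCGTACGT"
strain = "ATGCGAACGT"
print(call_snps(ref, strain))  # → [(5, 'T', 'A')]
```

Scaling this position-by-position comparison to hundreds of strains is essentially what the new technologies make affordable.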
With the parasites, we are still mainly [doing] comparative genomics. The reference sequences are a lot more recent — the Trypanosoma cruzi and brucei papers that came out recently, the malaria paper a few more years ago, and Leishmania. We are now working through comparative genomics on those, and with malaria, we are moving into very large-scale variation detection, sequencing dozens of Plasmodium strains.
But we are also moving into biology. We have Gordon Dougan and his group on site; he has very strong interests in enteric pathogens and their biology, the interaction with the hosts. That’s where there is synergy with the mouse knockout groups: we can look at host-pathogen interactions both in terms of knockout mice and in terms of the pathogen itself.
Similarly, we are expanding in the biology of malaria. We have just appointed some malaria biologists — [focusing on] mouse malaria and human malaria — who will be doing large-scale investigations of malaria biology, especially interaction with the host.
And then we are using the new [sequencing] technologies particularly for transcriptomics studies now, which works really nicely in bacteria. We have this mantra, ‘broad and deep.’ With some pathogens, we are going very, very deeply and doing biology, and doing knockouts, and doing proteomics and transcriptomics, but we are also maintaining this interest in the breadth of organisms at the sequence level.
How do you use second-generation sequencing technologies for these different projects? Do you have preferences of one over another for certain applications?
At the moment, we are generating reference sequences with a mixture of 454 and capillary [electrophoresis] data and some Solexa data, and in some cases 454 data plus Solexa data. We are using the Solexa [machines] primarily for variation detection, SNP detection, indel detection. I think that will change over the next few months or so. I think it’s very clear now that we will be able to use Solexa and those kinds of technologies more and more for de novo sequencing and reference sequencing.
Do you also use Applied Biosystems’ SOLiD technology?
Yes, we have got five SOLiDs. At the moment, they are primarily running a large cancer project. We have started experimenting with them for bacterial variation detection as well. We try to be platform-agnostic as much as we can. Different platforms have different advantages and disadvantages, so obviously, there is an advantage in maintaining all the platforms, and there is also a cost of maintaining all the platforms, as well. But at the moment, we are running things in parallel.
Are you also still using microarrays?
We have done microarray work in the past, and we still do some microarray work, but now, all the microarrays we use are commercial microarrays. We buy them in from Affy or [Roche/]NimbleGen or Agilent and use them experimentally. As I say, we are really starting to explore direct transcriptome sequencing, especially in bacteria, as it works so beautifully. For bacteria and small eukaryotes, with a relatively small amount of sequencing, you can get very deep coverage of the transcriptome. And that means that unlike microarrays, you get base-pair resolution of where transcripts start and stop. You identify things that you would not be able to see on the array, so you can immediately see small RNA genes that were unannotated before. And with the eukaryotic analysis, the parasites, you can see all the splice sites to the base pair.
[For example,] the helminths have very unusual genome structures, so recognizing genes is very difficult. They are just a long way out on the phylogenetic tree from anything that is well studied, and in some of them it is very difficult to do accurate de novo gene prediction. But if we can do direct RNA transcriptome sequencing, then we can use that to guide gene identification, or even identify the genes, and the splicing, directly.
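The idea of using transcriptome sequencing to find transcripts at base-pair resolution can be sketched as a per-base coverage pileup: wherever read depth rises above background, a transcribed region begins or ends. A toy illustration in Python (all read positions, lengths, and thresholds here are invented for the example):

```python
# Sketch of per-base transcriptome coverage: pile up read alignments
# on a toy genome, then report contiguous covered runs as candidate
# transcribed regions. Illustrative only.

def coverage(genome_len, alignments):
    """Per-base read depth from (start, length) alignments."""
    depth = [0] * genome_len
    for start, length in alignments:
        for pos in range(start, min(start + length, genome_len)):
            depth[pos] += 1
    return depth

def transcribed_regions(depth, min_depth=1):
    """Contiguous runs of positions at or above min_depth."""
    regions, start = [], None
    for i, d in enumerate(depth):
        if d >= min_depth and start is None:
            start = i
        elif d < min_depth and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(depth) - 1))
    return regions

reads = [(10, 5), (12, 5), (30, 4)]
depth = coverage(50, reads)
print(transcribed_regions(depth))  # → [(10, 16), (30, 33)]
```

This is why, unlike array probes, sequencing pinpoints where a transcript starts and stops, and why previously unannotated small RNAs show up immediately.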
Do you prefer long reads or short reads for transcriptome sequencing?
For transcriptome sequencing, the short reads, because assembly is not a problem. We are looking at mapping, and the greater your depth of coverage, the more dynamic range you have in terms of the transcriptome. For a bacterium, the whole transcriptome is only a few megabases, so you get very large oversampling of it from even a single lane on Solexa.
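The oversampling point is simple arithmetic: mean depth is total bases sequenced divided by target size. A back-of-the-envelope sketch (the read count, read length, and transcriptome size below are illustrative assumptions, not figures from the interview):

```python
# Rough fold-coverage calculation for bacterial transcriptome
# sequencing. All input numbers are illustrative assumptions.

def mean_depth(num_reads, read_length_bp, target_size_bp):
    """Average fold-coverage: total bases sequenced / target size."""
    return num_reads * read_length_bp / target_size_bp

reads_per_lane = 5_000_000   # assumed reads from one Solexa lane
read_len = 36                # bp, typical early short-read length
transcriptome = 4_000_000    # bp; "only a few megabases" for a bacterium

depth = mean_depth(reads_per_lane, read_len, transcriptome)
print(f"~{depth:.0f}x mean coverage from a single lane")  # → ~45x
```

Even with conservative assumptions, a single lane oversamples a bacterial transcriptome many times over, which is what gives the dynamic range for expression measurement.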
You are also involved in the International Human Microbiome Project. What is your role in that?
My role is part of the MetaHIT European Commission-funded program (see In Sequence 6/10/2008), and our role within MetaHIT is the sequencing of reference genomes. We want to sequence, initially, 100 or so reference genomes because, as we and others recognize, the metagenomic data that is being produced will be much more interpretable if you have a good set of reference genomes to compare it to. We are just about to kick that off and start sequencing some reference genomes. I have also been asked to co-chair the strain coordination committee for the international consortium. So we will make sure that there is a sensible overlap between what groups are doing in terms of generating reference sequences, and that we are avoiding unnecessary duplication. Clearly, having multiple genomes is useful sometimes. Karen Nelson at the J. Craig Venter Institute is the other co-chair, and that is being steered by [the US National Institutes of Health].
What are the major bottlenecks and challenges?
I think for the reference genomes, the major bottleneck is going to be getting strains. Because we can scour our strain collections, and we can start isolating new strains, but predominantly, the vast majority of species in the gut are uncultured, or not yet cultured. So a lot of groups are working on how to address that, including us. We are systematically trying to culture things from the gut, but also trying to work out ways of amplifying and sequencing chromosomes from uncultured bacteria.
We are experimenting with laser dissection, which should enable us to at least identify things to the genus level with FISH probes, and then extract them and maybe start sequencing.
Do you see any bottlenecks in terms of doing the analysis quickly and automatically?
Yes. I have always been a strong advocate of manual annotation, but there comes a point where it is not feasible, when you start dealing with hundreds of strains. So we have to find a way to make automated annotation robust, or at least make it clear what is automated and what is not automated, and what the levels of trustworthiness are on the automated data. But yes, we will have to start to do large-scale automated annotation that will not get manual curation, and to be honest, that worries me, but what can you do?
So these will not all be finished genomes?
No, these will all be draft genomes. We are hoping that they will be closed and contiguous; all the evidence is that the draft assemblies you get with the new technologies will be on the order of a few dozen contigs per genome, maybe fewer. And it should be relatively easy to close most of those gaps with PCR in a relatively automated way. But they are not going to be finished in the sense of every base pair being checked, as sequences used to be.