Name: Margret Hoehe
Position: Group leader, Max Planck Institute for Molecular Genetics, Berlin, since 2002
Experience and Education:
CSO and co-founder, GenProfile, 1999-2002
Group leader, Max Delbrück Center for Molecular Medicine, 1994-2002
Visiting Scientist, Harvard Medical School (lab of George Church), 1992-1994
Postdoc/staff member, Clinical Neurogenetics Branch, National Institute of Mental Health, 1987-1992
Research fellow, Neurochemical Department, Psychiatric Hospital, University of Munich, 1982-1987
PhD in Neuroendocrinology and Neuropharmacology, University of Munich, 1986
Diploma in Psychology, University of Munich, 1983
MD, University of Munich, 1980
Margret Hoehe leads the Genetic Variation, Haplotypes and Genetics of Complex Disease group in the department of Vertebrate Genomics at the Max Planck Institute for Molecular Genetics in Berlin. Last year, she won a three-year, €2.7 million ($3.9 million) grant through the German government's National Genome Research Network Plus funding program to sequence major histocompatibility complex haplotypes, using fosmid libraries from 100 individuals, on the Applied Biosystems SOLiD platform. In Sequence visited Hoehe in her office in Berlin last month to talk about her work. An edited version of the conversation follows.
What are your research interests?
Our research is based on a longstanding commitment on my part to genetic variation and its underlying haplotype structures, including the molecular haplotype structures.
In the past, we have carried out large-scale deep resequencing studies of biomedically important candidate genes and have performed haplotype-based disease association studies. For example, we recently performed a large obesity study, where we analyzed 83 candidate genes in 1,500 severely obese and healthy individuals and identified quite a number of significant risk profiles.
Following up on deep resequencing-based identification of variants, we have developed novel approaches to analyze haplotype-genotype-phenotype relationships. It became pretty obvious that once you deeply sequence candidate genes of interest, tremendous diversity of haplotypes will result, and this requires approaches to reduce complexity and to be able to establish correlations to the phenotype. I think this is a point that will become much more important in the future, as many large research groups move into deep resequencing.
We have also performed a comparative evaluation of haplotype structures, as inferred from deep resequencing data and from the various versions of the HapMap. Basically, we have found that the current SNP and HapMap databases still do not allow complete identification of common haplotype structures.
In addition, we have also moved towards the analysis of molecular haplotype structures. For that purpose, we have established a worldwide unique haploid reference resource of 100 fosmid libraries, which should allow us to analyze the molecular haplotype structures of any candidate genes of interest, of genomic regions, of disease gene regions, and ultimately, genome-wide.
Where do the samples for these libraries come from?
The libraries have been established from individuals of a representative German population cohort, PopGen, from the University of Kiel. It's a collaboration with Stefan Schreiber and Huberta von Eller-Eberstein. The samples are phenotypically characterized for 300 phenotypic items and have all been genotyped by Affymetrix 1,000K chips. In addition, once we knew that we would move towards molecular MHC haplotype sequencing, we also performed four-digit HLA typing of all the samples to identify a broad spectrum of MHC haplotypes, including risk profiles.
Our fosmid clones, which are roughly 40-kilobase haploid DNA segments, are formatted in pools of fosmids, so-called haploid clone pools — three 96-well plates with 5,000 clones per pool, or, alternatively, in superpools, where we combine the three 96-well plates into one to obtain superpools of 15,000 fosmids. On the one hand, that's much easier to handle; on the other hand, if we want to pull out region-specific clones, we need to develop approaches to map their presence and position, and we need to test and apply approaches to select or enrich such fosmids from the pools.
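As a rough illustration, the arithmetic of this pooling format can be sketched as follows. The plate, well, and clone counts are taken from the description above; the per-library totals and the coverage estimate are back-of-the-envelope figures, and the haploid genome size is an assumption not stated in the interview:

```python
# Sketch of the haploid clone pool format described above.
# Plate/clone counts are from the interview; totals and the
# coverage estimate are illustrative only.

FOSMID_SIZE_KB = 40       # ~40-kilobase haploid DNA segments
CLONES_PER_POOL = 5_000   # clones per haploid clone pool (one well)
PLATES_PER_LIBRARY = 3    # three 96-well plates per library
WELLS_PER_PLATE = 96

def library_stats(haploid_genome_gb=3.1):
    """Estimate totals for one fosmid library.

    haploid_genome_gb is an assumed human haploid genome
    size (~3.1 Gb), used only for a rough coverage figure.
    """
    pools = PLATES_PER_LIBRARY * WELLS_PER_PLATE
    # Superpools merge the three plates into one, so corresponding
    # wells are combined: 96 superpools of 15,000 clones each.
    superpools = WELLS_PER_PLATE
    clones_per_superpool = PLATES_PER_LIBRARY * CLONES_PER_POOL
    total_clones = pools * CLONES_PER_POOL
    total_gb = total_clones * FOSMID_SIZE_KB / 1e6  # kb -> Gb
    return {
        "pools": pools,
        "superpools": superpools,
        "clones_per_superpool": clones_per_superpool,
        "total_clones": total_clones,
        "total_gb": round(total_gb, 1),
        "approx_haploid_coverage": round(total_gb / haploid_genome_gb, 1),
    }

print(library_stats())
```

Under these assumptions, one library holds 288 pools (1.44 million clones, roughly 57.6 Gb of cloned DNA), which is why a region-specific mapping or enrichment step is needed before sequencing.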
What enrichment method are you using to fish out fosmids of interest?
We are currently in the process of comparatively evaluating a spectrum of front-end technologies. Most of them have been developed for enrichment from genomic DNA, but we have to take into consideration that we enrich fosmids, which involves some testing and adaptation to this resource, and potentially a little bit of technology development.
In particular, we are currently testing the Febit HybSelect array. We are also working with the NimbleGen 385K capture array, which we are testing in collaboration with Bernhard Korn from the German Cancer Research Center. We are gearing up to test the Agilent SureSelect solution-based technology, and possibly, we will look into the microdroplet technology by RainDance. We will see how well each technology will perform, how large the regions are that they cover, and possibly, we will end up with a combination of methods.
The question is also how the hypervariable regions in the MHC, which also happen to be functionally very important, will perform, and how we can best combine approaches to capture the MHC region entirely. We have done some substantial oligo probe design here in house. If we end up with an entire probe panel for MHC, this will be of great value, because that will probably immediately result in diagnostic testing devices and oligo panels or chip-based technologies, which many researchers, including those in transplant medicine, will be able to use.
How did the MHC haplotype sequencing project come about?
When we all applied for the current NGFN Plus funding period, it was clear that we wanted to carry out analysis of molecular haplotype structures underlying disease regions, based on our established haploid reference resource. When we discussed which region to establish molecular haplotype structures for, there was general interest of the entire network to analyze the MHC region, because this is the most disease gene-rich region in the human genome.
Many disease phenotypes have been mapped to the MHC region, for instance psoriasis, sarcoidosis, type 1 diabetes, schizophrenia, alopecia areata, allergies, and asthma — in general many infectious diseases, autoimmune diseases, and inflammatory diseases. Also, this region is of extreme importance for transplant medicine.
The MHC region is highly variable. Just to illustrate, there are more than 70,000 SNPs in this region — and a plateau has not been reached — and there is a large number of structural variations such as insertions, deletions, and copy number variants.
Once you have identified a disease association, the next step would be to identify the potential causative variations. Because the MHC region is so variable, and because linkage disequilibrium may extend over several megabases, the most promising approach is simply to completely and directly sequence the entire region.
The MHC Haplotype Project conducted at the [Wellcome Trust] Sanger Institute was the first large sequencing study of the MHC region, analyzing eight common haplotypes. They sequenced DNA from BAC libraries prepared from HLA-homozygous consanguineous cell lines, so they would not need to tear apart both molecular haplotypes. This was the first study that also demonstrated the tremendous complexity of that region at the DNA sequence level. Amongst these eight haplotypes, there are, just within one defined SNP-rich region, almost 40,000 SNPs, and each haplotype may differ from the PGF reference haplotype sequence by up to 16,000 SNPs and up to more than 2,000 indels.
This also leads to the conclusion that there may be huge differences in gene and sequence content between haplotypes. And that tells us that the currently applied approach of mixed diploid sequencing may no longer be sufficient. The consequence is that you really have to sequence and assemble both underlying molecular haplotypes separately.
Another conclusion from these first MHC sequencing results is that many more potentially disease-causing variants must be out there, that the full variation content of the MHC has by no means been mined, and that the next step would be to analyze MHC haplotype sequences both at the population level at greater depth, and in a substantial number of diseases.
What's your plan for doing that?
Given our haploid reference resource of 100 fosmid libraries, we have the perfect resource to be able to assemble both MHC haplotypes separately, and moreover, sequence MHC haplotypes at greater depth. Based on the format of our libraries, we pursue three different approaches in parallel, in order to always stay flexible and incorporate the most recent technology developments.
We first pursue a classical approach, which means we map MHC-positive fosmid clones in the libraries to the specific haploid clone pools by applying an MHC SNP-mapping panel. Based on the information we collect, we isolate the specific fosmids that can then be grouped into two tiling paths for the two molecular haplotypes of an individual and subjected to next-generation sequencing.
In a second approach, we apply the recently developed hybridization- or PCR-based enrichment technologies, as outlined earlier. Then we sequence what we have enriched, and then at the sequence level, we assemble the fosmids and tile them into two different contiguous haplotype sequences.
The third approach — [looking forward] — is probably the most straightforward one, namely sequencing entire pools of fosmids directly. The second-generation sequencing technologies are supposed to double their throughput, roughly from 50 to 100 gigabases per run by the end of the year or early 2010 at reduced cost. We have extrapolated that once the throughput has doubled, we may be able to sequence 100 MHC haplotype sequences by just applying that approach.
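The extrapolation behind sequencing 100 MHC haplotypes in one run can be sketched roughly. The 100-gigabase-per-run figure and the mappable-to-raw ratio (145 of 265 Gb reported later in the interview, about 0.55) come from the conversation; the MHC region size of roughly 4 Mb and the assumption that reads can be cleanly assigned to haplotypes via the fosmid pools are mine:

```python
# Back-of-the-envelope check of the "100 MHC haplotypes per run"
# extrapolation. RUN_YIELD_GB is from the interview; the MHC size
# and mappable fraction are assumptions for illustration.

RUN_YIELD_GB = 100   # projected per-run throughput after doubling
MHC_SIZE_MB = 4      # assumed size of the MHC region (~4 Mb)
N_HAPLOTYPES = 100

def coverage_per_haplotype(mappable_fraction=0.5):
    """Average fold coverage per MHC haplotype from a single run,
    assuming reads distribute evenly across all haplotypes."""
    usable_gb = RUN_YIELD_GB * mappable_fraction
    target_gb = N_HAPLOTYPES * MHC_SIZE_MB / 1e3  # Mb -> Gb
    return usable_gb / target_gb

print(f"{coverage_per_haplotype():.0f}x per haplotype")
```

Under these assumptions a single 100-Gb run would leave about 125-fold average coverage per haplotype, which is why direct sequencing of entire fosmid pools becomes plausible once throughput doubles.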
How did you decide to use the SOLiD platform for the project?
Decision-making was very difficult. I had this very big checklist, and in the end, I decided on SOLiD because the Applied Biosystems team provides really superb service — that was a major criterion. Secondly, it seemed the accuracy for variation analysis might be a little bit better, due to the two-base color code.
We have literally, within half a year, established our next-generation sequencing platform from zero. The machine was installed Oct. 31, 2008. We started with version 2 and upgraded to version 3 at the beginning of March. And within that half year, we produced 265 gigabases of raw sequence and 145 gigabases of mappable data. I would really like to acknowledge my team, led by Eun-Kyung Suk. It's the only all-women team in Germany on such a high-tech machine. They have done a superb job.
We have also established the data analysis pipeline from scratch. We first implemented the SOLiD standard secondary analysis pipeline, we adapted the analysis to the eight common MHC haplotype reference sequences, and then, most importantly, we developed specific fosmid analysis programs in order to meet the specific analytical requirements of the project. We have developed a fosmid detection program in order to filter signal from noise on the sequence read level. We have also developed a module to correlate the high-throughput SNP mapping data with the next-generation fosmid sequence data as a quality control, and we have developed the first basis of an MHC haplotype project database, which is currently only accessible to a few people who are working with it. I really want to highlight the work of Roger Horton, who came from the Sanger Institute, where he was part of the MHC Haplotype Project, and of Thomas Hübsch. They have done an excellent job at the bioinformatics level.
When will the project be completed?
We are geared to complete the MHC haplotype sequences by the end of May 2011, which is the end of the funding period.
What is the significance of this project?
Our approach to molecular haplotype sequencing is, at the moment, quite unique. The entire world, including the 1000 Genomes Project, performs mixed diploid sequencing; we are the first, to our knowledge, to directly generate haploid sequences at a larger scale.
And our paradigm towards assembly of haploid sequences applies in particular to any highly variable region in the human genome. We human beings are diploid organisms, so the two haploid sequences of an individual are of key importance, and they are the very molecular basis to establish relationships between sequence, structure, and function. Ultimately, you need molecular haplotype sequences within defined disease-gene regions to pin down the causative variants and their functional implications in disease.
What about emerging sequencing technologies? Can you see anything in the future that could make your work even easier?
I would already be happy if the throughput doubled to 100 gigabases or more per run at reduced cost. That would carry us through the 100 MHC haplotype sequences. Not only could we sequence all MHC haplotypes, but we could also think about sequencing entire haploid genomes — that's our ultimate goal and the best you can do with this haploid reference resource.
For that, we would want longer reads and easier de novo assembly, because we cannot rely on the reference sequence anymore in structurally variable regions. Also helpful would be optimized matching algorithms and new approaches for using short-read data; even more thoroughly tested SNP calling and variation calling algorithms, including for small indel polymorphisms; and, for the SOLiD, reads in base space and much larger insert sizes, in the hundreds of kilobases, to cope with structurally highly variable regions.
For the long-term future, I personally would like to get rid of the entire front-end — that is, all these template-generation approaches. I would like to see single-molecule DNA sequencing or nanopore sequencing. I remember when I was in [George Church's] lab from 1992 to 1994, George already had a first project on single ion channel sequencing in the lab, where he had this vision to move an entire chromosome through a single pore and just measure the single bases by current changes. Basically, I would like to see such technologies where you can just sequence an entire chromosome within milliseconds and don't have all this labor with amplification of template and assembly of short reads.