Name: Jan Korbel
Age: 36
Position: Group leader, Genome Biology unit, European Molecular Biology Laboratory, Heidelberg, since 2008
Experience and Education:
Postdoc, EMBL, 2008
Postdoc, Yale University (Mark Gerstein's group), 2005-2007
PhD in molecular biology, Humboldt University, Berlin, 2005
Undergraduate degree in biotechnology, Technical University Berlin, 2001
Four years ago, as a postdoc in Mark Gerstein's group at Yale University, Jan Korbel published a groundbreaking paper in which he used a new sequencing-based paired-end mapping strategy, developed in collaboration with 454 Life Sciences, to analyze structural variations in two human genomes at high resolution (IS 10/2/2007).
Since then, Korbel has moved on to the European Molecular Biology Laboratory in Heidelberg, where he is a group leader in the Genome Biology unit. During a recent visit to EMBL, In Sequence spoke with him about his current work to elucidate the origins of structural variations in the human genome and their role in disease — in particular cancer. Below is an edited version of the conversation.
Tell me how you apply high-throughput sequencing in your research today.
We are a laboratory here at the EMBL that has both a wet lab and a number of people doing computational analysis. The way we use next-generation sequencing is twofold: We contribute to large projects where sequencing data are being generated, for example the 1000 Genomes Project, where we are not doing data production but merely data analysis.
But we also have various additional projects where we produce and analyze our own data, or where we participate in sequencing samples together with collaborators. Notably, we are involved in three projects within the framework of the International Cancer Genome Consortium [on pediatric brain cancer, prostate cancer, and lymphoma], where our primary goal is to study the molecular alterations that drive particular tumors.
Do you have your own sequencers, or do you use the EMBL genomics core facility?
The essence of the EMBL system is that large equipment is usually shared. We are a very heavy user of the genomics core facility, run by Vladimir Benes (IS 12/6/2011), which has five Illumina sequencers and which most people engaging in sequencing at the EMBL use. These are shared by groups on a first-come-first-served basis, which works really well. The turnaround times are short.
I'm aware of the possibility of submitting samples to sequencing centers, such as BGI, which many groups at the EMBL do to some extent, especially for large-scale projects. I have to say, though, that there is a huge advantage in having a good and robust and strong genomics core here in house, because we can always play with the protocols, we have very short turnaround times — as short as two weeks at the extreme — which lets us act accordingly when we have certain findings. So most of the sequencing we do, including for the International Cancer Genome Consortium, will be run here at EMBL.
What new approaches to detect structural variations are you working on?
Structural variation is still inherently difficult to detect. The situation has dramatically changed compared to 2006, when array platforms were used to detect structural variations. They often were not sufficient in resolution to delineate the precise boundaries of these variants and therefore provided information that was useful but limited in telling us which genomic sequence was actually encompassed by the variants. To infer [their] impact [completely], it's really crucial to understand with high precision and accuracy where structural variations occur. This situation has improved [with next-generation sequencing].
What still needs to be improved is the detection and analysis of structural variations in more complex regions of the human genome. The present situation is that sequencers can sequence human genomes very fast, but the most widely used sequencers still use fairly short DNA reads that will not map into more complex areas, such as segmentally duplicated areas. However, those complex areas are the ones that have the most structural variant content, and they also appear to undergo structural variation formation at high rates. Some hotspots lie in exactly these regions, which are inherently difficult to analyze.
We are trying to improve this situation by doing methods development, both computationally and experimentally. The latter mostly entails using a combination of short-read and long-read technologies, plus a combination of different paired-end mapping strategies with different insert sizes, and integrating these data to approach, essentially, the assembly of more complex regions, or at least the accurate delineation of structural variations in these regions.
What long-read technologies do you use for that?
At the moment, we are piloting the use of PacBio's strobe read sequencing technique for that. In addition, we are collaborating with Complete Genomics on their long fragment read technology. Those are the main two approaches we are presently pursuing, but we are very open toward other techniques that would be complementary to those. I still see a lot of potential for improvement there. [For example], I see lots of potential for optical mapping to be complementary to sequencing; and also for new sequencing techniques that may be coming up in the future, such as Oxford Nanopore's approach, which might enable long-read sequencing.
How are you contributing to the ICGC projects?
The contributions are not the same for all three projects. For instance, in the pediatric brain tumor project [called PedBrain], which we conduct in collaboration with the German Cancer Research Center, the Max Planck Institute for Molecular Genetics in Berlin, and the University of Düsseldorf, we coordinated a pilot study that finished very recently. We are now moving into the first large-scale phase with that project, which is directed by Peter Lichter at the DKFZ (CSN 12/7/2011).
For the pilot study, which was led by us at the EMBL, we did the genome sequencing, using Illumina sequencers and a regular paired-end sequencing strategy with a short insert size, but we also used a paired-end mapping strategy with a longer insert size. That is comparable to the strategy we published in 2007, but instead of 454 it uses the so-called Illumina mate-pair protocol, which relies on circularization and achieves fairly long insert sizes, enabling us to be very sensitive in detecting structural rearrangements.
What did you find in the pilot study?
The findings are very exciting, but I can't really talk about them. I think they will be published sometime early next year. Just to give you a very vague overview, we found a surprising number of structural rearrangements, more than had been initially seen with microarray-based studies on this tumor entity.
Generally speaking, what role does structural variation play in cancer, and how can this knowledge be exploited in cancer treatment or diagnosis?
From what we see in pediatric brain tumors, their prognostic impact is very high. And we have observed that in another class of tumors as well. There are specific structural variations that associate with overall patient survival that can be three or four times lower than for patients with the same tumor who don't have this set of structural variants.
In the childhood brain tumor medulloblastoma, structural variations had previously been associated with diagnosis, but we could extend that work. Previous observations, published a few years ago, were among the primary motivations for us to contribute to that field.
Do you believe similar results will emerge for other tumor types, or are they very tumor-type specific?
Structural variations are tumor-type specific. We have analyzed additional tumors — prostate cancer and leukemia — and have found a similar importance of structural variations in those. But there are also other tumors where structural variation appears to not play such a strong role.
What's the focus of your more basic research?
We really try to answer the question, 'How do structural rearrangements in the genome occur?' using both experimental approaches, where we look into particular model systems, and, to a large extent, computational approaches, where we sift through large datasets and try to understand where and in what context structural rearrangements occur, for instance in the context of repeat sequence; how they are oriented towards each other; whether they are complex or simple. From these observations, we try to infer which molecular mechanisms were involved in forming them.
What have you found out so far about how structural variations come about?
With a large number of collaborators from the structural variation analysis group of the 1000 Genomes Project, we published one finding earlier this year in Nature. What we report there is that there are over 50 regions in the genome that we call 'hotspots of structural variant formation.' But they are more than that: they are hotspots for a particular formation mechanism, of which there are different ones that operate in the genome. Slippage, for instance, can lead to the expansion or shrinkage of variable number tandem repeats; that's a replication-associated structural variation class. Mobile elements move around in the genome, but we found they don't cluster so much; they appear to be uniformly distributed. A mechanism that clusters very strongly is a recombination-based mechanism called non-allelic homologous recombination. Actually, 80 percent of the hotspots we identified in the genome are hotspots of recombination.
Is there any other project that you would like to mention, that you are excited about?
I'm excited that Complete Genomics recently proposed to contribute a large number of deeply sequenced genomes to the 1000 Genomes Project (see related story, this issue). One possibility, which is being discussed among participants of the 1000 Genomes Project, is to sequence parent-offspring trios deeply.
That would be very exciting; it would add many deeply sequenced genomes to the project. There would be some overlap with genomes sequenced with Illumina [technology], which means different sequencing technologies being applied to the same genomes, which will affect not only SNP and indel mapping but in particular also structural variant mapping. The trio information would be highly valuable to verify structural variations that we observe. Also, [it will help] to detect rates of formation of structural variants, and to use the complementary information from Complete Genomics and Illumina, if available, to become better at ascertaining structural variations in complex regions of the genome.