Name: Ralf Herwig
Position: Group leader, bioinformatics, Department of Vertebrate Genomics, Max Planck Institute for Molecular Genetics, Berlin, since 2001
Experience and Education:
Scientific researcher, Max Planck Institute for Molecular Genetics, 1998-2001
PhD, faculty of mathematics and computer science, Free University Berlin, 2001
Undergraduate degree (diploma), faculty of mathematics and computer science, Free University Berlin, 1996
Ralf Herwig heads the bioinformatics group in the Department of Vertebrate Genomics at the Max Planck Institute for Molecular Genetics in Berlin. His team develops software and analysis tools for genomics and proteomics data, as well as methods for data integration. It is also involved in analyzing sequence data generated by the institute for the 1000 Genomes Project, which the MPI joined last summer.
On a recent visit to Berlin, In Sequence spoke with Herwig about his work and the challenges of handling second-generation sequencing data. An edited version of the conversation follows.
What kinds of research projects is your group involved in?
We have a large portfolio of experimental techniques that we mainly apply to extract information on human diseases. We have projects on cancer, which is one of our areas of focus, as well as on type 2 diabetes, neurodegenerative diseases, and Down syndrome. The institute organized the sequencing of chromosome 21 for the Human Genome Project — we have quite a long history in genome and sequencing projects.
My group is mainly involved in the bioinformatics analysis of these techniques. We have a strong relationship with the experimental groups, and we are very close to the data. For example, we developed methods to integrate and read out these massive amounts of data, as well as methods and tools for statistical analysis to get some meaningful information out of these data.
What experimental platforms do these data come from?
It's mainly sequence data, from either classical Sanger sequencing or from next-generation sequencing. Also, gene expression data from microarrays, and mass spectrometry data.
Can you mention a few projects you are participating in?
In Germany, I am coordinating a project on systems biology, together with Bayer-Schering and the German Cancer Research Center in Heidelberg, where we are doing mutation analysis on individual tumors. It's a project aimed at personalized medicine, and at what we can develop on the bioinformatics and systems biology side to analyze personalized data.
We had another project on the EU level with the University of Innsbruck in Austria, which has a large prostate cancer databank with tissues and a lot of patient data, where we analyzed networks and functional modules that are predictive for prostate cancer and prostate cancer disease progression.
We are also involved in a very large EU project with more than 20 partners, called [A European Model for Bioinformatics Research and Community Education, or] EMBRACE, which aims to integrate major European databases and software tools in bioinformatics.
Last but not least, we are involved in the 1000 Genomes Project, which we joined last summer. We were not involved from the beginning because we did not have enough sequencing capacity, but the German government provided us with some funding, so we could enlarge our sequencing facility.
We currently have five Illumina sequencers, four SOLiD sequencers, and two 454 sequencers. I think we are the largest sequencing facility in Europe besides the Sanger Center. We are also the only European partner in the 1000 Genomes Project besides the Sanger Center and the EBI.
What is the MPI's contribution to the 1000 Genomes Project?
The 1000 Genomes Project is divided into two layers: the data generation layer and the analysis layer. We participate in the data analysis and data handling and development of new methods, but we are also involved in generating data.
The overall goal of the project is to provide individual sequence information for more than 1,000 people. Basically, each data generation center is allocated a certain amount of gigabases of sequence data to generate from different individuals. Our task is to provide 700 gigabases of data, which corresponds roughly to 100 of those 1,000 people. For the pilot project, we contributed 10 percent of that, so 70 gigabases. That was finished in November of last year.
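These allocation figures can be sanity-checked with a quick sketch. The per-individual coverage below is an inference from a roughly 3-gigabase human genome, not a number stated in the interview:

```python
# Back-of-envelope check of the 1000 Genomes allocation figures quoted above.
total_gb = 700          # gigabases allocated to the MPI
individuals = 100       # people that allocation corresponds to
genome_gb = 3           # approximate haploid human genome size, in gigabases

per_person_gb = total_gb / individuals   # 7 Gb per individual
coverage = per_person_gb / genome_gb     # roughly 2x, i.e. low-coverage sequencing
pilot_gb = 0.10 * total_gb               # pilot contribution of 10 percent: 70 Gb

print(per_person_gb, round(coverage, 1), pilot_gb)
```

The roughly 2x coverage is consistent with the low-coverage design of the project's pilot phase.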
Can you talk about the challenges you have encountered dealing with next-generation sequencing data?
The major challenge is that the amount of data generated exceeds developments in computing. Moore's law says that computing power doubles each year. At the moment, sequence data increases something like 10-fold each year.
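The compounding effect of these two rates can be made concrete with a minimal sketch, using the figures as quoted in the answer above (a yearly doubling of compute against a 10-fold yearly growth in sequence data):

```python
# How the gap between data volume and compute widens under the quoted rates:
# compute doubles each year, sequence data grows 10-fold each year.
compute = 1.0
data = 1.0
for year in range(1, 4):
    compute *= 2
    data *= 10
    print(f"year {year}: data/compute ratio = {data / compute:.0f}x")
# The ratio grows 5-fold per year: 5x, then 25x, then 125x after three years.
```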
We have quite a good bioinformatics infrastructure here — we have something like 400 CPUs, a Unix cluster sitting behind the sequencers. Once the image analysis is done on the sequencers, we transfer the data to our Unix cluster, but it takes a lot of time. This is the main challenge for data handling and methods development: the massive amount of data you have. It's not like opening a file and reading the data; it's really gigabytes of data.
Another thing is storage: We invested a lot of money in storage capacity. It's primary data, and good scientific practice says that you have to keep primary data to make the analysis reproducible, so we keep as much data as we can.
Do you keep the image files?
No. But you can reconstruct that from the data that we keep. What we keep is something like half a terabyte per run. A run takes something like a week, depending on the technology, so you can estimate how much space we need per year.
How much storage space do you have?
We have something like 300 to 600 terabytes, but of course we also have other projects. I think for sequencing, we have something like 300 terabytes. We will probably run out in a year.
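The per-run size and instrument counts quoted earlier make the "run out in a year" estimate easy to reproduce. A back-of-envelope sketch, assuming roughly one run per instrument per week (an assumption, since the interview notes that run times vary by technology):

```python
# Rough annual storage demand for the sequencing fleet described above.
tb_per_run = 0.5                    # "half a terabyte per run"
instruments = 5 + 4 + 2             # five Illumina, four SOLiD, two 454
runs_per_instrument_per_year = 52   # assumption: about one run per week each

annual_tb = tb_per_run * instruments * runs_per_instrument_per_year
print(f"~{annual_tb:.0f} TB per year")  # ~286 TB per year
```

Against the roughly 300 terabytes available for sequencing, this is consistent with exhausting the storage within about a year.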
What about data analysis?
Of course it depends on the application — we are involved in many projects. Our department had a Science paper last year, for example, where we performed RNA-seq. Compared to microarrays, you have much more information, such as information on all splice variants. You need new methods for that, including new methods to normalize these data.
Then, we have projects where you specifically enrich parts of the genome that you are interested in, for example the exons, to look for mutations. It is known in cancer, for example, that drug susceptibility depends on mutations. Our goal is to take a patient's genome, enrich the exome, look for mutations, and predict what kinds of drugs would be beneficial for that patient. That relates to personalized medicine, which many people are interested in, and we have several projects in that area.
Another interesting set of genomic regions is methylation sites. We have projects where we look for epigenetic regulation, also in the cancer field, for example in colon cancer.
What kinds of analytical tools for these kinds of applications do you develop on your own?
For example, for the epigenetic project, there is no real solution available, so we are really at the forefront. For RNA-seq, of course there are many methods known from microarrays, which might be adapted to sequencing data.
Most recently, we developed a splicing prediction method for microarrays, which we recently submitted for publication. We want to extend that to sequencing data.
But on a lower level of analysis, for example, mapping algorithms, we don't develop these ourselves. We take what's developed by other people.
We are interested in statistical ways of interpreting data, and then in mapping these data onto gene networks to analyze those networks and disease-relevant modules. For example, in January, we published a paper on a database [called ConsensusPathDB, in Nucleic Acids Research] where we integrated many different types of human interactions, such as gene regulatory interactions, protein-protein interactions, and signaling pathways. That's our main interest: not just to look for, say, sequence variants, but also to characterize these sequence variants and somehow judge whether they are disease-relevant.
What have the new sequencing methods contributed to these analyses that you could not do before? What's been their main advantage?
I would say the main advantage is also the main disadvantage. They give you overwhelming and comprehensive information. If you are able to intelligently organize and filter these data, you have the best set of information that is currently available from any genomic experiment. But of course, because you have so much information, it's really hard to filter out the interesting things.
Take RNA-seq, for example. On microarrays, which were the gold standard for 10 years, you have probes that are reporters for certain genes. What you can detect in your experiment is limited by what is present on the array. With sequencing, you get everything that is in your sample, and you get everything distributed across the whole gene. People now routinely observe that you detect many transcripts that haven't been annotated in the databases. So what do you do with this kind of information? That's a nice but also a difficult problem.
What are you planning to use next-gen sequencing for in the future?
A very straightforward application is to sequence individuals with a disease background, and there are ongoing projects, like the cancer genome projects, that lead into the field of personalized medicine. This is the main application I see for the sequencing technology.