Name: Detlef Weigel
Position: Director, Department of Molecular Biology, Max Planck Institute for Developmental Biology, Tübingen, Germany, since 2001
Experience and Education:
Assistant and associate professor, the Salk Institute for Biological Studies, 1993-2002
Research fellow, California Institute of Technology, 1989-1993
Research associate, University of Munich, 1988-1989
PhD in genetics, Max Planck Institute for Developmental Biology and Tübingen University, 1988
MS in biology, University of Cologne, 1986
Detlef Weigel's group at the Max Planck Institute for Developmental Biology in Tübingen has been using next-generation sequencing since 2007, when it was one of the first laboratories to obtain an Illumina Genome Analyzer.
The lab, which is now equipped with a GAII and a HiSeq 2000, has been using sequencing to study genetic diversity in plants, which has included the sequencing of plant genomes, transcription factor binding site mapping, and one-step mutation identification. Weigel is also one of the leaders of the Arabidopsis 1001 Genomes Project, which aims to sequence 1,001 strains of the model plant Arabidopsis thaliana (IS 10/7/2008).
Last month, In Sequence visited Weigel and his team in Tübingen to find out more about how next-gen sequencing has transformed their work. Below is an edited version of the conversation.
How long have you been using next-gen sequencing?
We were one of the first users of the Illumina technology. We got one of the first 100 instruments in early 2007. That first instrument got constantly upgraded, and we later bought a second GAII, and then we traded in the first Genome Analyzer for a HiSeq 2000 last year.
Although the sequencing facility is a core facility, it's mostly used by my lab — over the past five years or so, my lab has been using it 90 percent of the time. Most of the sequencing that hasn't been for my lab has been for collaborators, local and throughout Europe. More people would use it; it's just that the analysis is difficult. In the past, for Affymetrix microarrays, it was a lot easier to enable people to do their own analysis. One thing we have done is set up the CLC Bio workstation, so that outside users are enabled to do at least some basic analysis of their data.
The facility is run by two people, who have developed protocols for different library types, but much of the library prep is done by students and postdocs. In the last six months, a major push has been to develop very large insert libraries for our de novo assembly projects. Illumina has no protocol for libraries larger than 5 kilobases, but Roche has a protocol for 20-kilobase libraries, so most Illumina labs use essentially the Roche approach for making large-insert libraries.
We sequence quite a few plant genomes with sizes between 150 megabases and a gigabase or so, and there, the large-insert libraries are extremely useful. You need to have really different insert sizes; just the small inserts and then one very large insert library is not good enough because the reads are so short.
We have distributed the primary analysis of the runs when they come off the machine. It's done by two people who take turns extracting the primary reads. The data are stored on a hard disk and then transferred to our servers, where we have 120 terabytes of space to process the data.
With the upgrade to the HiSeq, the major issue has become the back end. Within the next six months, we will have an additional 200 terabytes of storage, and the plan for the next 12 months is to go to 500 terabytes.
In 2007, when we had the first GAII, this was not really something we had to think about a whole lot. Storage now adds significantly to the cost of sequencing — somewhere between 5 percent and 10 percent of the consumables cost. Let's say you do somewhere between 50 and 100 HiSeq runs per year, so we're talking about approximately half a million euros in consumables, and then €50,000 for the storage. Before, it was negligible, and now, it starts to be a real cost.
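The storage estimate above is simple proportional arithmetic. A minimal sketch follows, using only the round figures quoted in the answer (about €500,000 in consumables per year, 5 to 10 percent of that for storage, 50 to 100 runs per year). The function and variable names are illustrative, and the per-run figure is implied by the quoted numbers rather than stated in the interview.

```python
# Illustrative arithmetic only; inputs are the round figures quoted in the interview.

def storage_cost_eur(consumables_eur: float, storage_fraction: float) -> float:
    """Storage cost expressed as a fraction of the yearly consumables cost."""
    return consumables_eur * storage_fraction


def consumables_per_run_eur(consumables_eur: float, runs_per_year: int) -> float:
    """Average consumables cost per run implied by a yearly total (an inference)."""
    return consumables_eur / runs_per_year


CONSUMABLES_EUR = 500_000  # "approximately half a million euros in consumables"

for fraction in (0.05, 0.10):  # the quoted 5 to 10 percent range
    cost = storage_cost_eur(CONSUMABLES_EUR, fraction)
    print(f"storage at {fraction:.0%} of consumables: about {cost:,.0f} euros")

for runs in (50, 100):  # the quoted range of HiSeq runs per year
    per_run = consumables_per_run_eur(CONSUMABLES_EUR, runs)
    print(f"{runs} runs per year: about {per_run:,.0f} euros of consumables per run")
```

The €50,000 quoted for storage corresponds to the upper end of the 5 to 10 percent range.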
Are you considering outsourcing data storage to a cloud service?
They are too expensive, and you have huge data amounts, and data transfer is a bottleneck.
What kinds of bioinformatics tools have you developed, and what role does bioinformatics play in your lab?
We have developed a whole software package called SHORE [for "short read"], which has all kinds of functionalities. This is the main platform we use in the lab to process data. It basically covers all the processing steps from the raw reads to the final SNPs.
There are 28 people in my lab, and of those, six are real bioinformaticians. There are probably another six to 10 people in the lab who are fairly versed in using software tools, so about half of the lab has at least some bioinformatics skills.
What would you say have been the greatest challenges in implementing this technology?
I think there is still a lot of confusion in the field about understanding the errors that the Illumina technology generates. The sequencing errors are easy to understand, but a lot of errors occur in the downstream processing, and that's something that, I would say, a surprisingly large fraction of the field does not handle very well. Quite a few of the things that we have been seeing that have been controversial — like RNA editing, for example — it's not the technology per se that leads to the confusion, but it's the analysis of the raw data.
We have always been very conservative in terms of interpreting the data, and also, I think different from many other labs, we have placed a lot of emphasis from the beginning on validating variants with Sanger sequencing. We still do this, and I find it rather amazing how many labs do not use Sanger sequencing to validate their results. This includes assemblies; we have just generated Sanger shotgun data to validate whole-genome assemblies.
When we first used Illumina technology to assemble a genome and then did some Sanger validation, the Sanger was the smaller fraction of the total cost of the project. Now, of course, the Illumina technology generates such a ridiculous amount of data that when you do Sanger validation, the cost of that will typically be many times the cost of producing the Illumina data.
What are the main applications for which you use next-gen sequencing?
Initially, resequencing was our main application, mostly of plant genomes. We have also used it for transcript analysis, both for measuring expression levels and for discovering splicing patterns; ChIP-seq; we do a lot of bisulfite sequencing analysis now; small RNA analysis; and then de novo assemblies.
Initially, we developed a reference-guided assembly, where we use a reference genome and then fill in the parts that are not covered by local assembly, and now we do complete de novo assembly.
We have also been using sequencing for looking at binding specificity of transcription factors, random binding site selection, basically.
A very important part is genetic mapping and reduced representation sequencing: determining genetic relatedness of individuals without sequencing the entire genome.
Sequencing technologies are advancing rapidly. What kinds of new technologies are you most interested in?
This is a matter of intense debate, let's say. Overall, our experience with Illumina has been very positive, so currently, we are discussing whether to get a MiSeq, mostly because it looks like it's going to be able to generate quite long reads.
We actually also have quite a bit of PacBio data that we obtained through collaborators. There, we're still on the fence, let's say. Currently, PacBio sequencing is very expensive, and at least the data from the initial C1 chemistry were extremely error-prone. We are right now looking at C2 chemistry data, and we're evaluating it for whole-genome assembly, but the price per data point is very, very high, essentially completely uncompetitive. So it will be an initial application where we will be better off outsourcing it.
Obviously, the Ion Torrent looks attractive as well, and if Oxford Nanopore turns out to hold water, that would be very attractive. But right now, nobody has seen any real data, so I think it's difficult to know.
I would say that the developers of sequencing technologies have become a lot more conservative. When you look at Helicos, little of what they promised came true. PacBio was also quite optimistic; I think they will eventually reach what they had envisioned three years ago, but it's slower than they had expected. I'm hopeful that Oxford Nanopore will be even more realistic in terms of what they are promising.
One of the things that I'm very happy about with Illumina is that they have been pretty conservative in terms of what they promised. Sometimes, this has been annoying, because it has been difficult to plan things because in the end, it progressed faster than what they had indicated.
What role does NGS play in your research, and how much has it transformed your work in the last few years?
My own research is in plant genomics. We actually produced the first HapMap in plants a few years ago with this crazy technology called microarrays. That was a collaboration with Perlegen Sciences; we used microarrays where we had 1 billion different oligonucleotides synthesized for interrogating the Arabidopsis thaliana genome. From there, it was a logical follow-on to start resequencing.
We showed early on that the Illumina technology worked very well for Arabidopsis thaliana-sized genomes, and that was an important driver to then advocate the 1001 Genomes Project, which is similar to the 1000 Genomes Project for humans. We published the first installment last year, which was just based on 80 genomes (IS 8/30/2011).
What we published last year was based on data that we generated two years earlier. And it was not because it took a long time from submission to publication; it was just that it took a long time to analyze the data. And it wasn't the analysis of the primary data; it was … really the downstream analysis that took a long time.
It's often difficult to decide when to do the experiment, but also, whether to generate data anew. We have data we generated several years ago at an enormous expense, compared to what you could do now with a much smaller expense and often at higher quality. But of course you have to make a decision at some point when you actually use the data.
Can you provide an update on the Arabidopsis 1001 Genomes Project?
Right now, there are about 500 genomes available that one can look at. About 200 have been sequenced by Magnus Nordborg and his colleagues at the Gregor Mendel Institute of Molecular Plant Biology, about 200 have been sequenced by Joe Ecker and his colleagues at the Salk Institute, and maybe another 30 or 40 or so by others.
Right now, my lab, the Nordborg lab, the Ecker lab, and the lab of Joy Bergelson at the University of Chicago are waiting for final delivery of another 500 genomes from Monsanto. The company has generated the data in house in order to validate its own pipeline and is making it available for free to the academic community. Certainly before the end of this year, we are going to have over 1,000 complete Arabidopsis genome sequences.
When are you expecting to publish more results from this project?
The first HapMap that we generated a while ago was the basis of an Affy SNP genotyping array, which was used to type 1,200 or so genomes. Now, we are going to have these full genome sequences of 1,000 individual plants, and the bottleneck is actually more working with the biological material.
For phenotyping, and connecting genotype and phenotype, it would be preferable to use the strains for which there is the full genome sequence available. But there is this other collection of strains that is already easily available from the stock center, which was genotyped with the Affy array. There were fewer strains available at the time, and we knew much less about genetic diversity, so they were not chosen in the best possible way. So you would like to use the latest set, but people, of course, already invested a lot in phenotyping.
What would you say are the most interesting insights you have so far gained from sequencing plant genomes?
One of the first applications that we used the Illumina technology for was to determine the spontaneous mutation rate. This was something that was not possible before because with conventional sequencing technology, you need to have extremely low false positive rates. It's certainly interesting that the raw error rates of these next-generation sequencing technologies are quite high compared to Sanger sequencing, but because you can generate so much sequence, you can make that error go away.
I still find this amazing. When we looked at these mutation accumulation lines, there were 20 mutations per genome, and each genome is on the order of 100 million bases, so that's one mutation in 5 million bases analyzed. Our false positive rate was essentially zero, and the false negative rate was very small, on the order of 20 percent; we estimated that we caught about 80 percent of our mutations. We have more recently resequenced some of these lines, and it turned out to be a pretty good estimate. This was with 36 base pair single-end Illumina reads. The early Illumina technology was perfect for this because the mutation accumulation lines were derived from the reference strain, for which there is an extremely high-quality Sanger sequence, which made it relatively straightforward. Personally, I think this is one of the things that I'm really quite proud of.
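The numbers in this answer can be checked with a few lines of arithmetic. Dividing roughly 100 million bases by 20 mutations gives about one mutation per 5 million bases. The sketch below also corrects the count for the quoted detection rate of about 80 percent. It assumes that the 20 mutations are the detected count, which the interview does not state explicitly, and all names are illustrative.

```python
# Illustrative arithmetic; figures are the approximate values quoted in the interview.

GENOME_SIZE_BP = 100_000_000  # "on the order of 100 million bases"
DETECTED_MUTATIONS = 20       # assumed here to be the detected count per genome
SENSITIVITY = 0.80            # "we caught about 80 percent of our mutations"


def bases_per_mutation(genome_size_bp: int, n_mutations: float) -> float:
    """Average number of bases analyzed per mutation found."""
    return genome_size_bp / n_mutations


def sensitivity_corrected_count(detected: int, sensitivity: float) -> float:
    """Estimated true mutation count given the fraction of mutations detected."""
    return detected / sensitivity


spacing = bases_per_mutation(GENOME_SIZE_BP, DETECTED_MUTATIONS)
estimated_true = sensitivity_corrected_count(DETECTED_MUTATIONS, SENSITIVITY)

print(f"about one mutation per {spacing:,.0f} bases")
print(f"estimated true count at {SENSITIVITY:.0%} sensitivity: about {estimated_true:.0f}")
```

At that density, even a false positive rate of one call per million bases would outnumber the real mutations several times over, which is why a false positive rate of essentially zero matters for this application.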
The other thing that is important to me is forward genetics. A few years ago we developed a method that uses sequencing directly to map new mutations and also to identify the mutations in the same sequencing reaction, and that has been completely revolutionizing forward genetics. It's becoming so cheap that you can think about genetics in a really different way.
A lot of genetic screens have been used pretty much to exhaustion. If you are interested in a certain phenotype, you would not just continue previous screens because it's so much work to map the mutations, and if 90 percent of them are in known genes, then it would be quite wasteful if you had to map them by a conventional method or just even doing the complementation process. But now you only need something like 3 gigabases of HiSeq sequence data, so basically, you can map and find a new mutation for something like $100 or so.
In the old days, four or five years ago, it would take six person months to do this, more or less, and the cost of mapping a single mutation was anywhere between $10,000 and $50,000. Now you can do it for $100.
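As a rough check on these figures, the sketch below computes the cost reduction implied by the quoted prices, and the average coverage that 3 gigabases of sequence would give. The genome size of about 100 million bases is an assumption carried over from the Arabidopsis figure mentioned for the mutation accumulation lines; the interview does not tie the 3-gigabase figure to a specific genome. All names are illustrative.

```python
# Illustrative arithmetic; prices are the round figures quoted in the interview.

def fold_reduction(old_cost_usd: float, new_cost_usd: float) -> float:
    """How many times cheaper the new approach is than the old one."""
    return old_cost_usd / new_cost_usd


def mean_coverage(sequenced_bases: int, genome_size_bp: int) -> float:
    """Average number of times each base of the genome is sequenced."""
    return sequenced_bases / genome_size_bp


NEW_COST_USD = 100
for old_cost in (10_000, 50_000):  # the quoted range for conventional mapping
    fold = fold_reduction(old_cost, NEW_COST_USD)
    print(f"${old_cost:,} -> ${NEW_COST_USD}: about {fold:.0f}-fold cheaper")

SEQUENCED_BASES = 3_000_000_000      # "something like 3 gigabases"
ASSUMED_GENOME_SIZE_BP = 100_000_000  # assumption: an Arabidopsis-sized genome
coverage = mean_coverage(SEQUENCED_BASES, ASSUMED_GENOME_SIZE_BP)
print(f"average coverage under that assumption: about {coverage:.0f}x")
```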
What this mapping by sequencing also means is that for any organism where you can do genetics, you can find genetic variants through forward genetics. As long as you can produce a segregating population, you can essentially find the genes by fine mapping.
The other side of this is that sequencing is making genotyping extremely cheap. When we started to do large-scale genotyping on the Sequenom [MassArray] platform about eight years ago or so, we were looking at prices of 10 cents to 20 cents per data point, and these were already very good prices. Now, you can easily get 10,000 data points for $10, so one data point for 0.1 cents, so you can genotype a very large number of individuals for very little cost.
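The per-data-point comparison can be worked through in the same way. This is a minimal sketch using only the prices quoted above; the function names are illustrative.

```python
# Illustrative arithmetic; prices are the round figures quoted in the interview.

def cost_per_point_cents(total_usd: float, n_points: int) -> float:
    """Cost of a single genotype data point, in US cents."""
    return total_usd * 100 / n_points


def fold_cheaper(old_cents: float, new_cents: float) -> float:
    """How many times cheaper one data point has become."""
    return old_cents / new_cents


sequencing_cents = cost_per_point_cents(total_usd=10, n_points=10_000)
print(f"sequencing-based genotyping: {sequencing_cents:.1f} cents per data point")

for old_cents in (10, 20):  # the quoted 10 to 20 cents per data point
    fold = fold_cheaper(old_cents, sequencing_cents)
    print(f"{old_cents} cents -> {sequencing_cents:.1f} cents: about {fold:.0f}-fold cheaper")
```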
How is next-gen sequencing going to contribute to plant genetics going forward?
I think we are going to see genetics with many more organisms. Before, you had to develop a genetic map first, which was a hurdle that has disappeared because you can produce genome sequences quite simply. You had to discover the SNPs; that's not an issue anymore. Then you had to develop a platform to monitor those markers, which was based on specific strains. Now, with the sequencing, you don't have to develop the platform; it doesn't matter what the markers are that distinguish those strains.
What that means for agriculture is, if you have relatives of crop plants where you have interesting traits, if the traits are segregating in a species, you can set up crosses and you can get to these traits. Disease resistance, for example, is one of the most interesting, or most obvious, traits. Or let's say you have a plant that makes a particularly interesting chemical, and there is variation for this trait in that species, you can go after the genes responsible for this trait, or you can do a mutagenesis and try to knock out the ability to produce that chemical and find that gene by genetic mapping.