Name: Vladimir Benes
Title: Head, Genomics Core Facility, European Molecular Biology Laboratory, Heidelberg, since 2001
Experience and Education:
Postdoc, biochemical instrumentation unit, EMBL, 1994
PhD in molecular biology, Institute of Molecular Genetics, Czechoslovak Academy of Sciences, Prague, 1991
Undergraduate degree in biochemistry, Charles University, Prague, 1985
Vladimir Benes started GeneCore, the genomics core facility at the European Molecular Biology Laboratory in Heidelberg, in 2001. The facility provides a variety of genomic services — including massively parallel sequencing, microarrays, and qPCR — to researchers at the EMBL and elsewhere.
On a recent visit to EMBL's main campus in Heidelberg, In Sequence spoke with Benes about how GeneCore has implemented high-throughput sequencing, and the challenges of offering this technology in a core facility. Below is an edited version of the conversation.
What technologies does GeneCore offer, and who do you serve?
The Genomics Core Facility at EMBL is primarily an in-house technology-driven service facility providing access to today's technologies, which include massively parallel sequencing, microarrays, qPCR, and some liquid handling. It's not only about processing samples but also about complete support of users, including experimental design, assistance with troubleshooting, and data analysis.
Users include researchers not only at EMBL Heidelberg but also at the EMBL outstations. Also, [the European Molecular Biology Organization] maintains several programs to support researchers in its member states, who get access to the core facilities. The primary objective is to satisfy EMBL users, but if capacity permits, we are able to accept samples from outside.
We also play a consultant role for member states considering the acquisition of massively parallel sequencing technology. We participate in piloting for them and people visit us to take a look [at the instrumentation].
GeneCore always works with research groups. It is not allowed to pursue any biology-driven project of its own; its work needs to be associated with technology. But of course, as some of the applications users are interested in, like single-cell transcriptomics, are pushing the limits, we work together to make things happen.
How are you currently equipped with sequencing technology?
Currently, there are three [Illumina] HiSeq 2000s and two GAs, but one GA is scheduled to be returned to Illumina before the end of November.
We are also now in a testing phase of Life Technologies' Ion Torrent PGM because we see that possibly the PGM could suit a particular group of users looking for sequence verification, something that would require quick access to data.
We are aware of the limitations of the amount of data produced [by the PGM], but this instrument doesn't pretend that it can match the Illumina [HiSeq 2000] capacity. Illumina delivers a very large amount of data, but in a relatively long time, and there will be a stage at which the PGM seems well suited to verifying results coming from the Illumina platform.
We have also been assessing other platforms out there, but at the end of the day, these two are here and working.
Due to the nature of a core facility, the emphasis is on the production of data for users. They come to us with their samples to get data back very quickly. We have a certain amount of time available for assessing and testing new things, but the objective is to fit the EMBL user base.
What types of applications do you mostly run, and what organisms do you analyze?
The types of samples we receive are very broad. We get a high number of ChIP-seq samples, a high number of RNA-seq samples. Genomic DNA or exomes are not so abundant yet, which is due to the nature of [research at] EMBL.
The range of organisms is relatively large. It's a bit of human but also Drosophila, Arabidopsis, Xenopus. There is one extremely interesting project where we participated in sequencing the genome of Platynereis dumerilii, a ragworm. It's an organism that's extremely popular in the evolutionary developmental biology world. There is a genome sequence that is fairly advanced and close to completion, but there is also RNA-seq data for different tissues of this animal, and projects to do [chromatin] immunoprecipitations for certain features of this organism.
How do you store sequence data?
GeneCore has 120 terabytes of storage space at its discretion, which is dynamic space that is used for the storage of data coming off the sequencers. After datasets are released to users, they have four weeks to copy them over, and then we remove them from our disks.
These 120 terabytes seem to be a lot, but there is overlap and you need some buffer. For example, it takes roughly one week to process data from one HiSeq run, running it through the complete pipeline.
We do not provide any long-term storage for our users. But for that, EMBL has prepared a solution: there is tier 2 storage capacity, which is also better priced for long-term storage.
We are now looking and considering possibilities for 'going cloud,' but it's still too early to tell how that's going to develop.
What kind of software do you use to analyze the data?
Illumina's pipeline and [its aligner] Eland have been improving and maturing very well, so for the mapping, in most cases, we use this for whatever reference genome is available.
For the secondary data analysis, it's a mixture of solutions. EMBL is in a very special position in that there is a very thriving and active bioinformatician community working on solutions, or preparing software packages for the data analysis of sequencing data.
But as a core facility, one also needs a solution that is covered by support and help. For that, we have teamed up with [German bioinformatics firm] Genomatix. There are different levels of experience among users. There are people who are command-line savvy, and they have no problem working in that environment, but there are also users who [require a graphical user interface], and for those, Genomatix is a good solution. It has also been developing and maturing, and I think that the capabilities it offers to users are adequate most of the time. The limitation is that it provides only a defined set of organisms that are annotated well enough to use the full set of functions it offers. So if someone works on de novo [data], it won't be suitable.
How much bioinformatics support do you offer to users?
They get three datasets: the sequences as such in FASTQ format; if applicable, aligned data; and a BAM file, which provides information about number of reads per position. We try to provide data that are clean, so they don't get contaminated data.
Two bioinformaticians working at GeneCore provide guidance to users, say, to identify appropriate solutions for the analysis, or training and coaching for novices. They also organize practical courses for people interested in data analysis.
Generally, users are responsible for the analysis of their data. There is no capacity available within GeneCore, or EMBL, to cover all that's coming out. So biologists should train themselves to be able to at least assess the data, to see whether they got something useful. For the very detailed analysis, they may possibly need to team with a biostatistician or bioinformatician.
What kind of migration have you seen so far from microarrays to next-gen sequencing?
The immunoprecipitation studies have essentially all gone to sequencing. Very occasionally, some people ask for ChIP-chip as a means to verify the sequencing data, but it's rather rare.
I believe that the second application that will go completely to sequencing will be microRNA profiling, mainly because this space is relatively small, so with the level of multiplexing that increased capacity allows, it is becoming accessible, and the biases are also being solved. The reads required are relatively short, about 30 bases. Preparing libraries for sequencing is not that much different from preparing them for hybridization [for microRNA profiling], and the time required to obtain data is also relatively short. We are thinking about the PGM as a tool for microRNA profiling. Our hope is that the new chip, the 318, which should yield a couple of million reads, will be sufficient for the microRNA-ome.
For RNA-seq, the situation is slightly different. I think people who are able to carry out bioinformatics analysis themselves tend to favor RNA-seq over arrays. However, where this competence is not completely there, people still feel more comfortable with arrays. For expression arrays (we use Affymetrix and Agilent), we even see a 20-percent increase in array usage compared to last year. It's a bit surprising, but the pipelines for analysis of RNA-seq data are not yet that robust. Not in the sense of their performance, since they can deliver the datasets or the positional information, but the interpretation of the data is much trickier than for the array. The picture that arrays provide, for better or worse, is clear, and there are users who feel more comfortable with getting that picture.
What have been the greatest challenges in running next-gen sequencing in a core facility?
The biggest challenge I have seen is the constant flux of the methodology provided by the companies, the constant improvement of kits, which may be real or may not be so innovative at the end of the day. But it's also about how the information about these changes is disseminated into the community. Sometimes there is a new version of a flow cell, but the information that you can only run it if you update the software of your instrument comes too late. We experienced a situation where the two were incompatible, but we learned about it only after the flow cell was already in the instrument. It's getting better, but still, there is this adage that 'the only constant thing in life is change.'
I think for a core facility, this is very difficult because it is also about legacy. Because if the data we deliver now are better somehow than they were a year or six months ago, does it mean the old data are useless? It's difficult to explain; this is a challenge in the interaction with users.
The fact that the output has increased so significantly means that people really can get the coverage they require without compromise. It's actually quite striking when we plan.
When Jan Korbel joined EMBL [as a group leader] two years ago, he considered sequencing the human genome to pursue a project, and asked me to prepare the budget. At that time, we needed four GA flow cells to get close to the 30x coverage required. Now, this can be delivered in three lanes of the v3 flow cell. This is amazing. It suggests that people now have the possibility to interrogate their space of interest comprehensively and under fairly affordable conditions.
What kind of competition do you face from service providers, like BGI or GATC Biotech? What's the advantage of an in-house core facility vs. outsourced services?
I think the biggest advantage perceived — and I hope shared by our users — is the ability to talk to us and to interact with the guys who prepare the libraries, who run the instruments, who are able to tell them, 'This amount is too little,' or 'This is too dirty.' I think this direct interaction, and the possibility to work out and sort out problems, is invaluable and priceless.
The other important aspect is that we are not restricting any user; meaning, if the sample meets our requirements on quality and quantity, we will process it. It can even be only one sample — it's not bound by volume or anything like that, in principle. However, even EMBL users are free to go out and use other services if they feel like it. Of course in that situation I should know why they do it, but I think it doesn't happen that often.
I think that competition of this sort is even welcome. For example, for ChIP-seq, Illumina's protocol requires 10 nanograms of DNA. We can do it from 1 nanogram, and if someone says, 'GATC can do it from 500 picograms,' I would try.
The fact that users are required to pay essentially only for running costs [at a core facility] makes it also more attractive than going to a company.
Do you expect in the future that more labs will purchase their own desktop sequencer, now that they are becoming less expensive, so that some of your work may migrate to individual labs?
That's a tough question. I think that for research labs working on higher eukaryotes, desktop sequencers will not fully satisfy their needs. Certainly, they can deliver for laboratories working on prokaryotic systems. The systems may be easier to operate, but there is still the library preparation step, which is a multi-step, complex protocol. Things can go wrong. From my experience with GeneCore, the fact that users don't have to worry [about sample prep], that they can bring their sample and get it done without spending time and effort, means they will favor a core facility over having it in the lab.
Do you expect the sample prep front end to become easier in the future?
We hear from Oxford Nanopore that [sample prep] might be straightforward [for nanopore sequencing]. You just flush the system with nucleic acid of your interest and it will go. I think that even PacBio, which offers single-molecule detection, requires certain sample handling and size selection steps. It is not that you pipette in your total RNA and you get only messenger RNA results back. I think for the foreseeable future — a year or two — there will be refinements, optimizations, maybe simplifications of the protocols, but the principal workflow will stay. Even more so for RNA-seq, because of course it's a misnomer, it still requires you to prepare cDNA, with all its issues and biases. So with Helicos [no longer selling its instruments], there is no system even pretending that it can sequence RNA directly.
How do you keep up with innovations in sequencing technology? How do you decide when to bring in a new platform, and what platform?
We strive to have feedback from our users. Active feedback in the sense of, 'I have this sample that current technology is not able to process for me, is there something out there?' So we do what you may call horizon scanning, and try to entice companies that may have promising new technology to provide access to it for assessment. I try to watch this field very closely. Also, since I know, by and large, what most of our users are interested in, I may also help make it happen.
Is there anything else you would like to mention?
It's important for prospective users to really think carefully about the reasons why they have decided on massively parallel sequencing. What is lost occasionally is that it is a tool: a very rapidly developing tool, a maturing tool, but it's a tool. It is still relatively time-consuming, and for ordinary users, it's a relatively expensive technology, not only on the money side but also in the time invested in the data analysis, only to find out after six weeks that it was for nothing. It is powerful; for discovery, it's absolutely unparalleled in its power, but it should be used with good experimental design in mind.
So what I emphasize to our users coming to us for the first time is, 'Think about what you think you are getting out of it, and why would you do that?' For example, some people dismiss arrays as a dead horse when I think for some applications, arrays are a perfectly adequate tool to obtain a result that is sufficient for the project you want to do. As we are not immune to fancy and hype and fashion, occasionally, this is lost.