Name: Susan Celniker
Position: Head, Department of Genome Dynamics (since 2008), and co-director, Berkeley Drosophila Sequencing Program (since 1996), Lawrence Berkeley National Laboratory
Experience and Education:
Research fellow, then senior research fellow, senior research associate, Division of Biology, California Institute of Technology, 1983-1996
PhD in biochemistry, University of North Carolina, Chapel Hill, 1983
BA in biology and anthropology, Pitzer College, Claremont, Calif., 1975
Sue Celniker heads one of 10 research groups that participate in the Model Organism Encyclopedia of DNA Elements, or modENCODE, project. The effort, launched by the National Human Genome Research Institute in 2007 and funded with $57 million over four years, aims to identify all functional elements in the genomes of the fruit fly, Drosophila melanogaster, and the roundworm, Caenorhabditis elegans.
As part of the project, Celniker's modENCODE group, which includes researchers at six other institutions, was awarded a $14.5 million grant two years ago for the "Comprehensive Characterization of the Drosophila Transcriptome." In Sequence recently spoke with Celniker, who heads the department of genome dynamics at Lawrence Berkeley National Laboratory, to find out what the project has achieved at its halfway point, and what role new sequencing technologies are playing in it.
Can you give a brief overview of the goals of modENCODE, and an update on where the project stands today?
The project was started by NHGRI to augment the human ENCODE project, which at that point had focused on only one percent of the human genome. They thought the next step up in scale would be to study worms and flies, whose genomes are about a thirtieth the size of the human genome, and then scale up to do the entire human genome. The other advantage of having worms and flies is that they are both genetic model organisms, so validation and testing of models would be significantly easier.
There are 10 groups that constitute the modENCODE consortium, with parallel groups for most projects in worm and fly. These groups study the transcriptome (including mRNAs, non-coding RNAs, transcription start sites, untranslated regions, and miRNAs); the regulation of transcription, focusing on transcription-factor binding sites; chromatin marks; and DNA replication.
We just published a marker paper describing the data types being produced and our plans for data integration. Our data can be obtained from the modENCODE project website, where they can be viewed in a browser or downloaded by FTP.
When the project started, most of the new high-throughput sequencing technologies were just out of the gate. How are they being used in the modENCODE project today, and what advantages do they offer over, for example, microarrays?
The group that I head, for example, proposed to use microarrays with 38-base-pair resolution to profile the fly transcriptome. Sequencing, at one-base-pair resolution, is a significant increase in resolution. Most of our work in transcription profiling has been done using the Illumina sequencing technology.
So far, we have analyzed 24 cell lines by microarrays, and we have repeated four of them by RNA-seq. We have completed 31 developmental time points on microarrays, and are in the process of analyzing the same samples using RNA-seq. We will have an enormous amount of data to compare both approaches. We are just in the process of collecting the RNA-seq data this summer. We have 12 samples close to being captured at about 15 million reads for each state. We want to figure out whether we reach saturation or not, and we don't know that yet.
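The saturation question raised here can be illustrated with a small subsampling exercise. The following is a minimal sketch, not the modENCODE group's method: it assumes each read has already been assigned to a feature (a gene or exon, say), uses simulated toy data, and the function names `saturation_curve` and `marginal_gain` are hypothetical. The idea is to count how many distinct features are detected at increasing read depths and see whether the curve flattens.

```python
import random


def saturation_curve(read_assignments, fractions, seed=0):
    """Return (depth, distinct features detected) for nested random subsamples.

    read_assignments: one feature label per read (assumed precomputed).
    fractions: fractions of the full read set to evaluate.
    """
    rng = random.Random(seed)
    shuffled = list(read_assignments)
    rng.shuffle(shuffled)  # shuffle once so each subsample contains the previous one
    curve = []
    for frac in sorted(fractions):
        depth = int(len(shuffled) * frac)
        detected = len(set(shuffled[:depth]))
        curve.append((depth, detected))
    return curve


def marginal_gain(curve):
    """Newly detected features per additional read between the last two depths."""
    (d0, n0), (d1, n1) = curve[-2], curve[-1]
    return (n1 - n0) / (d1 - d0)


# Toy data (an assumption for illustration): 2,000 features with skewed
# abundances, sampled as 50,000 reads.
rng = random.Random(1)
features = [f"feature_{i}" for i in range(2000)]
weights = [1.0 / (i + 1) for i in range(2000)]
reads = rng.choices(features, weights=weights, k=50_000)

curve = saturation_curve(reads, [0.1, 0.25, 0.5, 0.75, 1.0])
for depth, detected in curve:
    print(depth, detected)
print("new features per extra read:", marginal_gain(curve))
```

A marginal gain close to zero at full depth would suggest that additional reads are finding few new features; a gain that is still clearly positive would suggest the sample has not yet reached saturation.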
It's easier to identify splice sites and splice variants with the one-base-pair resolution. One aim of our grant is to understand the control of splicing, a project directed by Brenton Graveley at the University of Connecticut Health Center. He is knocking down components of the RNA binding machinery using RNAi, and then sequencing the products to identify changes in splicing. All of his work is done by RNA-seq. It's very difficult to design a microarray that would capture all the different putative splice variants. We could not do that project, realistically, without having switched to RNA-seq.
We are also doing rapid amplification of cDNA ends, or RACE, and initially proposed to clone and sequence the products in order to identify transcription start sites. Now we have a pooled strategy where we can sequence hundreds of products. We have been using 454 for that, but we are planning to move to Illumina to compare the two, since 454 is more expensive than Illumina. It's been truly revolutionary, the amount of data we can capture.
What about ChIP-seq?
Gary Karpen at Lawrence Berkeley National Laboratory has done quite a bit of microarray work with chromatin-associated proteins, and I know they are switching to ChIP-seq. It's more expensive for them than doing ChIP-chip, but with the stimulus package, everyone has written supplemental grant applications, and NHGRI will make a decision as to whether the data from sequencing is important to capture.
How do you weigh higher cost against advantages that sequencing offers?
The cost is still higher to do sequencing than arrays, depending on the size of the genomes and the depth of sequencing, but we feel that the higher resolution is worth the cost. We are still in the early stages of evaluating the types of data that you get from both technologies.
What technical improvements to the sequencing platforms would you find most useful for the modENCODE project?
Obviously, length. We are doing 1x76 base and 2x76 base reads on the Illumina platform. Our goal is to identify complete transcription units, so the longer the read, the more confidence we have in mapping to the genome, especially across splice sites, and the easier it will be to generate complete transcription units.
Is there one preferred sequencing platform among the 10 participating groups as of today?
No. We also have a project with SOLiD. Roger Hoskins, who is part of our group here at LBNL, was awarded a second prize in Applied Biosystems' $10K Genome grant program, and ABI will sequence 12 of the developmental time points using the SOLiD platform. We will be able to compare those SOLiD data, which are stranded, to our other data. Strandedness is extremely important for disentangling overlapping genes. Twenty percent of the genes in the Drosophila genome are now known to be overlapping, and that number will rise with more complete annotation. So it's important to have the stranded information.
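To make the point about strandedness concrete, here is a minimal sketch using made-up coordinates. The gene names, positions, and the `assign` helper are hypothetical, and the sketch assumes half-open intervals and a library protocol in which the read's strand matches the strand of the transcript it came from. It shows how a read that falls where two genes overlap on opposite strands is ambiguous without strand information and resolved with it.

```python
from collections import namedtuple

Gene = namedtuple("Gene", "name start end strand")
Read = namedtuple("Read", "start end strand")


def assign(read, genes, stranded):
    """Return the names of genes the read overlaps.

    Coordinates are half-open [start, end). When stranded is True, a gene
    counts only if it lies on the same strand as the read.
    """
    return [
        g.name
        for g in genes
        if read.start < g.end and g.start < read.end
        and (not stranded or g.strand == read.strand)
    ]


# Two hypothetical genes on opposite strands that overlap at positions 400-500.
genes = [Gene("geneA", 100, 500, "+"), Gene("geneB", 400, 800, "-")]

# A 76-base read lying entirely inside the overlapping region.
read = Read(420, 496, "-")

print(assign(read, genes, stranded=False))  # both genes: ambiguous
print(assign(read, genes, stranded=True))   # only the minus-strand gene
```

Without strand information the read is compatible with both genes; with it, only the gene on the matching strand remains.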
Do all 10 groups have access to next-gen sequencing?
Most of the groups do. More and more universities are getting machines, which are being added to most sequencing cores.
What are some of the most interesting biological results from your analyses of the data so far?
It's early days for our analysis efforts, but the most interesting result to date comes from an effort led by Manolis Kellis at MIT to integrate the chromatin marks and expression data to determine different chromatin states in the genome. They have identified approximately 20 different states in flies. They will also integrate David MacAlpine's origin of replication mapping and Kevin White's transcription factor binding data. We then hope to have a picture of the chromatin landscape and understand how that changes through development.