Associate Professor, Expression Genomics Laboratory
Institute for Molecular Bioscience, University of Queensland
Name: Sean Grimmond
Position: Associate professor, Expression Genomics Laboratory, Institute for Molecular Bioscience, University of Queensland, St. Lucia, Australia, since 2008
Experience and Education:
— Head of Expression Genomics Laboratory, IMB, University of Queensland, 2001-2007
— Group leader, Queensland Institute of Medical Research, 1999-2000
— (Postdoctoral) research fellow, MRC Mammalian Genetics Unit, Harwell, Oxford, UK, 1997-1999
— Research officer, Queensland Institute of Medical Research, 1994-1997
— PhD in pathology, University of Queensland, 1994
— BS in animal sciences, University of New England, 1987
Sean Grimmond founded the Expression Genomics Laboratory at the Institute for Molecular Bioscience at the University of Queensland in Australia in 2001. In his research, he uses transcriptome information to study genetic networks that control developmental processes and define pathological states.
Last fall, his laboratory was one of the first to receive an Applied Biosystems SOLiD sequencer, and it obtained a second instrument at the end of last year. In May, Grimmond and his colleagues, in collaboration with ABI scientists, published a paper in Nature Methods in which they used the SOLiD system to profile the transcriptomes of stem cells.
In Sequence talked to Grimmond by phone last week while he attended an ABI user meeting in Barcelona.
When did you first get interested in using high-throughput sequencing for transcriptome analysis, and how did your collaboration with Applied Biosystems come about?
My lab has been involved in studying the transcriptome at many levels for more or less the last 10 years. We have been using microarrays and high-throughput in situ screening, and my lab has contributed a lot to the [Functional Annotation of the Mouse] consortium. From very early on through to the current projects going on at FANTOM, we have been involved. That really got us into sequence tag technologies, working with [Cap Analysis Gene Expression] and ditag data.
As part of the FANTOM 3 project, where the aim was to look at transcription complexity, or just how many transcripts were coming from each locus, we really got to a situation where, if we were finding three, four, or five transcripts coming from each locus, we wanted to then put those into a biological context. Rather than knowing that across 500 cDNA libraries, you might get five or six different transcripts, we wanted to have some way to measure individual transcripts in biological states.
So we dabbled there with arrays, exon arrays and junction arrays, and found that that was not optimal. And then we moved to CAGE and [Massively Parallel Signature Sequencing], which we knew would help at the 5’ and 3’ ends, and then we were saying, it would be great to try a shotgun transcriptome approach to start to see if we could measure that complexity.
Towards the end of 2006, I was fortunate enough to catch up to Kevin McKernan [at Applied Biosystems]. They had just started to describe the SOLiD technology. We had all the pre-existing infrastructure that we had built, based on the FANTOM data, and now we wanted to use that prior knowledge with the genome sequencing.
I was in Boston for an NIH meeting — I am involved in a large program making a molecular transcriptome atlas of the developing urogenital system — and I figured I would look him up. We pretty much hit it off from there.
Tell me briefly about your recently published study in Nature Methods — what did you set out to do, and what were the most interesting results?
There were [several] aims to the project. One was, really, from a genome biology point of view, and a methodological point of view, we wanted to see what insight we can derive from massive-scale very short tag sequencing. The question we wanted to address is, ‘Can we measure gene activity more sensitively than what we see with our current gold standards, which would be our standard microarray platforms?’
The second question was, ‘Can we identify transcript-specific expression, so if you have three transcripts coming from a locus, can we now quantify each of those?’
The third question was, ‘What biological insights can we gain from looking at an experiment?’
In this case, we were looking at mouse [embryonic stem] cells, the undifferentiated cells, and then we had a look at ES cells that had been pushed towards mesoderm development, a classic developmental switch that you can do in a dish. And then we had a look at what was happening to major sets of genes that are involved in ES cell maintenance and differentiation.
And the final thing we wanted to have a look at was, what sort of insights could we get from the sequence itself, rather than just looking at the transcripts.
The bottom line from the paper was that we could identify a large number of genes, and indeed, our sensitivity was much higher than what we see on arrays. We were seeing 25 percent more genes that we could measure robustly by the sequencing, compared to what we see on the array platforms, where they are below the level of detection.
We were able to show that we were seeing large numbers of individual transcripts being expressed, which, depending on the pathways that we were looking at, could double or even triple the actual transcripts that come from those pathways. This basically means, if a pathway, like TGF-b signaling, has 25 known components, once you look at complexity, we now may be talking about 75 components to that model. So we can really start to try to factor how complexity might be involved in that biology.
And then, we certainly were able to show that we could detect SNPs within the sequence. We could do that in the mouse model system, which gives us a lot more confidence to move on to pathogenic samples.
The final thing I would say is that this is a wonderful opportunity for transcriptome discovery. What we were seeing is, about a third of the transcripts did not match known exons. When we sequenced to 10 gigabases in ES cells, … we were seeing a large amount of that expression clustering in places we had never really looked at before, say repeat elements. We have always ignored [these elements] in genomics on our arrays; now we could actually measure [them], and we know that about 250,000 of them are actively expressed, and about 30,000 of them showed dynamic expression as cells differentiate. It’s a whole new area of the genome that we have not really had a chance to look at before.
There is also a large amount of sequence associated with ultra-conserved regions, where there has not been prior transcriptional evidence, as well as sequences that are closely associated with known genes, which gives us more opportunity to look at gene discovery.
What are the main advantages of the technology over arrays? Also, what were the limitations; what can you still not detect with this methodology?
Compared to arrays, we completely outperform them with respect to sensitivity, reproducibility, and quantification. We get a lot of validation of biological replicates as well as technical replicates. Our correlations were 0.94 for biological replicates and 0.99 for technical replicates in the library making and sequencing process, when we wanted to quantify gene expression. And when we did qRT-PCR across about 80-odd transcripts, our correlations were about 0.86. And that’s across the entire dynamic range of transcripts, and we are getting below the level of sensitivity on the arrays there. In that respect, if you wanted to completely characterize a transcriptome, this is the way to do it.
Where I think arrays are critically important, and my lab is constantly doing arrays in a lot of other experiments, is that my lab can run 100 arrays in a week, but there is no way we can run 100 samples through next-gen sequencing and do all the bioinformatics.
So the throughput between the two is a very big difference. If I want to analyze gene expression data from an array, I can do it on my laptop; if I want to analyze 10 gigabases of transcriptome sequence, I need parallel computing and some real expertise to make sense of that.
I guess another big deficit would be that the approach that we used to identify unique transcripts works well for human and for mouse because there have been fantastic efforts to characterize the transcriptomes previously. We have got ESTs, we have got FANTOM, and we have got [ENCyclopedia of DNA Elements], and those have defined transcripts, and we can use that to work out how many transcripts are likely to be there for the known transcriptome, and where the junctions are, and what sort of variants we might see. If you are working in a species where there really was not much transcriptome work, some of these approaches would be more challenging.
Another disadvantage would be that all these technologies, at the moment, using RNA-Seq will require a well-constructed genome. Arrays I can do on pretty much anything, if I make a cDNA library from it, or I do some sequencing of the transcriptome and then jump in. But really, for RNA-Seq, you would want to have the genome as well.
And the last one, I would say that your ability to study the transcriptome with short tags means that, if there are variants, if you have a combination of splicing events and alternate promoter usage that has not been seen before, you cannot necessarily put those two together, because your read is not long enough. So that’s still a challenge.
Other RNA-Seq studies were published over the last two months or so, involving the Illumina Genome Analyzer. How does your study compare to these; what’s similar and what’s different?
There have been a couple of papers in yeast, and some other work, recently, in mammalian systems. I think one of the big differences in our system is that rather than making a double-stranded cDNA, as some of them did, and then shearing it and sequencing it, we used a method that is directional, so that we maintain strand information, and that's very important in higher eukaryotic genomes, where sense-antisense expression occurs a lot, and we wanted to be able to tease those out with confidence.
A second thing is depth. The scale that we are using, if you have a look at the volumes of data, we are using well in excess of 100 million reads to do these experiments, and we believe that, if you want to completely characterize a transcriptome, you are going to need scale. SOLiD has the advantage that you can crank more out, as in the number of independent reads, so that is quite useful.
Barbara Wold’s group [at Caltech] had some really nice work … looking at novel splicing, and they certainly chased that with their data more so than we have. That’s another advantage of this approach that we are starting to actively chase.
Do you have other next-generation sequencing instrumentation in your lab, and did you consider other platforms before you decided to bring the SOLiD in house?
We only have SOLiD at our institute, but I have worked with the FANTOM colleagues, using 454 and Solexa data, for quite some time. We made a decision not to go after 454 when we started considering next-gen sequencing because we were so convinced by scale. We were attracted to SOLiD because of the collaboration we got going with those guys early on, and that’s certainly worked to our favor.
What other projects are you using the platform for now?
You name it, we are giving it a crack. We have completed a similar study now on human stem cell transcriptomes. We are moving into cancer cell line transcriptomes. We are also moving into genome sequencing for structural variants in cell lines and tumors. We are now really getting to [use] mate pair libraries, and looking for structural variation. We are also doing a more thorough characterization of the transcriptome. The methods that we originally used were not looking at the microRNAs and the small non-coding RNAs. We are now actively pursuing those guys as well.
So we are certainly moving from some of the classical biological model systems that we have worked on previously, and more towards pathological states.
Can you mention what have been the greatest challenges in using this platform — especially given that you are one of the earliest customers?
We got our first machine at the end of October, and we got our second machine in December. The first machine was generating sequence within three weeks of arriving, the second machine was generating data the day after it arrived. We found that when things are working well, and we made sense of some of the methodology, the generation of the data is quite robust. The big challenge, for us, was the bioinformatics.
We worked with MPSS data and with CAGE data, but we might be talking about a million tags for deep CAGE. Now, when we start talking about 100 million tags, that means that some of the computing that you thought was high-performance may be lacking. We certainly had to invest in new parallel computing, which would really handle this scale of data in a timely fashion.
Was dealing with data in color space a challenge?
That’s an interesting question. The mathematicians in the group had absolutely no problem with color space. I think the biologists were a little daunted by it. But once you start looking for SNPs, and [want to] be confident about those, then you see the advantage of teasing out systematic error vs. SNPs. The way color space works, in order for there to be a true change to the nucleotide sequence, there should be two changes in color space. And there are only certain combinations which are a legitimate change. When you get to use that, you very quickly strip out a huge amount of systematic error. So it gives you a lot more confidence when we are going chasing SNPs. Early on, it probably takes a good day or so to start to get your head around it.
All the mapping strategies we use were the ones coming from AB. They handle color space quite seamlessly. We certainly haven't seen any issue there with color space.
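The error-filtering logic Grimmond describes can be sketched in a few lines. This is a minimal illustration, not ABI's actual pipeline: the base-to-color mapping follows the published SOLiD two-base encoding (each color is the 2-bit XOR of adjacent bases), and the function names are hypothetical. The key property is that a true SNP flips exactly two adjacent colors while leaving their XOR unchanged, whereas a measurement error flips a single color.

```python
# Sketch of SOLiD two-base (color-space) encoding and the SNP-vs-error
# distinction described above. Mapping follows the published scheme:
# AA/CC/GG/TT -> 0, AC/CA/GT/TG -> 1, AG/GA/CT/TC -> 2, AT/TA/CG/GC -> 3.
BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colors(seq):
    """Encode a nucleotide sequence as colors: each color is the
    2-bit XOR of each pair of adjacent bases."""
    return [BASE[a] ^ BASE[b] for a, b in zip(seq, seq[1:])]

def classify_mismatch(ref_colors, read_colors):
    """A lone color mismatch is a likely sequencing error; two adjacent
    mismatches whose XOR is conserved are consistent with a true SNP."""
    diffs = [i for i, (r, q) in enumerate(zip(ref_colors, read_colors))
             if r != q]
    if len(diffs) == 1:
        return "likely error"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i, j = diffs
        # A single base change flips both flanking colors but leaves
        # their XOR (which depends only on the outer bases) unchanged.
        if ref_colors[i] ^ ref_colors[j] == read_colors[i] ^ read_colors[j]:
            return "consistent with SNP"
    return "other"

ref = to_colors("ACGTAC")    # reference in color space
snp = to_colors("ACTTAC")    # G->T substitution: two adjacent colors change
err = ref[:]
err[2] ^= 1                  # a single flipped color: systematic error

print(classify_mismatch(ref, snp))  # consistent with SNP
print(classify_mismatch(ref, err))  # likely error
```

Filtering on this "legitimate combination" rule is what strips out systematic error before SNP calling, since an isolated color change can never represent a real base substitution.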
I was thinking about challenges relating to software being available to analyze the data?
For mapping, we are using this software that comes with the machine, and it’s blindingly fast.
What aspects of the technology do you think could be improved in the future: on the sample preparation side, the chemistry, and the software and data analysis side? What is your wish list?
This meeting has been great for that, because everyone has been putting up their wish lists. I guess it's like a motto for the Olympics: we want to be faster, longer, stronger. We are currently working with sequences that are 35 base pairs; we would like to have longer reads. It would also be beneficial to speed up the machines. If we are working with mammalian genomes and transcriptomes, and epigenomes, we require more scale per run. I guess the advantage of all those things, if you get them, [is that] it would also make it cheaper to do experiments, because these experiments are not cheap.
With chemistry, I would like to have data that has no errors, I guess. We still have got a way to go in that respect. With sample preparation methods, we have been using quite large amounts of RNA. We are talking about using, in the case of the ES cell work, about 50 micrograms of total RNA. Similar to what we used to use in the bad old days on a microarray, before sample amplification was invented. That would be fine for a lot of experiments; it would be challenging if we are trying to use developmental tissues or laser capture material or those sorts of things. I think there is room for expansion into that area.
And then, the other thing is, we need to do a lot more of this, because there are going to be biases associated with library generation and the ePCR and those things that we are only just starting to get our head around.
Is the complexity of the sample and library preparation similar to what you were used to from microarray work?
I think they are very similar technologies, because we are doing similar sorts of things to the RNA; we are using the same sorts of enzymes and things. You certainly have to be more conscious of artifacts that you may not see on your array because there are no probes to detect them. You now are going to catch all of them in their ugliness.
So we did a lot of work of cloning libraries, and then having a look at those with conventional sequencing, before we were convinced we had something that we should really throw at the next-gen sequencing.
What is your recommendation for researchers who want to acquire a SOLiD system — what should they be aware of or pay attention to before bringing this kind of technology in house?
For all the next-generation sequencing approaches, I believe that the sequencing is not the challenge. Certainly with the likes of SOLiD, the machines fire up quickly, and you can generate an awful lot of sequence very quickly.
What you need to do is have that workflow — and by that I mean going from the molecular biology through to the bioinformatics — well established. It’s really key to recognize that the machine is the head of the river, but you really need to have the other parts of that workflow. Otherwise, you are going to end up with gigs upon gigs upon gigs of data, and it’s going to be quite challenging to get some insight from that.