CSHL's Dick McCombie on How to Handle Half a Terabyte of Sequence Data


Name: W. Richard McCombie
Title: Professor, Cold Spring Harbor Laboratory, since 1992
Age: 52
Experience and Education: Senior Staff fellow, National Institutes of Health, Laboratory of Molecular and Cellular Neurobiology, 1988-1992 (working with Craig Venter in the Receptor Biochemistry section)
PhD, Cellular and Molecular Biology, University of Michigan, 1982
BA, Biology, Wabash College, Indiana, 1977

Dick McCombie’s group at Cold Spring Harbor Laboratory was involved in sequencing the first plant genome, Arabidopsis thaliana, and has been developing methods to determine the structure of complex plant and animal genomes.
McCombie, whose group has had a prototype Solexa Genome Analyzer since last December, recently received one of the first grants from the National Center for Research Resources to buy a production version of the instrument, now sold by Illumina. His grant application originally said the lab would buy a 454 platform.
Last year, McCombie, who prior to joining CSHL worked with Craig Venter at the National Institutes of Health, helped organize the first CSHL course on next-generation technologies, and is planning another one this fall. For almost a decade before that, he ran a course on the use of Sanger sequencing.
In Sequence caught up with him last week to talk about the opportunities and challenges of using next-generation sequencing technology.
How did you make the decision to acquire an Illumina (formerly Solexa) Genome Analyzer, rather than the 454 instrument that you originally asked for in the grant?
The grants go in quite a long time in advance. There really was not much information available about [the Illumina instrument] a year ago. From the things we did with the prototype, we are very enthusiastic about it; it’s working. Having said that, we are still struggling. It’s a new instrument, and it’s not like an ABI 3730 where you get runs in the lab, and you don’t hear about it [afterwards]. We are spending a major, major effort on how to use this instrument.
In terms of why we made the decision, it was almost exclusively based on the cost of operation of the instrument, and the cost of sequencing, coupled with the fact that we were going to use it primarily for resequencing. I think if we were not using it for resequencing, we might have very well made a different decision, because of the read length. But for resequencing known things, the accuracy seems very good to us, and the data quantity versus cost is different enough [from] anything that’s on the market.
The ABI [SOLiD] instrument looks very interesting to me [as well] but it was not available. We actually checked prior to placing this order but [ABI] just could not make a delivery in the timeframe that we wanted it.
[Compared to] the things out there, this [instrument provides] enough of a change in throughput that it’s going to totally revolutionize some areas of genomics. It’s just a completely different world when you can literally talk about, for instance, low-coverage sequencing of a person for some tens of thousands of dollars, instead of tens of millions.
I was giving a talk on some applications of it a few weeks ago, and somebody pointed out some limitations of the data, and I said, ‘The cost is so low [to generate] so much data that my view is [that] you come up with ways to use that data rather than worry about the problems.’
It’s very analogous [to when] I was working on the ABI instrument in the late 80s, and there were a lot of things that it would not do, either. It got better … and the Solexa platform has already gotten better, even in the time that we have had the prototype. We got better at ways to use it and avoid the problems. And we are still in that learning stage for now with the Solexa, I have to say.
How does the instrument perform in your hands?
When it’s typically running, most of the stuff works. It does not all work to the optimum on all lanes always, and we are working to increase that consistency [from run to run], but when it does, and on some of [the runs] it does, we get well over a billion bases per run. The read length is fixed, you fix it at either 27 or 36 [bases]; we [recently] switched to 36-base reads.
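As a rough illustration of the figures quoted above, the sketch below converts "a billion bases per run" into an approximate read count at each read length, and into a fold-coverage of a human-sized genome. The per-run yield and the read lengths come from the interview. The genome size of about 3.1 billion bases is an outside assumption, and the function names are hypothetical.

```python
import math

def reads_per_run(bases_per_run: float, read_length: int) -> int:
    """Number of fixed-length reads needed to account for a given base yield."""
    return math.ceil(bases_per_run / read_length)

def fold_coverage(bases_per_run: float, genome_size: float) -> float:
    """Average number of times each genome position is covered by one run."""
    return bases_per_run / genome_size

BASES_PER_RUN = 1e9          # "well over a billion bases per run" (lower bound)
HUMAN_GENOME_BASES = 3.1e9   # assumption: approximate haploid human genome size

for read_length in (27, 36):  # the two fixed read lengths mentioned
    n = reads_per_run(BASES_PER_RUN, read_length)
    print(f"{read_length}-base reads: about {n / 1e6:.1f} million reads per run")

cov = fold_coverage(BASES_PER_RUN, HUMAN_GENOME_BASES)
print(f"One run is roughly {cov:.2f}x coverage of a human-sized genome")
```

This is only a raw-yield estimate. It ignores reads that fail quality filters or do not align, so usable coverage would be lower.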
What have been the main challenges?
It makes data files of somewhere in the area of half a terabyte. And the network here is just not fast enough to move that around. We bought a 16-processor Linux cluster to analyze the data, and it still takes a long time to do those analyses, over a day. Cold Spring Harbor has a high-performance computing cluster, and running it on 100 nodes of that cuts the processing time down to six or eight hours, which is a big improvement. [But] we right now are having a very difficult time getting that data to that computing cluster, which is in a different building on campus. Neither us nor they are on the main campus, so the data actually has to route back to the main campus, and then out to this high-performance cluster. We are working pretty constantly with the IT department here on doing that.
That’s been the main issue, actually. And that’s not really an instrument issue, it’s just the facilities here. That’s slowed us down, certainly. We can do it, it’s just not as seamless, and it takes a lot more effort than we would like. It slows down troubleshooting and testing new things, because the runs themselves [take] several days, and then it takes sometimes days to move the data around and do the analysis. We will solve that, it’s just [that] people weren’t prepared for that kind of data.
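To put the half-terabyte figure in perspective, here is a back-of-the-envelope sketch of how long one run's data would take to move at a few nominal link speeds. Only the data size comes from the interview. The link speeds are hypothetical examples, since the actual campus network speed is not stated, and real throughput is usually below the nominal rate.

```python
def transfer_hours(size_bytes: float, link_bits_per_s: float,
                   efficiency: float = 1.0) -> float:
    """Hours to move size_bytes over a link, given its nominal bit rate.

    efficiency is the fraction of the nominal rate actually achieved
    (1.0 means an ideal, uncontended link).
    """
    seconds = size_bytes * 8 / (link_bits_per_s * efficiency)
    return seconds / 3600

RUN_SIZE_BYTES = 0.5e12  # "half a terabyte" per run (decimal terabyte assumed)

# Hypothetical nominal link speeds, not taken from the interview.
LINKS = {"100 Mbit/s": 100e6, "1 Gbit/s": 1e9, "10 Gbit/s": 10e9}

for name, rate in LINKS.items():
    ideal = transfer_hours(RUN_SIZE_BYTES, rate)
    shared = transfer_hours(RUN_SIZE_BYTES, rate, efficiency=0.3)
    print(f"{name}: {ideal:.1f} h ideal, {shared:.1f} h at 30% efficiency")
```

Under these assumptions, a slow or heavily shared link can add many hours per run, which is consistent with the data movement taking "sometimes days" once routing through the main campus is involved.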
In terms of the science, what projects have you done, or are you planning to do, on the instrument?
My lab is operating the instrument, and we are working with people on a number of projects, in addition to other labs who are using the instrument for various projects.
Things that we are interested in are largely resequencing [projects]. We do have a grant from the [National Science Foundation] with Blake Meyers and Li Liao at [the University of] Delaware, and Rod Wing at [the University of] Arizona, to study de novo sequencing using mixed data types. Rod has a 454 [Genome Sequencer], and this actually had some influence on our decision to go with the Solexa, I suppose. Li is a computer scientist and is working on algorithms to [analyze] mixed data of different data types. The proposal of the grant that we have is essentially to oversample some multimillion-base regions of rice — different wild rices, not the one that has been sequenced — using all three platforms — ABI 3730, 454, and Solexa — and then try different assembly algorithms and strategies on subsets of that data to determine the cost minimum for [sequencing] a new genome by mixing the datatypes.
For instance, a very low coverage on long ABI 3730 reads, and then maybe some 454 reads, and then very high coverage on the Solexa, which is cheaper but does not assemble as well because of the read length. We have just gotten some of the regions sequenced with the ABI, some with the 454, and Li is working on those assemblies, and just in the last few weeks [we] have added the Solexa data.
We are [also] interested in cancer and certain cognitive diseases, working with the people here at [Cold Spring Harbor] Lab and other places, looking at genes that we think might be involved in some of these diseases. The goal is to sequence either large regions, or multiple regions, from a large number of individuals, and we are doing things to optimize the instrument for that purpose, to make it easier to pool samples and so forth. We are also working on [ways] to more readily select portions of the genome that we want to sequence.
I think, ultimately, what I’d like to do, and we have not done too much on it yet, is … things like expression analysis and even copy-number analysis. I think this instrument actually can provide an interesting way to do things like expression analysis, and I think has some distinct advantages over chip analysis. [You can count transcripts], and also, the thing with chips is, if you use an array to determine expression, you need to know in advance what you are looking for. That is a limitation of that technology.
What do you expect will happen this year with regard to the new sequencing technologies?
I don’t know. I think everyone is kind of waiting to see what happens with the ABI [SOLiD platform]. I have seen their prototypes, and it looks like a pretty exciting instrument. I have known some of the people working on that instrument at ABI for years, and they are really top-notch engineers and scientists, so I have fairly high expectations from that.
Tell me about your course on ‘Revolutionary sequencing technologies and applications’ this fall. Which technologies are you planning to include in this course?
We are still working on that. We have gotten fairly good commitments from multiple instrument manufacturers to participate, and have their machines available for the students. What we are trying to do, something we could not do with the last course because we just had one instrument type [454’s Genome Sequencer], is to show how their different characteristics make them better for some things, and how they can be optimized for certain applications.
We have come from a period where there was one sequencing instrument, realistically, the ABI 3730 [or] 3700, and that’s really been the phase we have been in for close to 20 years. And I don’t think that exists anymore. I think there is going to be a multi-platform world, so we are hoping to convey that, [so] the students can learn and understand for their projects what’s the best technical way to approach it. The other aspect of that is, we want to show that there is some science that you really could not do before that you can do with these instruments, and I think that’s the most exciting thing about it.
