Co-Director, Genome Sequencing and Analysis program
Name: Chad Nusbaum
Position: Co-Director, Genome Sequencing and Analysis program, Broad Institute, since 2001
Experience and Education:
- Research scientist, Whitehead Institute/MIT Center for Genome Research, 1996-2001
- Project leader, Mercator Genetics, 1995-96
- Postdoc in plant genetics, University of California, Berkeley, 1991-95
- PhD in biology, Massachusetts Institute of Technology, 1991
- AB in biology, Columbia College of Columbia University, 1983
With more than 100 ABI 3730 sequencers, 16 Illumina Genome Analyzers, three 454 Genome Sequencers, and one of the first ABI SOLiD platforms, the factory-style sequencing center of the Broad Institute is at the cutting edge of sequencing technology. Formerly the Whitehead Institute/MIT Center for Genome Research, the sequencing facility was one of the main contributors to the Human Genome Project and has been involved in the sequencing of numerous other organisms.
As co-director of the Broad’s genome sequencing and analysis program, Chad Nusbaum oversees technology development projects as well as a number of microbial and fungal genome projects. In Sequence visited Nusbaum last week and spoke to him about how the institute has been implementing new sequencing technologies.
What types of new sequencing platforms do you have at the Broad Institute, and what kinds of projects do you use them for?
We have three 454’s, two of which have been upgraded to the FLX.
Those are used for bacterial and fungal sequencing, especially for bacterial genomes where there is some kind of cloning bias problem. One big advantage that 454 offers is that the project goes quickly, and there are a lot fewer steps. You don’t have to make a couple of plasmid libraries and plate them out and pick a lot of clones and then sequence them. It’s just, ‘Take the thing, grind it up, dump it in the sequencer,’ and you have data in two days. A bacterium seems to fit very nicely into a run or two of 454.
Setting aside cost for a second, from a process management perspective, it’s an awful lot easier to do a large number of small projects on the 454 platform. It’s logistically a lot simpler than doing it by traditional means.
We are expecting to do a large number of bacteria over the next year to several years. We are working with other NIH groups as part of the Human Microbiome Project (see In Sequence 6/19/2007).
We also use 454 for some tricky profiling projects, like population profiling, that are impractical to do by Sanger sequencing because of the complexity of the samples but that aren’t suited to the short-read technologies either.
What short-read technologies do you have?
We have 16 Solexa machines. Solexa gives you an awful lot of cheap data, and the question is, what can you use these data for? Because in principle, the cheaper the data, the more things you can do. But because the reads are short, what you can use them for is limited by the challenge of the length of the read.
Now, anything I am saying that we do with Solexa, I see no reason why we wouldn’t be able to do with the ABI SOLiD machine. We have an ABI SOLiD machine, and it’s still in the testing phase, so we don’t have any real data from it. But in principle, it seems to be trying to more or less tackle the same problems as Solexa.
From a cost basis, ideally, we want to do everything on short-read sequencing platforms. Of course that’s technically unsolved for a lot of these things. But for some applications, it’s very straightforward.
For example, we recently had a paper in Nature (see In Sequence 7/3/2007) where we profiled chromatin states using chromatin IP with antibodies to modified histones [using Solexa sequencing]. That’s basically a counting, or tagging, application.
We thought this was a great first application for us to be doing Solexa sequencing on. It was sort of a test bed for machine development and for process-management development. We built a whole [laboratory information management system] around this, and we figured we might as well be running samples that will be informative.
However, the data from this paper actually doesn’t represent a very large amount of the data we generated on the Solexa platform. Most of the data that we ran through Solexa for quite a while was just E. coli and other control samples, over, and over, and over again, optimizing different conditions, so we could figure out how to run the machine better.
Solexa sequencing is also great for SNP discovery in bacteria. It’s tremendously good for that. The first thing we did was a proof-of-principle study. We had two finished isolates of Mycobacterium tuberculosis, and we knew there were about 800 SNPs between them. And we asked, ‘How many of these SNPs can you find by Solexa sequencing, and how accurate is the result?’
We took the drug-resistant isolate and generated data from a lane of Solexa sequencing. After hard filtering for quality, that yielded about 15-fold coverage of the genome. We used a very simple neighborhood quality standard algorithm where we filtered the reads really just for intensity; we used intensity as a proxy for quality. We then aligned those filtered reads back to the reference genome using fairly stringent filtering. We discovered 98 percent of the SNPs, and we discovered zero false positives. We also had zero false positives when we tried to discover SNPs on the strain that we sequenced.
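The filter, align, and call approach described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Broad's pipeline: reads are assumed to be already aligned without gaps, each base is assumed to carry a numeric quality score, and the function names and all thresholds (`min_q`, `flank`, `flank_min_q`, `min_depth`, `min_fraction`) are hypothetical. The neighborhood check is a simplified form of the idea that a base is trusted only when it and its neighbors are of good quality.

```python
from collections import Counter, defaultdict


def passes_neighborhood(quals, i, min_q=20, flank=2, flank_min_q=15):
    """True if base i and its flanking bases all meet the quality thresholds."""
    if quals[i] < min_q:
        return False
    lo = max(0, i - flank)
    hi = min(len(quals), i + flank + 1)
    return all(q >= flank_min_q for q in quals[lo:hi])


def call_snps(reference, aligned_reads, min_depth=3, min_fraction=0.9):
    """Call SNPs from ungapped alignments given as (start, bases, quals)."""
    pileup = defaultdict(Counter)
    for start, bases, quals in aligned_reads:
        for i, base in enumerate(bases):
            if passes_neighborhood(quals, i):
                pileup[start + i][base] += 1

    snps = {}
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too little trusted coverage to make a call
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            snps[pos] = (reference[pos], base)
    return snps


reference = "ACGTACGTAC"
# Three high-quality reads supporting a T->A change at position 3.
reads = [(0, "ACGAACG", [30] * 7)] * 3
# One read with a low-quality mismatch at position 6; the filter discards it.
reads.append((4, "ACTTAC", [30, 30, 5, 30, 30, 30]))
print(call_snps(reference, reads))  # {3: ('T', 'A')}
```

Requiring both a minimum depth and a high supporting fraction trades some sensitivity for fewer false positives, which is consistent with the stringent filtering described in the interview.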
It turns out that the other 2 percent of the SNPs, [which] we did not discover, were in regions of the tuberculosis genome that were greater than 80 percent GC content. Tuberculosis was maybe not a great choice to start with because it’s high GC, and all of these new sequencing technologies have some problems with extremes of GC content. We think we have some ways to ameliorate that. But my point is, we can explain stuff that’s missing.
We were also not able to call SNPs in regions of the genome that were in repeats of lengths greater than 36 bases, which is the length of the Illumina/Solexa reads we used.
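The repeat limitation can be illustrated with a small sketch that finds the positions where a read of a given length would not map uniquely. It is a simplification under stated assumptions: only exact matches on the forward strand are considered, whereas a real aligner would also handle the reverse complement and mismatches, and the function name `ambiguous_read_starts` is hypothetical.

```python
from collections import Counter


def ambiguous_read_starts(genome, read_length):
    """Return start positions whose read-length substring occurs more than once."""
    n_starts = len(genome) - read_length + 1
    counts = Counter(genome[i:i + read_length] for i in range(n_starts))
    return [i for i in range(n_starts) if counts[genome[i:i + read_length]] > 1]


genome = "ACGTACGTTT"  # toy sequence; "ACGT" occurs twice
print(ambiguous_read_starts(genome, read_length=4))  # [0, 4]
print(ambiguous_read_starts(genome, read_length=5))  # []
```

In the toy example, a read one base longer than the repeat reaches into unique flanking sequence and can be placed, which is why repeats longer than the 36-base reads are the ones that cause trouble.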
The other thing that can be different between genomes is insertions and deletions, and there were something like 50 insertions or deletions between these two genomes, and we found 95 percent of those, including two that were larger than the reads. And the ones that we missed were, again, either in these regions of greater than 80 percent GC content, or in one case, we missed a 1,100-base deletion. We were not expecting to find that one. We did find a couple of deletions in the 50- and 60-base range.
So that was a terrific, exciting result, and we said for the cost of one lane of Solexa sequencing, we can call all the SNPs we want in a modest sized bacterium. So anytime we are going to do this, we will do this by Solexa sequencing.
We have also done this to find base changes in close isolates that have been rapidly evolved under selective conditions. You apply selection to a bacterium, and you can use this to find the exact location of the base changes. And that works. That’s a great application of Solexa.
We can also do this in fungi, and we have done it in stickleback. We haven’t done it in mammalian-sized genomes yet, but it should work there. It’s just a question of whether that’s the most efficient way to get those data. We want to do SNP discovery in humans as well, and this can be done by targeted sequencing.
A particular application that we are excited about is doing SNP discovery in cancer. Cancer is tricky, of course, because you have a mixed population of cells. And that’s something that historically has been done by Sanger-based sequencing of PCR products. You can’t realistically see SNPs below 25 percent [that way] — some people say 10 percent, but it doesn’t matter, there is a lower limit, and it’s noisy. With any of these new technologies, because they isolate the strands of DNA, they are immune from the problems that happen when you lose your signal in a mixed population.
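Because each read comes from a single DNA molecule, spotting a rare variant in a mixed sample becomes a counting problem. The sketch below is a rough model under stated assumptions: it treats reads as independent draws, ignores sequencing errors entirely, and uses hypothetical numbers (a variant in 5 percent of the DNA, at least three supporting reads) that do not come from the interview.

```python
from math import comb


def prob_at_least(min_reads, depth, fraction):
    """Binomial probability of seeing >= min_reads variant reads at a site."""
    return sum(
        comb(depth, k) * fraction**k * (1 - fraction) ** (depth - k)
        for k in range(min_reads, depth + 1)
    )


# Hypothetical scenario: variant present in 5 percent of the DNA,
# and at least 3 supporting reads required before believing it.
for depth in (15, 50, 200):
    print(depth, round(prob_at_least(3, depth, 0.05), 3))
```

Under this model the chance of detection rises with coverage, so the practical lower limit is set by how deeply a site is sequenced and by the error rate, which the sketch leaves out.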
What has your experience with the ABI SOLiD platform been?
In detail, the Illumina and the ABI platforms are very different, because the ABI technology is something completely new. But the application space looks very similar. They give you reads of similar length, they just get them in phenomenally different ways.
I’m optimistic about the ABI machine; I am very excited about it. We are working very closely with ABI.
I should say we are working very closely with all of these companies. I am delighted with my relationships with all of them. They have all been very forthcoming, and working with us on development projects, and sharing with us stuff in advance, and they have been tolerant of the fact that we are also working with the other companies. I have a serious interest in being part of making all of these things succeed. It’s a win for me, it’s a win for them, and it’s a win for the field at large.
What are your plans for using the 16 Illumina sequencers?
We are continuing to do epigenomic studies like the chromatin structure work we recently published. We are starting to look at a variety of aspects of cancer. And we're planning a wide range of microbial projects.
[We also plan to use Solexa] sequencing to do human genetics. Association studies, gene discovery, any time you want to do large-scale resequencing. The number of samples you want to handle makes Solexa an appealing option. There are too many disease loci with good association scores to count, and then you can’t find the gene in your megabase of sequence. You can’t find the gene, maybe, because you don’t have statistical power. Maybe it’s not the same mutation in all of your patients. And how will you know? Maybe it’s not an exonic mutation. If you could tile that region with PCR products, pool those PCR products, and sequence them by Solexa in 100 or 1,000 patients, then it gives you an opportunity to get statistical power to narrow things down. It’s just too hard for us to prove by the old-fashioned method.
Are you planning to test any of the platforms that are still in the works? For example, Helicos BioSciences is just down the road from the Broad Institute.
I have been talking to Helicos for a long time, and know those guys very well. I like them a lot. Yes, as soon as they have a machine available to place here, I would be very excited to check it out.
What are your criteria for testing new sequencing platforms?
The machine has to produce usable data in a way that looks like it’s competitive with the state of the art.
At some point, also, I am going to have to balance my resources. We have a pretty serious staffing commitment to each of these machines. That’s obviously not a permanent situation, and it’s not scalable. If there were five or seven different new machines here, we wouldn’t be able to give them all the same commitment. And in fact, we are doing a lot less development with the 454 machine now because it basically works. The data that we generate from 454 is really production-type data.
The Solexa machines are working pretty nicely now but there is still good technical development we think we can do, either on our own or working closely with Illumina.
Where do you see the greatest challenges for new sequencing technologies regarding data handling and bioinformatics?
One is the sheer volume of the data. We have a group of people who are just focused on handling the volume of the data in increasingly efficient ways. A Solexa run generates half a terabyte of raw data. It’s almost impossible to do anything with that much data fast, and yet, we have to. For example, we don’t move the data very often. We take the data off the machine once, then we process it where it is, and then we throw it away, because storing the data costs almost as much as generating it.
I had a conversation with one of these vendors, and they said, ‘We’ll save all the data.’ And I and some other people said, ‘But, how much data is it? And how much does it cost to buy hardware to store those data?’ And the cost of the storage was starting to compete with the cost of generation of the data. And even if you could store it, indexing and retrieving it is so inefficient that we — and I think everybody else — decided that it’s just not worth it, you can’t do it. So we have to learn to live with not having archival data. We have every 3730 read we have ever done. We can go get that trace. That’s not exactly the raw data, but it’s close to it. We are not going to be able to do that with the new sequencing technologies.
There are two blocking issues to doing everything we do with capillary data with new sequencing technology data: molecular biology and algorithms. These two things are linked, in the sense that the algorithms have to understand the molecular biology, and the molecular biology has to be able to feed the algorithms.
I am prepared to argue that the reason we can’t assemble a mammalian genome from 35-base reads is not because 35-base reads don’t carry enough information, it’s [because] we are not delivering them in the appropriate way to the algorithms. We don’t necessarily have good enough algorithms now, but we are working hard on it and making a lot of progress.
With regards to molecular biology, specifically, this has to do with sample prep. If you want to assemble a genome, you need pairs of reads. The shorter your reads are, the better your pairs have to be in terms of how accurately you space them, and how well you know this spacing. And perhaps, you need a series of these. For example, you would have to be able to make 40-, 50-, or 100-kb pairs; that’s the sort of thing that we used to glue together the mouse, the dog, and the human genome. We don’t know how to do that yet, and you want to do it very accurately. We have some ideas that we are working on. But then of course, once you have been able to get those libraries made, and sequenced in the machine, then your algorithms have to make maximum use of them.
You always have to face the problem with the short read [of] how do you align it accurately, and how do you know you have aligned it accurately in the right place? Those are not glib, Zen questions; they are serious issues, and they are not trivial to resolve.
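One way paired reads help with placement can be sketched as follows. This is an illustrative toy, not a description of any production aligner: it assumes one read of a pair has several candidate positions, its mate maps uniquely, and the library's insert size is known to within a tolerance. The function name `resolve_with_mate` and its parameters are hypothetical.

```python
def resolve_with_mate(candidates, mate_position, insert_size, tolerance):
    """Pick the one candidate position consistent with the mate's position.

    Returns None when zero or several candidates fit the expected spacing.
    """
    consistent = [
        pos for pos in candidates
        if abs(abs(pos - mate_position) - insert_size) <= tolerance
    ]
    return consistent[0] if len(consistent) == 1 else None


# A read that matches three places; its mate maps uniquely at 5200.
print(resolve_with_mate([100, 5000, 9000], 5200, insert_size=200, tolerance=20))  # 5000
# Two candidates both fit the spacing, so the placement stays ambiguous.
print(resolve_with_mate([5000, 5400], 5200, insert_size=200, tolerance=20))  # None
```

The second call shows why the spacing has to be known accurately: the looser the tolerance, the more candidate placements survive the check.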
How far away do you think we are from assembling a mammalian-size genome from short reads?
I’d like to do it by next year — I don’t know how realistic that is. It’s hard to predict that. But of all the progress that’s being made, this is a big goal; it’s one that the NHGRI is pushing us toward, and it’s one that we all want to achieve. And I and all of my colleagues in the sequencing centers and in the sequencing companies are working toward this.
Where do you see sequencing technology going in the next few years?
The cheaper it gets, the more we can do. But of course cheaper often means that you have to work harder to maximize the data.
One interesting problem that we are facing — and, again, this is something that a lot of people have realized — is that these machines generate so much data that it’s not easy to feed them. If you can find all the SNPs you want in a bacterium with one lane of Solexa sequencing, you can do eight of those in one run, and with one Solexa machine, maybe you can do 16 a week. That’s an awful lot of bacteria to find that you want to work on. The sample acquisition doesn’t scale, in a sense.
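The scale of the feeding problem follows from simple arithmetic on the figures quoted in the interview. The sketch below treats them as an upper bound under loud assumptions: every machine is devoted to one-lane bacterial projects, and the runs-per-week figure is inferred from "eight in one run" and "16 a week" rather than stated directly.

```python
LANES_PER_RUN = 8               # eight one-lane genomes per run
GENOMES_PER_MACHINE_WEEK = 16   # "maybe you can do 16 a week"
MACHINES = 16                   # Solexa machines at the Broad, per the interview
RAW_TB_PER_RUN = 0.5            # "half a terabyte of raw data" per run

# Inferred, not stated: two runs per machine per week.
runs_per_machine_week = GENOMES_PER_MACHINE_WEEK / LANES_PER_RUN
genomes_per_week = GENOMES_PER_MACHINE_WEEK * MACHINES
raw_tb_per_week = runs_per_machine_week * MACHINES * RAW_TB_PER_RUN

print(runs_per_machine_week)  # 2.0
print(genomes_per_week)       # 256
print(raw_tb_per_week)        # 16.0
```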
If you want to sequence exons, you are talking about hundreds of thousands of exons that you could be dumping into a single or a small number of Solexa runs. That’s a daunting amount of PCR. I would very much like to not have to do huge amounts of PCR to do large numbers of exons from different patients. So we, and other groups as well, have recognized that this is a major blocking issue. Because we know how to make sequencing work, and we know how to write algorithms, even if it might take a little time. But handling these large numbers of samples, it’s a new problem for us, and there are, hopefully, clever technical solutions that will work, and there are some promising ones. It’s such an important topic right now that it has to get solved in the next year or two.
What will the landscape of sequencing platforms look like in the near future?
The installed-base landscape, you might call it, is going to be a lot more dynamic. We had one type of machine here for five years. We were running nothing but 3730s, and we were very happy to have a single platform running in our shop.
And before that, we were running nothing but 3700s, and we switched over so we were never really running both at the same time. And before that, there were the 377s.
That’s no longer sensible. Because first of all, you are going to do different things with different platforms. Second of all, there is a sort of development curve that you go through with the new technologies, and there are too many now. So we have got to be running all of them, or as many of them as we can afford to. And it’s my expectation that this is the way sequencing is going to be for a while. I don’t see, in the next couple of years, that we would want to run only one machine. It’s easy to see that different machines are suitable for different applications.
I don’t see the 3730 going away very soon; there are a lot of things that we can still only do with that, and we have our obligations that we have to fulfill with that. But at the same time, it’s balanced against the new technologies. And there are other, newer technologies that will be coming, and the question is, how do those fit in, what’s the disruptive technology paradigm?
I think people are open-minded about learning about new technologies. So if somebody invents a new machine, people are going to try it. And if it’s good, they are going to switch over. I don’t know how long this incredibly exciting time will last, but for a few years, these technologies are going to be in pitched battle. It’s a long race, and people are going to be starting late and catching up. It’s a very aggressive, competitive environment. The users of the machines and the consumers of the data are the big winners.