Advanced Genomics Technology Center
Name: David Duggan
Title: Director, Advanced Genomics Technology Center; investigator, Genetic Basis of Human Disease Division, Translational Genomics Research Institute, since 2003
Experience and Education:
- Associate investigator and head of microarray unit, Genomics Section, Genetics and Genomics Branch, National Institute of Arthritis and Musculoskeletal and Skin Diseases, 2000-2003
- Research Fellow, Cancer Genetics Branch, National Human Genome Research Institute, 1998-2000
- PhD in human genetics, Department of Human Genetics, University of Pittsburgh, 1997
- BS in biochemistry, College of Arts and Sciences, Temple University
David Duggan runs one of two genotyping centers at the Translational Genomics Research Institute in Phoenix, Ariz., using technologies from Affymetrix, Illumina, Sequenom, and Applied Biosystems for a variety of genotyping studies. Recently, TGen acquired an Illumina Genome Analyzer in order to integrate high-throughput sequencing into its experimental designs. In Sequence caught up with Duggan last week to find out how next-generation sequencing can help him and his colleagues get to the root of genetic causes of disease.
Tell me about your work at TGen. What kinds of genotyping studies do you conduct, and how are you equipped?
I am the director of the advanced genomics technology center. TGen actually has two genotyping centers. I run one, and Dietrich Stephan runs the other one. The two centers differ in disease focus.
When I first came here about four and a half years ago, we recognized the need for several different genotyping technologies and have since built what we call the end-to-end solution for genotyping. We are able to do whole-genome association studies with technologies from Affymetrix and Illumina. We are able to do fine-mapping/candidate-gene studies with technologies from Affymetrix, Illumina, Applied Biosystems, and Sequenom. And we can do fine-mapping studies with individual SNPs or insertion-deletion polymorphisms with technologies from Illumina, Sequenom, and Applied Biosystems. And then, of course, we always have the PCR-RFLP-based analyses that we once in a while have to do as well.
Between the two facilities we have several Affymetrix systems in house, we have several Illumina systems in house, and we have a complete Sequenom system. We have two ABI 3130s for microsatellite analysis. We have two ABI 3730s for dideoxy-based sequencing; we do lots of it. And then we recently acquired, about six months ago, the Illumina Genome Analyzer.
One of the reasons why my lab has not been able to take a lead on the Genome Analyzer here at TGen is because I’m almost overcommitted with genotyping projects. We are performing three genome-wide scans in the next four months, I have got three candidate gene studies going on, and all of these are SNP-based inquiries.
On what basis did you decide to purchase the Illumina sequencer?
You have got to remember, we made this decision back in March. At that time, it was either 454, now Roche, or Illumina. There were really only two products that were commercially available.
We were in contact with Applied Biosystems in regard to their SOLiD system. Helicos had been in contact with us about their HeliScope. But to cut a long story short, we had a need, and that need was to begin thinking about and incorporating next-generation sequencing technology into our experimental designs. So the decision was really Roche vs. Illumina at that time. We didn’t want to wait nine months for the SOLiD system and we did not want to wait a year or more for the Helicos system.
But we did not make this decision solely on convenience. We were pleased with some of the specs coming out of the Solexa [now Illumina] system. That is, the ability to do up to one gigabase of genomic DNA resequencing. The fact that run times on the machine were around three days was an advantage as well. For the HeliScope, I think the expected run time was significantly longer. Also, the required amount of input DNA [for the Illumina Genome Analyzer], between 0.1 and 1 microgram, was compatible with our intended experimental designs.
Finally, I think the big factor for me, and I came in late on this decision, was the cost per run.
The cost per run on the Illumina Genome Analyzer is around $3,000 to $4,000. We were being told by Helicos, for example, that their cost per run would be significantly higher. [Illumina’s cost per run is] reasonable, and it’s at a price point where we can find funds from elsewhere, other than NIH grants, to do pilot studies. And that could form the basis of bigger projects, bigger grants going forward.
But it wasn’t any one of these metrics, it was more a combination of the above, plus [other] metrics.
How is next-generation sequencing going to change the experimental design of gene or genome association studies performed at TGen?
For both the candidate gene association studies as well as the whole-genome association studies being performed today, many of the experimental designs are based on data coming out of the International HapMap Project. That project has been a hugely successful endeavor, and has certainly propelled the industry forward.
That said, there are new data coming out, for example from the paper by Levy et al., the first complete sequence of a diploid genome (see In Sequence 9/4/2007). And what we see there is [that] there is a lot more genetic variation than is captured in the databases today. In the Levy et al. paper, of the 4.1 million genetic variants they identified, 31 percent were novel, meaning they were not found in the current databases.
There is a huge positive here for the International HapMap Project, and that is, 85 percent of the SNPs that Levy et al. discovered were found in the same database — dbSNP — used by HapMap. That means our coverage of single nucleotide polymorphisms in the database is pretty good, so any experimental designs making use of the HapMap data are going to have a lot of power.
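Taking the figures quoted above at face value, the arithmetic behind the two percentages can be sketched as a quick back-of-the-envelope check (the variable names and rounding are illustrative, not from the paper):

```python
# Rough check of the Levy et al. figures quoted above: 4.1 million
# variants, 31 percent of which were not in existing databases.
total_variants = 4_100_000   # genetic variants identified
novel_fraction = 0.31        # fraction absent from current databases

novel = round(total_variants * novel_fraction)
known = total_variants - novel
print(f"~{novel:,} novel variants, ~{known:,} already in databases")

# The 85 percent figure applies to SNPs specifically: most discovered
# SNPs were already in dbSNP, so the novelty is concentrated in other
# variant classes (indels, inversions, segmental duplications).
```

In other words, on the order of 1.3 million variants were new, yet SNP coverage in dbSNP remained high, which is why HapMap-based designs retain their power.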
What the Levy et al. paper also discovered were genetic variants that are not going to be easily accessible by SNP genotyping. For example, they discovered a lot of insertion and deletion polymorphisms. They also discovered some inversions and segmental duplications.
We have known there are other genetic variants besides SNPs which can cause disease, and I think the paper by Levy et al. really opened up everybody’s eyes as to how much more there actually is. And we need to rethink our experimental designs, or at least modify them, going forward. The initial modification will incorporate, in my opinion, next-generation sequencing.
At least today, the only way to get at some of these novel variants, especially the insertion and deletion polymorphisms, is by resequencing. And as a researcher who is interested in identifying the genetic basis — not the SNP basis — of human disease, I am interested in identifying all possible variants, or causes, to the disease, prior to screening my larger population.
Can you give an example of how you plan to integrate high-throughput sequencing in your studies?
For example, in my laboratory, we have a collaborative project with the Colon Cancer Family Registry, where we are screening 52 candidate genes using a tag-SNP approach, which is built on the HapMap project. Ideally, I would like to resequence all those 52 genes in a subset of our population, identify not only the SNP variants, but the insertion and deletion polymorphisms as well, and then design an experimental approach around that data, prior to genotyping the 7,200 samples. It would be more complete than a SNP-based study alone. It would be a combination of sequencing first, and then genotyping second.
Secondly, chips from Illumina today allow us to actually add custom content. So you can envision that prior to doing a genome-wide scan, you resequence on a genome level, not at a candidate gene level, a subset of the samples that are ultimately going to be genotyped on the genome-wide scan technology, in order to identify variants that are not currently captured by the genome-wide products. And if that’s true, then there are custom strips available to you on the Illumina platform that you can then add that custom content to. I don’t know if this is going to be implemented right away because the data coming out of this Levy et al. paper, as well as other data, suggest that the tag-SNP approach to genome-wide association studies is quite powerful already today.
The other experimental design we can envision is, today, we do genome-wide studies in stages. In each one of these stages, we reduce the area of the genome we look at. So we start with 4,000 samples on 500,000 SNPs. We then identify, depending on cost, maybe the top 1,000, or in some cases even only the top few hundred SNPs, and then genotype those in a confirmatory population, of maybe another 4,000 samples. Then that leads us to stage three, where we are left with just a few handfuls of significant SNPs, and at that stage, we start resequencing.
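The staged design described above is essentially a funnel that narrows the SNP set at each stage. A minimal sketch, using the sample and SNP counts from the interview but with mock p-values standing in for real association statistics (the ranking logic and function names are assumptions for illustration only):

```python
# Hypothetical sketch of the staged genome-wide association design
# described above. The mock p-values stand in for real association
# tests; this is not TGen's actual pipeline.
import random

random.seed(0)

def genotype_stage(snp_ids, n_samples):
    """Stand-in for genotyping n_samples and testing each SNP:
    assigns every SNP a mock association p-value."""
    return {snp: random.random() for snp in snp_ids}

# Stage 1: ~4,000 samples genotyped on ~500,000 SNPs.
stage1_snps = [f"rs{i}" for i in range(500_000)]
stage1 = genotype_stage(stage1_snps, n_samples=4_000)

# Stage 2: carry the top ~1,000 SNPs into a confirmatory
# population of another ~4,000 samples.
top_1000 = sorted(stage1, key=stage1.get)[:1_000]
stage2 = genotype_stage(top_1000, n_samples=4_000)

# Stage 3: a few handfuls of significant SNPs remain; these
# regions would then be targeted for resequencing.
finalists = sorted(stage2, key=stage2.get)[:20]
print(f"{len(stage1_snps):,} SNPs -> {len(top_1000):,} -> {len(finalists)}")
```

The point of the sketch is the shape of the funnel: each stage spends genotyping budget on a progressively smaller SNP set, and only the survivors of stage three were, historically, cheap enough to resequence.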
With the throughput capabilities of the next-generation sequencing technologies — both base pairs per run and genomic coverage — we are likely to no longer limit ourselves to a handful of candidate resequencing regions. We can actually open the pipeline a little bit more and resequence dozens of candidate gene regions. For example, in the type 2 diabetes studies that were published this past summer, they identified ten causative regions in the human genome that are associated with diabetes. Do you think I want to sequence them one at a time? I want to sequence them all at once. And we can do that; at least these technologies afford us the possibility to do it. It’s both cheaper as far as consumables [than Sanger sequencing], and it’s far more efficient as far as time goes.
Do you see any limitations caused by a lack of methods to select regions of the genome for sequencing?
It’s labor-intensive but not impossible. It just takes a little more labor and a little more effort to get those studies done today than I presume it will tomorrow. We are working on some [methods here at TGen] to make that process more efficient. But long-range PCR for a region or two, or maybe three or four, of the genome, provided those regions aren’t terribly big, is doable today.
Some genome sequencing projects will only sequence exons. From your experience with whole-genome scans, does that make sense?
That has largely been the byproduct of our economic need to take a reductionist approach to experimental designs. The evidence to date is somewhat biased. There is a significant number of Mendelian diseases with variations in the coding region of the genes. But that data is somewhat biased because we didn’t really look at non-coding regions in the past for economic reasons. And I think what we are starting to see today with the genome-wide association studies that have been published in the last six months to a year is, roughly half of the association regions published to date have been found in non-genic regions of the genome, while the other half have been found in genic regions. Going forward, we will see some groups prioritize coding sequence over non-coding sequence, but the cost and efficiency of next-generation sequencing is likely to take that consideration off the table.