Genome Capture Technical Guide

Table of Contents

Letter from the Editor
Index of Experts
Q1: How do you fragment your DNA and why do you use that method?
Q2: What capture method do you prefer and why?
Q3: What do you consider an acceptable rate of capture?
Q4: How do you work to increase the specificity of the capture?
Q5: What potential sources of bias do you watch out for?
Q6: What quality control steps do you include?
List of Resources

Letter from the Editor

Sequencing sure has changed since Frederick Sanger developed his chain-termination method in the 1970s. These days, it's all high-throughput and parallelized. That can get pricey — particularly if you're only interested in certain focused regions or exons of the genome. Enter "genome partitioning," "exon capture," and "selective amplification."

Those phrases, and myriad combinations and permutations of those phrases, refer to a few types of experiments, though they all have the same goal: to target certain parts of a genome of interest to be sequenced. There's still uncertainty clouding just how these experiments should be run — after all, most of the articles published on these methods have only come out in the last year or two. In this technical guide, our panel of experts addresses some of the main questions that come up when someone embarks on a genome selection or capture experiment — how do you even start? What are the pros and cons of one method versus the others? And just how robust is it, anyway? Here, our experts let you know what's worked for them.

If our experts haven't answered your genome selection questions, be sure to check out the references at the back of the guide. The field is moving rapidly and it's best to get the basics down before sequencing gallops on ahead.

Ciara Curtin

Index of Experts
Many thanks to our tireless experts for taking the time to contribute to this technical guide, which would not be possible without them.

Jon Armstrong
Washington University
Genome Sequencing Center
Technology Development

Matthew Bainbridge
Baylor College of Medicine

Shawn Levy
Director, Vanderbilt Microarray Shared Resource
Assistant Professor
Vanderbilt University Medical Center

Michael Zwick
Assistant Professor
Department of Human Genetics
Emory University School of Medicine

Q1: How do you fragment your DNA and why do you use that method?

We use the Covaris S2 Sonolab instrument for fragmentation. This instrument gives a tighter fragment size distribution than any of the other technologies, which allows us to skip a gel size selection step.

Because the fragmentation process is self-contained, it allows us to use a minimum of input genomic DNA in our WUCap (Washington University Capture) method, and a 96-probe version of the instrument makes the approach more adaptable to automation and higher throughput.
— Jon Armstrong

We use nebulization because it is fast and easy to do, and it produces fragments in the 400 to 900 base pair target size range for Roche/NimbleGen chips.
— Matthew Bainbridge (with Lynne Nazareth, head of Library Production at the HGSC)

We currently fragment DNA for most protocols using the Diagenode Bioruptor. This instrument allows six to 12 samples to be sonicated in sealed tubes at the same time with no worries of sample loss or cross-contamination. Although the instrument may not have the same options for very specific tuning as some other instruments, such as the Covaris device, it is very competitively priced at less than $10,000 and gets the job done well. In rare cases, we will fragment DNA using restriction enzymes for very specific applications.
— Shawn Levy

We currently use sonication although we are planning on switching to the Covaris platform in the near future. Sonication has worked more reliably than nebulization in our hands.
— Michael Zwick

Q2: What capture method do you prefer and why?

We prefer solution-phase capture. DNA:DNA hybridization kinetics in solution are better understood than those on a solid phase, and solution-phase hybridization is very DNA-efficient, allowing us to use around 1 µg of DNA, which is less than any of the available solid-phase technologies require. Also, solution-phase capture is more amenable to multiplexing and to automation than solid-phase capture.
— Jon Armstrong

We use Roche/NimbleGen solid capture microarrays. The current HD2 arrays allow us to capture the entire CCDS exon set (36 Mb of target) on a single array. They provide relatively Poisson-distributed coverage from target to target and very uniform coverage across the length of each target.
— Matthew Bainbridge

We view all of the capture technologies as works in progress and look forward to seeing continued development and advancement in the field of targeted DNA enrichment for genomic sequencing. That said, we have been very satisfied with array-based capture for sequencing thus far. The custom-designed NimbleGen arrays have been efficient and easy-to-use reagents for the capture of medium (250 kb) to large (6 Mb) amounts of genomic DNA in various projects in human and mouse. We were able to use existing microarray hybridization equipment to process the arrays, lowering the capital expense of establishing the protocol. The simplicity of the sample preparation methods has also allowed optimization for each array design and provided flexibility in the exact DNA conditions used as input.
— Shawn Levy

We have only used solid-state selection to date. We have focused on this approach for three reasons. First, the cost for performing a limited number of experiments is much lower than that of the solution-based methods. Second, we have developed the hardware necessary to carry out the experiment in our lab. Third, we are interested in capturing all classes of genetic variation (SNPs and indels), and it remains unclear if solution-based methods will be able to effectively capture indels. However, we are agnostic about the technology, and our ultimate goal is to choose the best technology capable of solving the human genetics questions we are focused on.
— Michael Zwick

Q3: What do you consider an acceptable rate of capture?

This would depend on the number of loci targeted. In our current experiments, we are seeing greater than 90 percent of the targets captured.
— Jon Armstrong

This is highly dependent on chip design. Chips with a smaller total target area spread across multiple targets typically get only 50 to 70 percent of reads on target, but this may represent an enrichment of 2x to 400x. Larger targets, or large contiguous regions, may get 80 to 90 percent of reads from the targeted region, but this may represent only 80x enrichment. Generally, we've focused more on ensuring uniformity of capture rather than on increasing enrichment rates.
— Matthew Bainbridge
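
To make the arithmetic in the answer above concrete, here is a minimal Python sketch (our illustration, not any contributor's actual pipeline) that converts an on-target read fraction into a fold-enrichment estimate. The read counts, target sizes, and the 3.1 Gb genome size are assumptions for the example.

    def fold_enrichment(on_target_reads, total_reads, target_bp, genome_bp=3.1e9):
        """Estimate fold enrichment: the observed on-target read fraction
        divided by the fraction of the genome the target occupies."""
        observed = on_target_reads / total_reads
        expected = target_bp / genome_bp
        return observed / expected

    # A small, dispersed 5 Mb design with 60 percent of reads on target:
    print(round(fold_enrichment(6_000_000, 10_000_000, 5e6)))   # 372x

    # A large 36 Mb design with 90 percent of reads on target:
    print(round(fold_enrichment(9_000_000, 10_000_000, 36e6)))  # 78x

This is why a chip with a higher on-target percentage can still show a lower enrichment factor: the expected-by-chance denominator grows with the size of the targeted region.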

Most of our experience is in microarray-based capture. That said, following data analysis, we expect to see greater than 95 percent coverage of the capture regions represented on the array at reasonably uniform density. Of course, a multitude of factors influences the final data quality in terms of coverage, and there are multiple ways to look at the data from a quality control perspective.
— Shawn Levy

The rate of capture (or enrichment) varies as a function of the targeted region. The focus of our work is to try to ensure a sufficient level of sequence coverage to allow accurate identification of genetic variation. In general, we are aiming for about 20x coverage.
— Michael Zwick
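
The coverage arithmetic behind a goal like 20x is straightforward; the sketch below (illustrative numbers, not a description of any contributor's run planning) estimates mean depth and the number of on-target reads a design requires.

    def mean_coverage(on_target_reads, read_length_bp, target_bp):
        """Average depth = total on-target bases / target size."""
        return on_target_reads * read_length_bp / target_bp

    def reads_needed(desired_coverage, read_length_bp, target_bp):
        """Invert the formula to plan a sequencing run."""
        return desired_coverage * target_bp / read_length_bp

    # On-target reads needed for 20x over a 2 Mb region with 36 bp reads:
    print(reads_needed(20, 36, 2e6))   # ~1.1 million

Note that this is mean depth only; the uniformity concerns raised throughout this guide determine how much of the target actually reaches that threshold.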

Q4: How do you work to increase the specificity of the capture?

Decreased specificity in a capture could be attributed to: 1) poorly designed capture probes, 2) less than optimal capture conditions, 3) insufficient blocking of repetitive sequences in genomic DNA, and 4) a sub-optimal ratio of genomic DNA to capture probes. We are currently refining the parameters for all of these conditions to increase the specificity, or the number of captured sequences that align to the target loci, in our hybridizations.
— Jon Armstrong

We have worked extensively to optimize capture and elution conditions as well as the amount of blocking DNA used. We are also working on new probe designs that we think will enrich the amount of DNA captured from our target regions.
— Matthew Bainbridge

We have optimized several points in the protocol to improve the specificity and efficiency of the overall assay. The final size, size distribution, and amount of DNA used in the assay have been examined. Additionally, the hybridization, washing, and elution methods have been optimized to improve yield and specificity. Through these efforts, we have been able to increase overall yield from the arrays by about 40 percent. Although we have seen an increase in yield, the specificity of the arrays remains about the same, with something in the range of 20 to 30 percent of the final yield aligning to the genome outside the capture region. The exact percentage varies by array design but is generally very consistent on the same array design.
— Shawn Levy

There are a number of ways one can optimize the hybridization and elution of samples. These include parameters such as temperature, oligo design, and elution of the captured material off the array.
— Michael Zwick

Q5: What potential sources of bias do you watch out for?

A source of bias could be introduced when very small amounts of genomic DNA are used in a capture. In this scenario, fewer genomic fragments are available for hybridization to capture probes; therefore, fewer fragments are captured and sequenced, which can affect the average depth and breadth of sequence coverage across the targeted loci. We are working to discover the lower limit of input genomic DNA.

Another source of bias could be introduced when fragments are PCR amplified in preparation for capture or for sequencing. PCR bias in library preparation can occur when too many cycles are performed and certain fragments are over-amplified relative to others.

The G/C content of the captured fragments can also play a role during PCR amplification prior to sequencing: certain fragments amplify more efficiently, so certain amplicons are sequenced many more times than others. The effect is an increase in the standard deviation of target coverage. Bias introduced during this PCR step also negatively impacts the number of unique sequence start sites generated.

We also see some bias from using whole genome amplified DNA to make capture libraries; however, we are still forging ahead with this and our experiments look promising.
— Jon Armstrong
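
One way to quantify the G/C effect described above is to bin targets by G/C content and compare coverage statistics across bins. The following sketch assumes per-target summaries are already in hand as (gc_fraction, mean_coverage) pairs; the function name and inputs are our own, not part of any contributor's workflow.

    from collections import defaultdict
    from statistics import mean, stdev

    def coverage_by_gc(targets, bin_width=0.05):
        """targets: iterable of (gc_fraction, mean_coverage) pairs, one per
        targeted region. Returns mean coverage and coefficient of variation
        per G/C bin; coverage sagging at the G/C extremes is the classic
        PCR-bias signature."""
        bins = defaultdict(list)
        for gc, cov in targets:
            bins[round(gc // bin_width * bin_width, 2)].append(cov)
        return {
            gc_bin: (mean(covs), stdev(covs) / mean(covs) if len(covs) > 1 else 0.0)
            for gc_bin, covs in sorted(bins.items())
        }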

Bias can take many forms. You can have bias within a target (end bias) and bias between targets, where some targets are preferentially enriched. You can also have bias against certain targets, which can arise from an inability to capture, sequence, or map to that target region. Lastly, because capture probes are designed against the normal reference, you can have allelic bias.
— Matthew Bainbridge

It is very important to avoid methods that result in false positives or false negatives in variant detection. Therefore, we are careful to avoid over-amplification or over-representation of particular DNA sequences. Obviously this can only be robustly assessed at the completion of an experiment, but we have examined a variety of protocol conditions and routinely assay for unique fragment start sites to make sure that there is good coverage of the capture region by independent fragments. This provides substantially higher confidence in the data; in a recent experiment, we compared independent genotyping data to results from an array capture experiment and were able to show 100 percent concordance between the methods.
— Shawn Levy
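
The unique-start-site assay mentioned above reduces to counting distinct alignment start positions among on-target reads; a ratio near 1 indicates coverage built from independent fragments rather than PCR copies. A minimal sketch, assuming reads are available as (chrom, start, strand) tuples:

    def unique_start_fraction(reads):
        """reads: iterable of (chrom, start, strand) alignment tuples
        falling within the capture region."""
        reads = list(reads)
        return len(set(reads)) / len(reads)

    reads = [("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 142, "-")]
    print(unique_start_fraction(reads))  # ~0.67: one start site is duplicated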

We are most concerned with capturing sequences from other locations in the genome that were not targeted in the experiment.
— Michael Zwick

Q6: What quality control steps do you include?

We sequence the capture oligo pool to understand how many of the expected oligos are actually present in the pool. In addition, our capture probes are checked for the correct length and concentration on a Cambrex FlashGel and an Agilent Bioanalyzer prior to hybridization.

We have used PCR amplification of genomic DNA amplicons from a number of the targets as a quality control step. If we see the amplicons, we captured those fragments. In the future, we will probably implement qPCR of capture fragments using primers complementary to sequences contained in the targets.
— Jon Armstrong
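
When qPCR is used this way, fold enrichment is commonly estimated from cycle-threshold (Ct) shifts between pre- and post-capture samples, normalized against an off-target control locus. A sketch of that calculation, assuming roughly 100 percent PCR efficiency (template doubling each cycle); the Ct values are invented for illustration:

    def qpcr_fold_enrichment(ct_target_pre, ct_target_post,
                             ct_control_pre, ct_control_post):
        """Relative fold enrichment of a target locus after capture,
        assuming ~100 percent PCR efficiency."""
        target_gain = ct_target_pre - ct_target_post     # cycles saved at the target
        control_gain = ct_control_pre - ct_control_post  # cycles saved off target
        return 2 ** (target_gain - control_gain)

    # Target crosses threshold 8 cycles earlier after capture, while an
    # off-target control crosses 2 cycles later (it was depleted):
    print(qpcr_fold_enrichment(28.0, 20.0, 24.0, 26.0))  # 1024.0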

In addition to the standard QC used for the various sequencing platforms, we also evaluate the quality of the capture with qPCR. We then require a certain rate of mappable reads (platform dependent, typically 50 to 80 percent) and a certain percentage of reads mapping to the target regions (target dependent, typically corresponding to 80x to 400x enrichment).

Lastly, we evaluate the quality of the SNPs discovered: we typically expect to see 80 to 95 percent of the SNPs in dbSNP and, if other genotyping data are available, 95 percent-plus concordance.
— Matthew Bainbridge
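
Both of these checks reduce to simple set arithmetic over the variant calls. A sketch with hypothetical inputs (the position sets and genotype dictionaries are our own devising):

    def dbsnp_fraction(called_positions, dbsnp_positions):
        """Fraction of called SNP positions already catalogued in dbSNP."""
        called = set(called_positions)
        return len(called & set(dbsnp_positions)) / len(called)

    def genotype_concordance(calls_a, calls_b):
        """Concordance at positions genotyped by both platforms.
        calls_a, calls_b: dicts mapping position -> genotype call."""
        shared = calls_a.keys() & calls_b.keys()
        return sum(calls_a[p] == calls_b[p] for p in shared) / len(shared)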

Given the expense of DNA capture and sequencing, in addition to the biological value of the samples, we have implemented quality control steps to help ensure the integrity of the assay. We perform quality control at several steps in the sample preparation procedure to ensure appropriate sample conditions (DNA size, yield, etc.). Following hybridization, washing, and elution, we examine yield from the array and fold enrichment based on the comparison of an enriched sample to a non-enriched sample via real-time PCR.

There are additional quality control metrics for the data post-capture and post-sequencing. The first is the percentage of total sequence reads that align to the capture region. Depending on array design, we expect this to be 70 to 80 percent. We next examine coverage of the capture region and uniformity of coverage. Absolute coverage numbers depend on the specific sequencing technology used as well as the amount of sequence captured, but looking at the distribution of coverage is important to make sure it is reasonably uniform. One thing we have noticed is that the first and last 20 to 25 bases of each target region are usually not represented in the final data set. We are starting to look at the array designs to see if there is an opportunity to improve coverage at the ends (other than simply extending the target regions by 50 bases per fragment).

The final quality control step is examining for over-representation of any single fragment; in most cases, duplicate fragments are removed prior to analyzing for sequence variants. We have observed that removal of duplicate reads from the same fragment (not reads from separate fragments covering the same base) drastically improves the final data quality.
— Shawn Levy
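
In practice, duplicate removal of this kind is handled by established tools such as Picard or SAMtools, but the core idea reduces to keeping one read per alignment key, as in this simplified sketch (paired-end handling and quality-based tie-breaking omitted):

    def remove_duplicates(reads):
        """Keep one read per (chrom, start, strand) alignment key. Reads
        sharing a key are treated as PCR copies of one fragment; reads from
        separate fragments covering the same base survive because their
        start coordinates differ."""
        seen, kept = set(), []
        for read in reads:
            key = (read["chrom"], read["start"], read["strand"])
            if key not in seen:
                seen.add(key)
                kept.append(read)
        return kept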

We perform qPCR to assess yield at significant steps in the protocol.
— Michael Zwick

List of Resources

Genome capture and partitioning are just getting off the ground. Still, there are many resources to turn to when more information is needed to run a successful and well-designed experiment.

Publications

Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler DA, Song X, Richmond TA, Middle CM, Rodesch MJ, Packard CJ, Weinstock GM, Gibbs RA. (2007). Direct selection of human genomic loci by microarray hybridization. Nature Methods. 4 (11): 903-905.

Bashiardes S, Veile R, Helms C, Mardis ER, Bowcock AM, Lovett M. (2005). Direct genomic selection. Nature Methods. 2: 63-69.

Bau S, Schracke N, Kränzle M, Wu H, Stähler PF, Hoheisel JD, Beier M, Summerer D. (2009). Targeted next-generation sequencing by specific capture of multiple genomic loci using low-volume microfluidic DNA arrays. Analytical and Bioanalytical Chemistry. 393 (1): 171-175.

Dahl F, Stenberg J, Fredriksson S, Welch K, Zhang M, Nilsson M, Bicknell D, Bodmer WF, Davis RW, Ji H. (2007). Multigene amplification and massively parallel sequencing for cancer mutation discovery. Proceedings of the National Academy of Sciences of the USA. 104 (22): 9387–9392.

Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C, Gabriel S, Jaffe DB, Lander ES, Nusbaum C. (2009). Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nature Biotechnology. 27 (2): 182-189.

Hardenbol P, Baner J, Jain M, Nilsson M, Namsaraev EA, Karlin-Neumann GA, Fakhrai-Rad H, Ronaghi M, Willis TD, Landegren U, Davis RW. (2003). Multiplexed genotyping with sequence-tagged molecular inversion probes. Nature Biotechnology. 21: 673–678.

Hodges E, Xuan Z, Balija V, Kramer M, Molla MN, Smith SW, Middle CM, Rodesch MJ, Albert TJ, Hannon GJ, McCombie WR. (2007). Genome-wide in situ exon capture for selective resequencing. Nature Genetics. 39 (12): 1522-1527.

Krishnakumar S, Zheng J, Wilhelmy J, Faham M, Mindrinos M, Davis R. (2008). A comprehensive assay for targeted multiplex amplification of human DNA sequences. Proceedings of the National Academy of Sciences of the USA. 105 (27): 9296-9301.

Ni T, Wu H, Song S, Jelley M, Zhu J. (2009). Selective gene amplification for high-throughput sequencing. Recent Patents on DNA & Gene Sequences. 3 (1): 29-38.

Okou DT, Steinberg KM, Middle C, Cutler DJ, Albert TJ, Zwick ME. (2007). Microarray-based genomic selection for high-throughput resequencing. Nature Methods. 4 (11): 907-909.

Porreca GJ, Zhang K, Li JB, Xie B, Austin D, Vassallo SL, LeProust EM, Peck BJ, Emig CJ, Dahl F, Gao Y, Church GM, Shendure J. (2007). Multiplex amplification of large sets of human exons. Nature Methods. 4 (11): 931-936.

Shendure J, Ji H. (2008). Next-generation DNA sequencing. Nature Biotechnology. 26 (10): 1135-1145.

Stenberg J, Zhang M, Ji H. (2009). Disperse — a software system for design of selector probes for exon resequencing applications. Bioinformatics. 25 (5): 666-667.

Turner EH, Lee C, Ng SB, Nickerson DA, Shendure J. (2009). Massively parallel exon capture and library-free resequencing across 16 genomes. Nature Methods. 6: 315-316.

Zheng J, Moorhead M, Weng L, Siddiqui F, Carlton VEH, Ireland JS, Lee L, Peterson J, Wilkins J, Lin S, Kan Z, Seshagiri S, Davis RW, Faham M. (2009). High-throughput, high-accuracy array-based resequencing. Proceedings of the National Academy of Sciences of the USA. 106 (16): 6712-6717.