Next-Gen Sequencing Sample Preparation Technical Guide

Table of Contents

Letter from the Editor
Index of Experts
For Roche/454 Users:
Q1: How do you ensure accuracy and reproducibility when you isolate genomic regions of interest to be sequenced?
Q2: How do you optimize the amount of input DNA?
Q3: What steps do you take to ensure a time-effective sample preparation protocol?
For Illumina/Solexa Users:
Q4: How do you ensure accuracy and reproducibility when you isolate genomic regions of interest to be sequenced?
Q5: How do you optimize the amount of input DNA?
Q6: What steps do you take to ensure a time-effective sample preparation protocol?
List of Resources

Letter from the Editor

For the first installment in what we envision to be a series, GT looks to the future of next-gen sequencing. By the end of last year, three next-gen platforms had made it to market: Roche/454's Genome Sequencer FLX (an upgrade of the Genome Sequencer 20); Illumina's Genome Analyzer; and Applied Biosystems's SOLiD sequencer. For the purposes of this guide, we've focused on Roche and Illumina, the two platforms that have been around for a year or more, to ensure that our experts have had enough time to refine their protocols. While there’s been less demand in the research market for CE instruments, next-gen platforms have roared to life for a number of applications, including de novo genome sequencing, gene expression profiling, ChIP sequencing, small RNA analysis, metagenomics, and resequencing. And considering the ever-declining prices, no doubt scientists will continue to use them for efficient, high-throughput sequencing analyses.

One area that seems to present the most difficulty in these early days is sample preparation. To that end, we've gathered experts familiar with both platforms to lend their insight to the challenge of maintaining efficient, standardized procedures. In this guide, users offer advice on isolating genomic regions of interest, optimizing the amount of input DNA, and ensuring timely preparation procedures. As always, don't miss our resources section, which lists additional places to go for advice on how to keep your next-gen runs as accurate and reproducible as possible.

— Jeanene Swanson

Index of Experts

Genome Technology would like to thank the following contributors for taking the time to respond to the questions in this tech guide.

Ghia Euskirchen
(Mike Snyder's lab)
Yale University

Yuan Gao
Virginia Commonwealth University

Neil Hall
University of Liverpool

Stephen Kingsmore
National Center for Genome Resources

Matthias Meyer
Max Planck Institute for Evolutionary Anthropology

Kenneth Nelson
(Mike Snyder's lab)
Yale University

Anoja Perera
Stowers Institute for Medical Research

Richard Reinhardt
Max Planck Institute for Molecular Genetics

Bruce Roe
University of Oklahoma

Agnes Viale
Memorial Sloan-Kettering Cancer Center

For Roche/454 Users:

Q1: How do you ensure accuracy and reproducibility when you isolate genomic regions of interest to be sequenced?

We don't do much of this. So far we have only isolated genomic regions using high fidelity PCR.

— Neil Hall

Apart from whole genome shotgun sequencing we are currently targeting small genomic regions, which can be easily enriched through pre-amplification by PCR or long-range PCR. In our hands, the success of long-range PCR greatly varies not only with DNA quality, but also with the PCR system, and we found it helpful to evaluate the performance of kits from different suppliers.

— Matthias Meyer

We don't use the 454 for re-sequencing but mainly for de novo sequencing (based on pooled BACs, which have been individually measured and adjusted) and cDNA/microRNA [libraries].

— Richard Reinhardt

We actually rarely focus on specific genomic regions, but when we do, we use "Touchdown" PCR coupled with a second round of nested primers and "Touchdown" PCR to amplify genomic DNA regions of interest.

— Bruce Roe
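
For readers unfamiliar with touchdown PCR (see Don et al. in the resources list), the idea is that the annealing temperature starts high and steps down over the early cycles until it reaches the working temperature. A minimal sketch in Python of such a temperature schedule follows. The function name and all temperatures, step sizes, and cycle counts are hypothetical illustrations, not the parameters used in any contributor's lab, and the second nested round is not modeled.

```python
def touchdown_schedule(start_c, final_c, step_c, total_cycles):
    """Annealing temperature for each cycle of a touchdown program.

    The temperature drops by step_c per cycle from start_c until it reaches
    final_c, then stays at final_c for the remaining cycles.
    """
    if step_c <= 0 or start_c < final_c:
        raise ValueError("need step_c > 0 and start_c >= final_c")
    temps = []
    current = start_c
    for _ in range(total_cycles):
        temps.append(current)
        # Never go below the final annealing temperature.
        current = max(final_c, current - step_c)
    return temps
```

Calling `touchdown_schedule(65, 60, 1, 8)` gives one annealing temperature per cycle, which can then be typed into a thermocycler program.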

We require core facility users to provide purified genomic DNA. For whole genome sequencing, we have obtained good quality 454 data using DNA extracted by several different methods (e.g., DNeasy and Proteinase K/phenol-chloroform extraction kits from Qiagen).

We do not believe the purification method is a critical parameter provided that the resultant DNA is high molecular weight and very clean (260/230 > 1.7). For amplicon resequencing, a proofreading polymerase should be used during the amplification. We routinely purify the PCR product using the Agencourt AMPure kit. We have not yet optimized protocols for resequencing long-range PCR products.

— Agnes Viale

Q2: How do you optimize the amount of input DNA?

We use a 2:1 template-to-bead ratio. We don't do titrations and have had consistent runs between 100 Mb and 150 Mb.

— Neil Hall

The material requirements for 454 sequencing are very low; 1 nanogram or less of starting material will usually produce sufficient library for sequencing, and there is, in principle, no requirement for optimizing the amount of input DNA. However, this is only true if quantitative PCR is used to estimate the copy number in the sequencing library. The quantification methods suggested in Roche’s library preparation protocol are not sufficiently sensitive, and micrograms of input material are required to detect the resulting libraries on Agilent chips or in RiboGreen assays. I generally recommend implementing quantitative PCR when working with the 454 platform. It not only drastically reduces the material requirements to nanograms or picograms, but in our experience also gives more consistent sequence yields. Of the 100 or so libraries we quantified with this method, most gave optimal sequence numbers without further titration runs. When using the method for the first time, it is advisable to include an existing, well-titrated sequencing library in the measurements for use as an initial reference point.

— Matthias Meyer
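
To illustrate how a well-titrated reference library can anchor a qPCR estimate, here is a minimal sketch in Python of a single-point relative calculation. It is not the published protocol (see Meyer et al. in the resources list for that); the function name, the default amplification efficiency of 2.0 (perfect doubling per cycle), and the example values are assumptions made for illustration.

```python
def copies_from_ct(ct_sample, ct_reference, reference_copies, efficiency=2.0):
    """Estimate library copy number relative to a reference library.

    Each cycle of difference in threshold cycle (Ct) corresponds to one
    multiplication by the amplification efficiency. A sample that crosses
    the threshold later than the reference contains fewer copies.
    """
    if efficiency <= 1.0:
        raise ValueError("efficiency must be greater than 1")
    return reference_copies * efficiency ** (ct_reference - ct_sample)
```

For example, a sample that crosses the threshold two cycles after a reference of 1e8 copies would be estimated at a quarter of that, under the perfect-doubling assumption. A multi-point standard curve is generally more robust than this single-point estimate.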

The titration step is the most accurate and best method to optimize input DNA. However, to get into the ballpark, we use RiboGreen and PicoGreen (Invitrogen) assays for quantity and Agilent Bioanalyzer for sizing. (These are the standard 454 methods, but we find that they are essential and cannot be skipped.) A German group recently published a method using qPCR, but we have not tried that yet.

— Kenneth Nelson

We generally check the quality using the Agilent system from which we extract empirical factors, and in some cases we use titration runs.

— Richard Reinhardt

We typically begin making our library with 5 to 10 µg input DNA, and at various stages we quantitate the DNA on the Caliper AMS-90. In the emPCR step we use less input DNA (0.8 molecules of DNA per bead) rather than the 1.0 to 1.2 molecules of DNA recommended by Roche/454.

— Bruce Roe
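
A copy-per-bead ratio such as those quoted by the contributors becomes a pipetting volume once the library concentration and mean fragment length are known. The following is a minimal sketch in Python, assuming double-stranded DNA at roughly 660 g/mol per base pair. The function names, the bead count, and the example concentration are hypothetical and are not taken from any Roche/454 protocol.

```python
AVOGADRO = 6.022e23   # molecules per mole
BP_MASS = 660.0       # approximate g/mol per double-stranded base pair

def molecules_per_ul(conc_ng_per_ul, mean_length_bp):
    """Estimate dsDNA molecules per microliter from a mass concentration."""
    grams_per_ul = conc_ng_per_ul * 1e-9
    grams_per_mole = BP_MASS * mean_length_bp
    return grams_per_ul / grams_per_mole * AVOGADRO

def library_volume_ul(n_beads, copies_per_bead, conc_ng_per_ul, mean_length_bp):
    """Volume of library that delivers the requested copies-per-bead ratio."""
    needed_molecules = n_beads * copies_per_bead
    return needed_molecules / molecules_per_ul(conc_ng_per_ul, mean_length_bp)
```

Because the required volume scales linearly with the ratio, dropping from 1.0 to 0.8 copies per bead simply means adding 80 percent of the volume. In practice the library would be diluted first so that the volume is large enough to pipette accurately.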

This step is crucial. An inadequate copy-per-bead ratio can completely spoil a run. If the DNA is a discrete band, we use a PicoGreen-based quantification method to calculate the molarity of the sample. If the starting material is a smear (e.g., cDNA), we use the PicoGreen results but we size-weight the value according to the Agilent Bioanalyzer DNA 1000 Assay results. This approach was developed empirically but it works fairly well.

— Agnes Viale
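
The difference between quantifying a discrete band and a smear can be sketched in Python as follows. This is one plausible way to size-weight a mass measurement, not necessarily the empirical procedure described above. It assumes double-stranded DNA at roughly 660 g/mol per base pair, and the function names and the size-profile format are hypothetical.

```python
BP_MASS = 660.0  # approximate g/mol per double-stranded base pair

def molarity_nm(conc_ng_per_ul, length_bp):
    """Molarity in nM of a discrete band of dsDNA."""
    return conc_ng_per_ul * 1e6 / (BP_MASS * length_bp)

def smear_molarity_nm(conc_ng_per_ul, size_profile):
    """Molarity in nM of a smear.

    size_profile is a list of (length_bp, mass_fraction) pairs, for example
    read off a sizing trace; the mass fractions should sum to 1.
    """
    total = sum(fraction for _, fraction in size_profile)
    if abs(total - 1.0) > 1e-6:
        raise ValueError("mass fractions must sum to 1")
    # Each size bin contributes its own share of mass, converted at its own length.
    return sum(molarity_nm(conc_ng_per_ul * fraction, length)
               for length, fraction in size_profile)
```

The point of the weighting is that, for the same mass, short fragments contribute more molecules than long ones, so treating a smear as a single band at its peak size can misestimate the copy number.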

Q3: What steps do you take to ensure a time-effective sample preparation protocol?

Really we only use the manufacturer’s protocol. Shortcuts such as centrifuging to break emulsions have not worked for us. At the moment, we find that shortcuts have reduced our throughput.

— Neil Hall

Sample preparation for 454 sequencing in our lab often involves barcoding of multiple samples before the construction of a single sequencing library. This adapts the 454 technology for use with multiple samples and in many cases better exploits the sequencing resources. Since the barcoding reactions add to the time required for sample preparation, we have developed a protocol for multichannel setup in plates, allowing for partial automation on a pipetting robot. Once the samples are barcoded, Roche’s standard protocol for sequencing library preparation only takes some hours. However, we have observed that sequencing libraries degrade very rapidly. Freezing libraries in aliquots immediately after their production is very helpful to decrease the risk of failed or suboptimal sequencing runs, and can therefore save a lot of time and money on this side.

— Matthias Meyer
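
Once barcoded samples have been sequenced together, the reads have to be assigned back to their samples. A minimal sketch of that step in Python is shown here. It assumes the barcode sits at the start of each read, uses exact matching only, and assumes no barcode is a prefix of another. The function name and the barcode sequences in the example are hypothetical and do not come from the published tagging protocol.

```python
from collections import defaultdict

def demultiplex(reads, barcodes):
    """Group reads by their leading barcode and trim the barcode off.

    barcodes maps a barcode sequence to a sample name. Reads that match no
    barcode are collected under the key None.
    """
    by_sample = defaultdict(list)
    for read in reads:
        for tag, sample in barcodes.items():
            if read.startswith(tag):
                by_sample[sample].append(read[len(tag):])
                break
        else:
            # No barcode matched this read.
            by_sample[None].append(read)
    return dict(by_sample)
```

A production pipeline would typically also tolerate sequencing errors in the barcode, which this sketch does not attempt.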

Since the library prep usually yields enough DNA for multiple sequencing runs, one careful library prep is very time effective. We have not found any real shortcuts to the Roche/454 protocols.

— Kenneth Nelson

We consistently stick to the protocols supplied by Roche.

— Richard Reinhardt

We adhere to a strict time schedule for the library and emPCR protocols that has been established over the past two-plus years. My technicians and students doing these protocols also work in teams and that helps keep to the set schedule.

— Bruce Roe

At this point, we are still processing our samples manually. To reduce reagent cost, we first set up two or three emPCR reactions per sample with different copy-per-bead ratios. Then, based on the percentage of bead recovery, we select an optimal ratio and process the remaining samples using this ratio for the emPCR. This process bypasses the titration on PTP, but does not reduce the processing time (in general, we perform sample preparation/processing Monday through Thursday and run the 454 Thursday nights).

— Agnes Viale
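
Choosing a ratio from a handful of pilot emPCR reactions can be reduced to a small lookup. The sketch below, in Python, picks the ratio whose bead recovery lies closest to a target percentage. The target value is something each lab would have to set for itself; it is not given in the answer above, and the function name and example numbers are hypothetical.

```python
def pick_ratio(recovery_by_ratio, target_percent):
    """Return the copy-per-bead ratio whose bead recovery is closest to the target.

    recovery_by_ratio maps each tested ratio to the percentage of beads recovered.
    """
    if not recovery_by_ratio:
        raise ValueError("no pilot reactions given")
    return min(recovery_by_ratio,
               key=lambda ratio: abs(recovery_by_ratio[ratio] - target_percent))
```

For instance, with pilot results of `{0.5: 4.0, 1.0: 9.0, 2.0: 21.0}` and a target of 10 percent, the function returns the ratio 1.0.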

For Illumina/Solexa Users:

Q4: How do you ensure accuracy and reproducibility when you isolate genomic regions of interest to be sequenced?

Most of our Solexa (Illumina) work is ChIP sequencing. Many of the standards that were developed for ChIP-chip also apply to ChIP-seq, with antibody validation being critical to all ChIP experiments. We validate antibodies by IP-western as well as by mass spectrometry. For reproducibility we perform and evaluate three biological replicates, zeroing in on control loci if they are known for a given factor.

— Ghia Euskirchen

We pretty much check the accuracy and reproducibility by:

• mapping the reads to the regions of our interest
• using Sanger sequencing to confirm
• performing technical replicates to see correlation

— Yuan Gao
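
Correlation between technical replicates is commonly summarized with a Pearson coefficient. A minimal sketch in Python, using only the standard library, is given here. What is being compared (for example, read counts per region in each replicate) is an assumption, as is the function name.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 suggests that the two replicates rank and scale the regions similarly. Count data with a few very large values may be better compared after a log transform.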

The National Center for Genome Resources currently has two Solexa-Illumina sequencers in full-time operation and a third on its way. About one half of our throughput represents in-house samples and the other half are provided by academic and industry clients nationwide. To date, we have brought two applications into full production: genomic DNA sequencing and messenger RNA sequencing. The mRNA protocol was developed by Gary Schroth's group at Illumina and has been tweaked by Jim Huntley at NCGR, while our genomic DNA protocol is standard. For these sample types, we have developed standard procedures and a LIMS system to ensure accuracy and reproducibility. It tracks each sample through the Solexa sequencing process and Joann Mudge at NCGR has been working hard to validate quality metrics at various stages of the process.

The standard yield that passes quality control from seven channels is ~1 gigabase of singleton reads. Our standard read length is 36 bp, although we've recently been extending this to 46 bp. One neat accuracy check that we've done is to run a set of samples both on the Solexa sequencer and on Infinium HapMap 550K genotyping chips. This has helped us tremendously to validate raw and bioinformatically filtered SNP detection accuracy. For nucleotide variant detection and management of case-control association studies we are using a software system we've developed called Alpheus (http://alpheus.ncgr.org/).

For other sample types, such as isolated genomic regions of interest, we ask clients to do the isolation and first steps in the library preparation. They ship us libraries and we generate clusters and sequence them. The yield and quality of these libraries vary.

— Stephen Kingsmore
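
Checking sequencing-derived genotypes against a genotyping chip comes down to a concordance rate over the sites both methods cover. A minimal sketch in Python is shown here. It is not how Alpheus works; the function name, the site identifiers, and the genotype encoding (two-letter strings compared without regard to allele order) are assumptions.

```python
def concordance(seq_calls, chip_calls):
    """Fraction of shared sites where sequencing and chip genotypes agree.

    Both arguments map a site identifier to a genotype string such as "AG".
    Genotypes are compared without regard to allele order.
    """
    shared = set(seq_calls) & set(chip_calls)
    if not shared:
        raise ValueError("no sites in common")
    agree = sum(1 for site in shared
                if sorted(seq_calls[site]) == sorted(chip_calls[site]))
    return agree / len(shared)
```

Running the same comparison on raw and on filtered calls gives a simple before-and-after measure of what the filtering buys.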

So far we have not isolated genomic regions. We have only performed whole genome-wide experiments. In the future, if we do isolate regions we will have to perform validation experiments. The type of validation experiment will depend on what regions are isolated and the techniques used to isolate. For instance, if we do long-range PCR to isolate a small region, we could run a gel to ensure we are amplifying the expected size. Also, we can perform Sanger sequencing with the PCR primers to confirm the amplified region.

— Anoja Perera

UV- or gel-based measurements of any kind are used to determine the amount of PCR-amplified sample or cDNA [libraries] for expression profiling or ChIP-based experiments.

— Richard Reinhardt

Q5: How do you optimize the amount of input DNA?

Library size is an important parameter in obtaining good quality data. We monitor library performance in part by examining sequence data for identical reads which can be generated during the PCR amplification step if insufficient starting material was used. Additionally, if there is an excess of adapters relative to input material, the adapters ligate to each other without an insert and yield a large number of adapter reads.

— Ghia Euskirchen
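
Both warning signs described above, identical reads and adapter-only reads, can be counted directly from the sequence data. A minimal sketch in Python follows. The function name is hypothetical, the adapter passed in the example is a placeholder rather than a real Illumina adapter sequence, and reads are treated as plain strings with exact matching.

```python
from collections import Counter

def library_qc(reads, adapter):
    """Return (duplicate_fraction, adapter_fraction) for a list of reads.

    duplicate_fraction counts every read beyond the first copy of a sequence.
    adapter_fraction counts reads that begin with the given adapter sequence.
    """
    if not reads:
        raise ValueError("no reads given")
    counts = Counter(reads)
    duplicates = sum(n - 1 for n in counts.values())
    adapter_reads = sum(1 for read in reads if read.startswith(adapter))
    return duplicates / len(reads), adapter_reads / len(reads)
```

A high duplicate fraction points toward too little starting material, and a high adapter fraction toward an excess of adapters relative to insert. What counts as "high" depends on the experiment; ChIP samples, for example, are expected to pile up at bound sites.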

We have used different amounts of input DNA to make libraries and then determined which concentration yields better results. We found that the most important optimization is the input library concentration. We usually use 3 pM to 4 pM of library DNA to generate clusters. There are many ways to measure the concentration of the library. We used a combination of measuring the amount of input DNA by Nanodrop and running against a quantitative marker on a gel. We highly recommend doing both, as this may be the most important factor to determine your final sequencing output.

— Yuan Gao

There are two points at which we seek to optimize the amount of input material. The first is at the time of RNA library generation, when many clients want to generate sequence from as little as 1 microgram of total RNA. The second point is at cluster generation. Addition of either too much or too little library results in fewer sequence reads. The optimal number of clusters will generate almost 5 million passing reads per channel. We use an Agilent Bioanalyzer to determine the library concentration and typically load 1 pM to 3.5 pM.

— Stephen Kingsmore
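
The loading concentrations quoted in the two answers above (3 pM to 4 pM, and 1 pM to 3.5 pM) are reached by diluting a much more concentrated library stock. The arithmetic is the usual C1 x V1 = C2 x V2, sketched here in Python. The function name, the stock concentration, and the final volume in the example are hypothetical.

```python
def stock_volume_ul(stock_nm, target_pm, final_volume_ul):
    """Volume of stock needed so the final mix is at target_pm (C1*V1 = C2*V2)."""
    target_nm = target_pm / 1000.0  # 1 nM = 1000 pM
    if target_nm > stock_nm:
        raise ValueError("target concentration exceeds the stock concentration")
    return target_nm * final_volume_ul / stock_nm
```

For a 10 nM stock, a 4 pM target in 1,000 µL works out to 0.4 µL of stock, which is too small to pipette reliably; an intermediate dilution step is the usual way around that.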

Quantity as well as quality matters when it comes to input DNA. Here, an efficient cleanup technique is a must!

— Anoja Perera

We try to determine the number of clusters generated by fluorescence measurement, but mainly it is based on experience and empirical factors.

— Richard Reinhardt

Q6: What steps do you take to ensure a time-effective sample preparation protocol?

We find the genomic and ChIP DNA library preparation to be quite straightforward. Mostly we try to space out our samples during the library preparation to avoid any cross-contamination.

— Ghia Euskirchen

Solexa sample preparation is easy enough. We pretty much follow Illumina's protocol.

— Yuan Gao

The Solexa-Illumina sample preparation protocol is fast (~a day) and several libraries can be generated simultaneously. The bottlenecks in the process are not at sample preparation, but at cluster generation (we have two cluster stations for two sequencers to alleviate this), sequence generation (particularly when we are generating 46-bp reads), basecalling, and genomic alignments.

— Stephen Kingsmore

Plan ahead of time, set up a schedule, and organize yourself. Familiarize yourself with the protocols beforehand. Make sure all reagents and supplies are available to work with. Have a backup plan! For example, have extra supplies in case something goes wrong. We have had two faulty amplification manifolds in the past, and if we didn't have backup ones our experiments would have been delayed. Read your protocols and draw out timelines next to the steps. The gene expression protocols take three full days and without proper preparation you will be putting in more than eight hours a day. Look to the next steps while you are on a waiting step to see what needs to be thawed to cut out downtime. Arrange your work area to maximize workflow.

— Anoja Perera

We use the cluster station from Illumina and try to follow the protocols consistently.

— Richard Reinhardt

List of Resources

Our panel of experts referred to a number of publications and online tools that may be able to help you get a handle on sample preparation for next-generation sequencing. Whether you're a novice or pro at this new technology, these resources are sure to come in handy.

Publications

Brockman W, Alvarez P, Young S, Garber M, Giannoukos G, Lee WL, Russ C, Lander ES, Nusbaum C, Jaffe DB. Quality scores and SNP detection in sequencing-by-synthesis systems. Genome Res. Jan 22, 2008 [Epub ahead of print].

Don RH, Cox PT, Wainwright BJ, Baker K, Mattick JS. 'Touchdown' PCR to circumvent spurious priming during gene amplification. Nucleic Acids Res. 19(14): 4008 (1991).

Fahlgren N, Howell MD, Kasschau KD, Chapman EJ, Sullivan CM, Cumbie JS, Givan SA, Law TF, Grant SR, Dangl JL, Carrington JC. High-throughput sequencing of Arabidopsis microRNAs: evidence for frequent birth and death of MIRNA genes. PLoS ONE. 2(2):e219 (2007).

Hafner M, Landgraf P, Ludwig J, Rice A, Ojo T, Lin C, Holoch D, Lim C, Tuschl T. Identification of microRNAs and other small regulatory RNAs using cDNA library sequencing. Methods. 44(1):3-12 (2008).

Hillier LW, Marth GT, Quinlan AR, Dooling D, Fewell G, Barnett D, Fox P, Glasscock JI, Hickenbotham M, Huang W, Magrini VJ, Richt RJ, Sander SN, Stewart DA, Stromberg M, Tsung EF, Wylie T, Schedl T, Wilson RK, Mardis ER. Whole-genome sequencing and variant discovery in C. elegans. Nat Methods. 5(2):183-8 (2008).

Meyer M, Briggs AW, Maricic T, Höber B, Höffner B, Krause J, Weihmann A, Pääbo S, Hofreiter M. From micrograms to picograms: Quantitative PCR reduces the material demands of high-throughput sequencing. Nucleic Acids Res. 36(1):e5 (2008).

Meyer M, Stenzel U, Hofreiter M. Parallel tagged sequencing on the 454 platform. Nature Protocols. 3:267-278 (2008).

Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, Thiessen N, Griffith OL, He A, Marra M, Snyder M, Jones S. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods. 4(8):651-7 (2007).

Rusk N, Kiermer V. Primer: Sequencing — the next generation. Nat Methods. 5(1):15 (2008).

Schuster SC. Next-generation sequencing transforms today's biology. Nat Methods. 5(1):16-8 (2008).

Tarasov V, Jung P, Verdoodt B, Lodygin D, Epanchintsev A, Menssen A, Meister G, Hermeking H. Differential regulation of microRNAs by p53 revealed by massively parallel sequencing: miR-34a is a p53 target that induces apoptosis and G1-arrest. Cell Cycle. 6(13):1586-93 (2007).

Wold B, Myers RM. Sequence census methods for functional genomics. Nat Methods. 5(1): 19-21 (2008).

Conferences

Next Generation Sequencing: Platforms, Applications, and Case Studies (CHI conference)
http://www.healthtech.com/2008/seq/index.asp

Next Generation Sequencing Symposium
http://www.nminbre.org/pages/events/nmbis/2008/

Next-Generation Sequencing Data Management
http://blog.bioteam.net/2008/01/15/workshopnext-generation-sequencing-data-management/