Sponsor: Rubicon Genomics
Recording Date: 2/19/2014
Recording Time: 1 hour
Next-gen Sequencing for AgBio
Table of Contents
Letter from the Editor
Index of Experts
Q1: What steps do you take to optimize sample quality?
Q2: How do you maximize the amount of input DNA?
Q3: How do you ensure accuracy and reproducibility in your sequencing run?
Q4: How do you best perform alignment and/or assembly? What specific tools do you use?
Q5: How do you annotate your genome? What methods do you use to perform comparative analysis?
Q6: What tools do you use for data visualization and/or computational analysis?
List of Resources
With next-generation sequencing technology becoming more affordable, there's been a near-explosion of sequencing projects for plants and animals. In the agricultural biology space, these projects encompass organisms as diverse as avocado, barley, swine, cassava, horse, maize, sheep, and many other genomes that might be important as crops or livestock.
In our first technical guide addressing this area, we covered sample prep, and our experts tipped readers off to the challenges of working with different types of plants, incomplete genomes, and a lack of protocols. For this guide, we gathered sequencing experts to address the challenges of using next-gen sequencing platforms for these organisms. Here, you'll find answers to questions involving sample extraction, alignment, and assembly problems, especially when there are no sequenced genomes to use as a reference.
There's also a selection of publications and web sites that will serve as a handy resource for making your sequencing runs and data analysis easier.
— Jeanene Swanson
Many thanks to our experts for taking the time to contribute to this technical guide, which would not be possible without them.
John Innes Centre, Norwich Research Park
Institute for Food and Agricultural Research and Technology (IRTA), Barcelona
UC Davis Genome Center
(with input from Charlie Nicolet, Marta Matvienko, Alex Kozik, and Dawei Lin)
Genoscope, National Sequencing Center, France
Our preferred source of biological material is tissues from actively expanding leaves, from glasshouse-grown plants. These give good yields of either DNA or RNA. Much of our work with next-generation sequencing has involved transcriptome sequencing, so the usual rules for RNA quality apply, i.e. rapid freezing of tissue in liquid nitrogen, ensuring material stays frozen whilst being ground to a fine powder, and use of RNAse-free reagents and equipment throughout the extraction and purification process.
Sample quality is essential for the preparation of 454 libraries. For de novo sequencing projects in plants, removing as much chloroplast as possible from the starting material is essential. We etiolated plants for two to three days and then enriched the sample for nuclei. We extracted DNA from the nuclei-enriched fraction using phenol:chloroform extraction. Working with fresh leaf tissue was essential for us; we didn't have good results with frozen tissue. We were also very careful when handling genomic DNA to avoid shearing it. DNA integrity was checked by agarose gel, quality by NanoDrop readings, and quantity by PicoGreen. When working with cDNA we used the Bioanalyzer to test for possible RNA contamination and to check the size distribution.
We use a combination of agarose gels and Bioanalyzer runs. After fragmentation, each sample is purified on QIAquick or MinElute columns. We strongly recommend all samples that go on the Illumina Genome Analyzer sequencer are first run on the Agilent Bioanalyzer. This gives us quantitative information used to determine the dilution for plating. It also identifies components of the library that might interfere with the sequencing, for example, adapter dimers that yield useless reads. We provide guidelines for library construction (mostly based on Illumina's recommendations) for the number of PCR cycles, adapter amounts, and purification procedures designed to optimize amounts of "good" library DNA and minimize amounts of extraneous products like primer dimers and adapter dimers.
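As a rough illustration of how a Bioanalyzer reading can feed into a dilution, the sketch below converts a mass concentration and mean fragment size into molarity, then solves C1·V1 = C2·V2 for the stock volume. The 660 g/mol per base pair figure is a standard approximation for double-stranded DNA rather than a value from the text above, and the function names and any target concentration you pass in are hypothetical placeholders, not the facility's actual procedure.

```python
def library_molarity_nm(conc_ng_per_ul: float, mean_fragment_bp: float) -> float:
    """Convert a dsDNA concentration (ng/ul) to nanomolar.

    Assumes an average mass of 660 g/mol per base pair (approximation).
    """
    if conc_ng_per_ul < 0 or mean_fragment_bp <= 0:
        raise ValueError("concentration must be >= 0 and fragment size > 0")
    return conc_ng_per_ul * 1e6 / (660.0 * mean_fragment_bp)


def dilution_volumes(stock_nm: float, target_nm: float, final_volume_ul: float):
    """Return (stock_ul, diluent_ul) needed to reach target_nm, from C1*V1 = C2*V2."""
    if stock_nm <= 0 or target_nm <= 0 or final_volume_ul <= 0:
        raise ValueError("all inputs must be positive")
    if target_nm > stock_nm:
        raise ValueError("cannot dilute to a concentration above the stock")
    stock_ul = target_nm * final_volume_ul / stock_nm
    return stock_ul, final_volume_ul - stock_ul
```

For example, a library measured at 6.6 ng/µl with a 250 bp mean fragment size works out to about 40 nM under this approximation, and `dilution_volumes(40.0, 10.0, 20.0)` then gives the stock and diluent volumes for 20 µl at 10 nM.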
Next-generation sequencing platforms require tiny quantities of DNA for analysis. However, the processing of starting materials to produce the sequencing libraries involves quite a long series of steps, so the availability of adequate starting material is important. This is more challenging for our usual target, mRNA, than it is for DNA. We use the E.Z.N.A. Plant RNA Mini Kit (Omega Bio-tek) for mRNA purification, which we have found to work very reliably. We have recently been sequencing pooled genomic PCR products (using a barcoding procedure to enable attribution to the original sample), which produces plenty of input DNA for next-gen sequencing.
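The barcoding step mentioned above can be pictured with a minimal sketch that assigns each read to its original sample by its leading barcode. This is an illustration only: it assumes the barcode sits at the 5' end of the read, that all barcodes share one length, and that matching is exact. The function name, the barcode sequences, and the sample labels are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group reads by an exact-match 5' barcode.

    reads: iterable of read sequences (strings).
    barcode_to_sample: dict mapping barcode sequence -> sample name.
    Matched reads are stored with the barcode trimmed off; reads whose
    prefix matches no barcode are stored whole under "unassigned".
    """
    lengths = {len(barcode) for barcode in barcode_to_sample}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes barcodes of one common length")
    n = lengths.pop()

    bins = defaultdict(list)
    for read in reads:
        sample = barcode_to_sample.get(read[:n].upper())
        if sample is None:
            bins["unassigned"].append(read)
        else:
            bins[sample].append(read[n:])
    return dict(bins)
```

A production pipeline would normally also tolerate sequencing errors in the barcode, which this sketch deliberately leaves out.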
In our particular case, a de novo plant genome sequence, the amount of input DNA was not a limiting factor in constructing all the libraries and testing their quality. We always used young leaves and fresh tissue, which is what works best for us. Precipitation with ethanol for concentrating the sample worked well; we didn't use any commercial kit. When the input DNA is cDNA, reaching the amounts recommended by 454 can be problematic in some cases. However, amplification kits are available for when the amount of starting material is very limited.
We usually do not need to. Each species has its own peculiarities and usually the investigator knows of an adequate DNA extraction protocol. As long as we have a few micrograms of DNA, the sample prep succeeds.
For these two questions, we work mostly on de novo genome sequencing, so we request high-quality DNA preps. We try to avoid whole genome amplification for re-sequencing purposes, as this led to biases in representation. We also observed that degraded DNA frequently led to biases in sequencing representation, so we test most samples on gels prior to library construction.
To date we have used exclusively the Illumina (Solexa) platform, for which protocols are very much kit-based. Provided the manufacturer's instructions are followed carefully and precisely, the output data are consistently of a high standard. Optimization of sequence yield per instrument run requires an initial single-lane analysis of each new sequencing library, in order to optimize the quantity of material loaded. The quality of the input mRNA used for library construction is critical to the quality of the output sequences. We verify this by using an Agilent 2100 Bioanalyzer, and proceed only with RNA samples where the RNA Integrity Number (RIN) value is greater than 8.
Many parameters matter between DNA sample preparation and the sequencing reaction. Among them, the quality of the single-stranded library in the 454 system is essential, as are using the right DNA copy-per-bead ratio in the emulsion PCR and the subsequent handling of the DNA beads. For the Titanium shotgun runs performed in a de novo plant genome sequencing project, we used the same library for several runs and obtained different run outputs. We improved the system in collaboration with the 454 support team until runs with greater than 400 Mb of sequence were obtained regularly. Not overloading the sequencing plate has given us longer reads. Although this means a small decrease in the total number of reads per run, read length is an essential parameter for a proper genome assembly.
All instrument setup procedures are on a checklist that is followed for each run. Metrics at every step of the procedure are recorded and compared to ensure instrument reproducibility. We still run a control lane on every flow cell, and the mutation rate and percent align metrics reported by the pipeline are documented and compared run-to-run. Deviations from the expected average values are investigated and, if necessary, reported to Illumina for service or technical support so the next run is back to normal.
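One simple way to picture the run-to-run comparison described above is to flag any control-lane metric that falls outside a band around its historical mean. The sketch below is an assumption-laden illustration, not the facility's actual procedure: the two-standard-deviation threshold, the function name, and the metric keys are all hypothetical.

```python
from statistics import mean, stdev


def flag_deviations(history, current, n_sd=2.0):
    """Flag metrics in `current` that deviate from their historical mean.

    history: dict of metric name -> list of values from previous runs.
    current: dict of metric name -> value from the latest run.
    Returns dict of metric name -> (value, historical mean, historical sd)
    for metrics further than n_sd standard deviations from the mean.
    Metrics with fewer than two historical values are skipped.
    """
    flagged = {}
    for name, past in history.items():
        if name not in current or len(past) < 2:
            continue
        mu, sd = mean(past), stdev(past)
        if abs(current[name] - mu) > n_sd * sd:
            flagged[name] = (current[name], mu, sd)
    return flagged
```

Calling it with a history of control-lane values keyed by, say, `"percent_align"` and `"mutation_rate"` returns only the metrics worth investigating before the next run.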
We regularly test new reagents by re-sequencing a finished bacterial genome whose sequence is known. All new protocols, and new versions of DNA sequencers, are tested with the same procedure.
For alignment of next-gen sequencing reads to reference sequences, we have so far used an open-source tool, Maq, although code development by the author has now ceased in favor of BWA, which implements the emerging standard SAM alignment format. As genome sequences are not yet available for the species we study, we developed our own approach to SNP discovery. This uses unigenes assembled from public ESTs, which served as our reference sequences, and Perl-scripted analyses of the outputs of the alignments in order to identify genuine SNPs between cultivars, as distinct from inter-homoeolog polymorphisms (we work principally with polyploid crops). De novo assembly has therefore not been a requirement for us, although we have successfully used Velvet to assemble the un-aligned reads in order to generate adjuncts to the unigene-based reference sequences.
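To make the distinction between cultivar SNPs and inter-homoeolog polymorphisms concrete, here is a deliberately simplified sketch. It is not the Perl analysis described above, whose details are not given here. It assumes the bases aligned at one reference position have already been collected for each cultivar, and applies a simple rule: if both cultivars show the same mixture of bases, the site is treated as an inter-homoeolog polymorphism; if the sets of bases differ, it is treated as a candidate SNP between cultivars. The depth and frequency thresholds and all names are hypothetical.

```python
from collections import Counter


def called_bases(pileup, min_fraction=0.2, min_depth=4):
    """Return the set of bases each seen in at least min_fraction of reads.

    pileup: string of bases aligned at one position, e.g. "AAAAGGGG".
    Returns an empty set when depth is below min_depth (no call).
    """
    depth = len(pileup)
    if depth < min_depth:
        return frozenset()
    counts = Counter(pileup.upper())
    return frozenset(
        base for base, count in counts.items()
        if base in "ACGT" and count / depth >= min_fraction
    )


def classify_site(pileup_a, pileup_b):
    """Classify one reference position given pileups from two cultivars."""
    bases_a, bases_b = called_bases(pileup_a), called_bases(pileup_b)
    if not bases_a or not bases_b:
        return "no_call"
    if bases_a == bases_b:
        # Same bases in both cultivars: either invariant, or a mixture
        # shared by both, as expected when homoeologous copies differ.
        return "inter_homoeolog" if len(bases_a) > 1 else "monomorphic"
    return "cultivar_snp"
```

Under this rule, `classify_site("AAAAGGGG", "AAGGAGGA")` is reported as an inter-homoeolog polymorphism, while `classify_site("AAAAGGGG", "AAAAAAAA")` is reported as a candidate SNP between cultivars.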
We regularly use the 454 Newbler Assembler, as we are dealing with the de novo assembly of a 450-Mbp plant genome. Other tools have been described, and we may try them later in order to compare their performance with Newbler's. For a proper genome assembly it has been essential to use paired-end libraries of different fragment sizes (3, 8, and 20 kb). The availability of good physical and genetic maps of the sequenced genome also helps a lot in the assembly process. BAC end sequences are also useful.
We do de novo assembly using Velvet and CLC Genomics Workbench. These programs use different algorithms and result in assemblies that complement each other. Velvet is currently the better of the two for analyzing large datasets. We are eagerly waiting for the updated version of the CLC assembler that will process more reads and incorporate the use of paired-end information for de novo assemblies. We use PCAP/CAP3 and Newbler for 454 sequences. Filtering to select for high-quality reads is key to successful analysis and assembly. We have custom scripts to process short reads prior to analysis (http://code.google.com/p/atgc-illumina/). Alignments to the reference sequences are usually done with CLC and Maq, and sometimes simply by BLASTN searches of reference sequences versus short reads. We use custom BLAST parsers to study alignments to a reference sequence in great detail; however, they do not scale well for large projects.
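As an illustration of the kind of read filtering mentioned above, the following sketch keeps only FASTQ reads that pass a mean-quality threshold and contain no ambiguous bases. It is not the custom scripts referenced above. It assumes four-line FASTQ records, and the quality offset is a parameter because it depends on the pipeline version (33 for Sanger-style encoding, 64 for older Illumina output). The thresholds and function names are hypothetical.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "").rstrip("\n")
        next(it, "")  # '+' separator line, ignored
        qual = next(it, "").rstrip("\n")
        yield header[1:], seq, qual


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string for the given ASCII offset."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def filter_reads(records, min_mean_q=20.0, max_n=0, offset=33):
    """Yield records with at most max_n 'N' bases and mean quality >= min_mean_q."""
    for name, seq, qual in records:
        if not qual or len(qual) != len(seq):
            continue  # truncated or malformed record
        if seq.upper().count("N") > max_n:
            continue
        if mean_quality(qual, offset) < min_mean_q:
            continue
        yield name, seq, qual
```

Because both functions are generators, a file handle can be passed straight in, for example `filter_reads(read_fastq(open(path)))`, without loading the whole run into memory.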
When using 454 data, we assemble with Newbler 2 or Celera Assembler. Short reads are assembled with either Velvet or SOAP. We found that for a given genome, the best assembly may sometimes be obtained with one or another assembler, so we routinely use more than one method and then select the best one for the particular project.
For alignment, we are using either gsMapper (for 454 data) or SoapSNP (for Illumina data).
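When several assemblers are run on the same data, as described above, some summary statistic is needed to compare the results. The sketch below uses contig N50, a widely used contiguity measure. This is an assumption for illustration: the text does not say which criteria are used to select the best assembly, and N50 alone says nothing about correctness. The function names and the assembler labels in the usage note are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a set of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    Returns 0 for an empty assembly.
    """
    ordered = sorted(contig_lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return 0


def best_by_n50(assemblies):
    """Return the name of the assembly with the highest N50.

    assemblies: dict of assembly name -> list of contig lengths.
    """
    if not assemblies:
        raise ValueError("no assemblies to compare")
    return max(assemblies, key=lambda name: n50(assemblies[name]))
```

For instance, `best_by_n50({"velvet": velvet_lengths, "soap": soap_lengths})` returns the label of the more contiguous assembly, which can then be checked further against maps or known sequences.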
This has not been a primary objective for our use of next-gen sequencing data. However, in a recent collaboration, we have demonstrated that an annotation pipeline that we developed for use with finished BAC sequences (http://brassica.bbsrc.ac.uk/about_jic_annotate.html) scales well to megabase assemblies derived from NGS data.
We do not do this in our lab. Here we will collaborate with a research team that is expert in genome annotation, mainly using the 'ab initio' prediction program GeneID.
We have not yet been annotating whole genome assemblies for large genomes. To validate assemblies and obtain preliminary annotation we use BLAST searches against GenBank RefSeq. For gene prediction in microbial genomes, we have been using Glimmer. We use MAKER for whole genome annotation of intermediate-sized genomes. To find intron-exon gene structures we align transcriptome sequences to genomic assemblies. Alternative splicing and the dissection of gene families are studied by re-alignment of raw Illumina reads to assembled contigs.
We annotate eukaryote genomes using the GAZE software. This creates gene models by reconciling data from three different sources: de novo predictions (using SNAP and GeneID); transcriptome data generated by NSTs; and data on public proteomes using GeneWise predictions. For all comparative analyses, we perform bi-directional best match searches, and then use the results to reconstruct synteny blocks or gene family histories. These data are then functionally interpreted by comparison with InterPro domain content.
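The bi-directional best match search mentioned above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: it assumes similarity hits have already been computed in both directions and reduced to (query, subject, score) tuples, with higher scores being better. The function names and gene identifiers are hypothetical.

```python
def best_hits(hits):
    """Map each query to its single highest-scoring subject.

    hits: iterable of (query, subject, score) tuples.
    Ties are resolved in favor of the hit seen first.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted(
        (a, b) for a, b in forward.items() if reverse.get(b) == a
    )
```

The resulting pairs are the usual starting point for building synteny blocks, since each gene appears in at most one pair.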
There are many visualization tools for short-read alignments currently available. Our feeling is that there is no perfect solution as yet. However, we have found that Maqview generally matches our needs well, offering high speed access to defined regions coupled with a good visual display. A shortcoming is its use of OpenGL for rendering, as not every Windows X-server application fully supports OpenGL and this can cause difficulties for PC access to the 64-bit Linux platforms that we use for heavy computation tasks. Looking forward, the SAMtools suite includes a visualizer for SAM output.
— Ian Bancroft
We do not do this in our lab. In the de novo plant sequencing project, we will create a structured database capable of integrating all the data generated and we will build an easy-to-use web interface capable of doing complex queries and data mining.
We use publicly available tools such as the UCSC Genome Browser, GBrowse, EagleView, and Consed. Commercial packages such as CLC and Newbler have their own visualization tools. In addition, we have custom scripts for detailed representation and analysis.
Genome assemblies and annotations are visualized using GBrowse. For synteny comparisons, visualization is performed with Circos.
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60. Epub 2009 May 18.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. Epub 2009 Jun 8.
Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008 Nov;18(11):1851-8. Epub 2008 Aug 19.
Trick M, Cheung F, Drou N, Fraser F, Lobenhofer EK, Hurban P, Magusin A, Town CD, Bancroft I. A newly-developed community microarray resource for transcriptome profiling in Brassica species enables the confirmation of Brassica-specific expressed sequences. BMC Plant Biol. 2009 May 8;9:50.
Trick M, Long Y, Meng J, Bancroft I. Single nucleotide polymorphism (SNP) discovery in the polyploid Brassica napus using Solexa transcriptome sequencing. Plant Biotechnol J. 2009 May;7(4):334-46. Epub 2009 Jan 21.
Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008 May;18(5):821-9. Epub 2008 Mar 18.
UCSC Genome Browser