Genotyping Technical Guide

Table of Contents

Letter from the Editor
Index of Experts
Q1: What is your genotyping platform of choice and why?
Q2: How do you assess sample quality and quantity?
Q3: What data analysis tools do you use to call a variant?
Q4: What do you consider acceptable completion and concordance rates?
Q5: How do you track samples?
Q6: What internal and external controls do you use and what error rates do you consider acceptable?
List of Resources

Letter from the Editor

With each passing technical guide on the matter, it seems that genotyping is becoming more and more commonplace. So many companies now offer genotyping that it can seem as if everyone is getting into it. Those of you in the trenches know that it's not as easy as it sounds. In this latest installment of Genome Technology's technical guide series, our stalwart group of experts discusses how they go about tackling their genotyping projects. They discuss what platforms they use and why, as well as how they keep completion, concordance, and error rates to acceptable levels, to make calls they can trust.

Grab your highlighter. This technical guide is chock-full of nuggets that you'll want to remember. As always, don't forget to check out the reference guide in the back to see where our experts turn when they themselves get stuck. Many thanks to our experts for taking the time to contribute to this technical guide.

Ciara Curtin

Index of Experts

Genome Technology would like to thank the following contributors for taking the time to respond to the questions in this tech guide.

Jiang Li
DNA Genotyping Scientist
Hartwell Center for Bioinformatics and Biotechnology
St. Jude Children's Research Hospital

Lee Murphy
Laboratory Manager
RIE Clinical Research Facility and the Wellcome Trust Clinical Research Facility
University of Edinburgh

Jenny Pansceau
Technician/Research Associate
Ohio State University

Mary Lou Shane
Senior Lab Technician
Vermont Cancer Center DNA Analysis Facility
University of Vermont

Kevin Shianna
Director
IGSP Genotyping Facility
Duke University

Jeremy Taylor
Professor, Animal Genomics
University of Missouri

Q1: What is your genotyping platform of choice and why?

In our core facility, due to the sample volume and efficiency, we choose the TaqMan assay and the ABI DNA Analyzer. This sequencer can actually do a lot of things, not just sequencing. It can also do SNP genotyping, fragment size analysis, RFLP, and short tandem repeats. In the core facility, [I don't] just do a lot of sequencing; I also do a lot of genotyping. You don't need extra investment in new equipment. You can just go ahead and do sequencing analysis and genotyping at the same time. [It's] very diverse, low-cost, and a fit for most projects, from small ones to bigger ones. It's very flexible. That's why we choose it.
—Jiang Li

The Wellcome Trust Clinical Research Facility provides genetic support for a wide range of clinical projects from around the UK. In selecting our genotyping platforms, the most important criteria are accuracy and reliability. We also want platforms that are flexible, so that we can support projects with a range of SNPs and sample sizes, as well as offer other services such as gene expression. We currently use two different genotyping platforms that work well together: TaqMan assays on the Applied Biosystems 7900HT and the Illumina BeadStation. TaqMan is particularly good for low SNP numbers and low to high sample numbers. Once a project requires more than 25 to 50 SNPs, it is more economical to use the Illumina platform.
— Lee Murphy

My preferred genotyping platform is fluorescent-based capillary electrophoresis. Our facility is equipped with a 3730 DNA Analyzer (Applied Biosystems) that is used for multiple purposes, such as microsatellite typing, DNA sequencing, AFLP, and single-base extension assays, i.e. SNaPshot. Although I am biased because my experience has been limited to electrophoresis systems, I feel that CE is currently the gold standard for covering a multitude of researcher needs and ranges of throughput. Questionable data obtained from larger-throughput systems are often reviewed and confirmed through CE.
— Jenny Pansceau

We are a core DNA analysis facility and are able to offer genotyping to our investigators on a variety of platforms: genotyping with microsatellite markers, SNP detection using the SNaPshot method, and AFLP or T-RFLP analyses are run on our ABI Prism 3100 Avant. Genotyping using real-time qPCR, such as higher-throughput SNP detection and transgenic mouse genotyping, is done on our ABI Prism 7900HT Sequence Detection System.
— Mary Lou Shane

For our projects requiring high-throughput genotyping, we use the Illumina genotyping technologies (GoldenGate, Veracode, Infinium HD). The main reason for this choice was the quality and reproducibility of the data.
—Kevin Shianna

We use the Illumina BovineSNP50 BeadChip Infinium assay for all of our high-density SNP genotyping in cattle. The BeadStation scanner is located in the University of Missouri DNA Core where it is broadly used for genotyping and expression analysis. Our lab is a member of the iBMAC (Illumina, USDA ARS Beltsville, University of Missouri, University of Alberta, and USDA ARS Clay Center) consortium that performed the SNP discovery experiment and assay design project that led to the production of the Bovine SNP50 BeadChip. We chose Illumina to manufacture the assay because of the simplicity of their chemistry, robustness of the Infinium assay, and very high quality of produced genotypes both in terms of completeness and low error rate.
— Jeremy Taylor

Q2: How do you assess sample quality and quantity?

Usually, most of the samples we get here at St. Jude are clinical samples. A lot of them are diverse in quality; they may be partially degraded already or may have a lot of contaminants. We require at least 60 nanograms to begin with. Some assays may work in the 10 nanogram [range] or even lower, but in order to get very good quality data, I set it at 50 nanograms. That's not a big issue for most projects. If the customer requires multiple assays, like they want to interrogate 600 or 100 SNPs, we have to do whole-genome amplification. [We use] the Agilent Bioanalyzer to do the QC, quality and quantification, of the gene target. We're not going to run every sample through the Bioanalyzer because it costs a lot; we randomly choose samples from each plate.
— Jiang Li

After DNA extraction, all samples are measured for yield using PicoGreen, with a subset of samples run on a gel and their OD ratios measured on a NanoDrop. Accurate quantification and normalization of samples are particularly important for Illumina GoldenGate and TaqMan assays. Fragment size is important for Illumina Infinium chemistry and should be at least 2 kilobases, while Illumina GoldenGate and TaqMan can cope with much smaller fragment sizes, 200 to 500 base pairs.
— Lee Murphy

The researchers typically include quality controls in each run and quantify their samples before submitting them. For the few full service projects that are active in our facility, the samples are quantified by fluorometer or by running the samples on a high-resolution agarose gel and observing the bands. For PCR, we include a Centre d'Etude du Polymorphisme Humain sample of a known concentration and a no-template control per batch of samples to assess the quality and consistency of the amplifications. We also include normal DNA control(s) and mutants (if applicable), which we typically obtain from the researchers.
— Jenny Pansceau

Sample quality is generally assessed using the NanoDrop. This gives an accurate quantification of DNA between 2.7 ng/µl and 3.7 µg/µl. We look for an acceptable 260:280 ratio of 1.8 to 2.0, and a 260:230 ratio greater than 1.5. Viewing the absorbance trace can help identify whether salts or other contaminants are present that might interfere with downstream processes.
— Mary Lou Shane
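
Shane's acceptance criteria lend themselves to a quick programmatic check. The sketch below is illustrative only: the function name is ours, and the thresholds simply restate the figures quoted above, not the facility's actual pipeline.

```python
def passes_qc(a260, a280, a230):
    """Check NanoDrop absorbance readings against the acceptance
    criteria described above (illustrative thresholds)."""
    r_260_280 = a260 / a280
    r_260_230 = a260 / a230
    # Acceptable purity: 260:280 between 1.8 and 2.0,
    # and 260:230 greater than 1.5.
    return 1.8 <= r_260_280 <= 2.0 and r_260_230 > 1.5

print(passes_qc(1.90, 1.00, 1.20))  # True  (ratios 1.90 and ~1.58)
print(passes_qc(1.90, 1.00, 1.40))  # False (260:230 of ~1.36 is too low)
```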

To quantify the level of dsDNA in a sample, we run the PicoGreen assay from Invitrogen. We use this assay in more of a qualitative fashion and don't establish exact quantitative values for each sample. For example, we have established an acceptable threshold of relative fluorescence units when using the PicoGreen assay. If a sample is above this threshold, then it is ready for genotyping. If a sample is below the threshold, we will attempt to concentrate it. If concentrating the sample doesn't work, then the sample will not be processed.

We don't actively assess sample quality. The main reason for this is the time and cost to gain any useful information in deciding the quality of a sample. In a high-throughput lab, it is actually more cost-effective to have a slightly elevated failure level than it is to spend the effort to assess quality on every sample.
— Kevin Shianna

We have run over 12,000 samples on this assay, with DNA extracted from semen, blood, nasal swabs, and muscle. All of the extractions were quantitated with either a UV or NanoDrop spectrophotometer, and we typically shoot for a 260/280 ratio between 1.8 and 2.0. Our experience with the Infinium assay suggests that it is much more sensitive to under-loading than over-loading DNA. Since sample amounts are usually not a problem for us, we typically use 300 nanograms (200 nanograms recommended) of DNA per sample to allow for the possibility of error in the estimation of DNA concentration; for robustness, it is better to have too much DNA than too little. This process works very well for us and we very rarely see reactions fail due to low call rates.
— Jeremy Taylor

Q3: What data analysis tools do you use to call a variant?

I am using GeneMapper 4.0 with the ABI 3730xl for fragment size analysis. It generates peak height, peak area, and location, but [you] still need some packaged software to do the secondary analysis; it just gives you the raw data and gives you a call. For SNP genotyping, I want to emphasize that we like SNaPshot better than SNPlex. The reason is that for SNaPshot, interrogating each SNP requires three oligos, and for SNPlex it's two oligos. If you choose the SNPlex platform and use all the oligos from the company, it's very expensive; most of the fluorescently labeled oligos have to be HPLC-purified. Also, in the climate here, we want it to be as cheap as possible.
— Jiang Li

We use the software that comes packaged with the system. For TaqMan this is the Sequence Detection System version 2.3 and for Illumina it is BeadStudio version 3.1.3 with version 3.3.7 of the genotyping module.
— Lee Murphy

Our facility typically processes thousands of DNA sequencing and fragment analysis samples per week. The in-depth genotyping analysis that we provide is limited to fluorescent fragment analysis with the GeneMapper software from Applied Biosystems. The GeneMapper program is capable of detecting variants and providing quality scores for the variant peaks detected by the 3730 DNA Analyzer. In addition, most sample data, with the exception of AFLP/T-RFLP, are individually reviewed by an expert member of our facility. The GeneMapper software alerts the researchers to manual changes to the size standards and/or peak calls made during the data review in our facility. We also perform custom in-depth T-RFLP analysis that we specifically designed for individual researchers.

We apply a general set of measures to assess per-sample genotype quality, but the threshold for calling variants is often tailored to the respective project after we conduct an extensive consultation with the researchers. By default we err on the side of inclusiveness and prefer to retain data that may contain false positives. The emphasis is placed on the researchers to make the final decision about allele and base calls and to track the consistency of known control samples from run to run where applicable.
—Jenny Pansceau

We use GeneMapper for analyzing microsatellite markers, SNPs typed with the AB SNaPshot system, and AFLP or T-RFLP samples. Real-time analysis is done with SDS 2.2 software.
— Mary Lou Shane

We use the Illumina supplied BeadStudio software for SNP calling. For CNV analyses, we use the PennCNV software followed by visual inspection using BeadStudio's Genome Viewer.
— Kevin Shianna

BeadStudio exclusively. Because we contributed to the development of the BovineSNP50 assay, we developed our own in-house custom cluster file for calling genotypes, based on over 8,000 samples genotyped from various breeds of Bos taurus cattle. This has allowed us to have a "base" cluster file that performs very well across most of the cattle breeds that we genotype. Our cluster file was developed for breeds that originated from the Fertile Crescent, and we have found that we can significantly improve our genotype call rates and quality for breeds originating on the Indian subcontinent by developing a cluster file specifically for these breeds. This is quite a bit of work, but is necessary if you want to ensure the highest-quality genotypes.
— Jeremy Taylor

Q4: What do you consider acceptable completion and concordance rates?

Since I use SNaPshot, I have never failed an assay, even in difficult regions: GC-rich regions, or regions that have multiple SNPs close together. For an extension assay, that matters, because you need oligos to bind to the template, and if that area has SNPs near the interrogated site, extension can sometimes be difficult. But there's been no problem for me: with SNaPshot you do the PCR and then use oligos to do the extension. This technology is very sensitive.

In early papers, most people liked PCR-RFLP. A lot of clinicians bring us a paper; they want to do that. They want us to reproduce the same assay here by interrogating some SNPs. It takes a longer time and it gives you a lot of variability. That's why we find some of the RFLP assays [are] not consistent with our SNaPshot genotyping assays. We also verify microarrays, confirming SNPs identified by the microarray, and we do find some regions which cannot be verified or which make a different call. We believe SNaPshot; if we cannot confirm it, I will direct my customers to believe in SNaPshot. This is standard because the Affymetrix platform is hybridization-based; it relies on the hybridization signal, which is variable. That's not accurate enough for some particular SNPs.
— Jiang Li

The acceptable completion and concordance rates will depend on the study design and also on the nature of the SNPs chosen. Increased accuracy can be achieved by setting the threshold for calling genotypes high, but this can lead to a low call rate for some SNPs, so that a number of hits are discarded, and can also lead to informative missingness, where a spurious association occurs due to non-random differences in the pattern of missing data. If a lower stringency is used, then call rates will be preserved at the expense of accuracy, leading to some poor-performing SNPs showing association.

We would typically expect pass rates of 90 to 95 percent for Illumina GoldenGate, greater than 95 percent for TaqMan, and higher than 99 percent for Illumina Infinium. We would expect greater than 99 percent concordance for both platforms.
— Lee Murphy

An investigator using our facility says that completion rates should be high, around 98 percent. You also want high concordance rates, from 97 to 99 percent, but they will never be perfect.
— Mary Lou Shane

A sample failure rate of between one and three percent across a large project is acceptable. We don't determine concordance rates for duplicates until after strict curation of sample quality and the SNP data. Therefore, the duplicate concordance rate is always around 99.9999 percent. When you identify the SNPs that are discordant, it is almost always because a SNP for one of the samples in the duplicate pair is not being called rather than being miscalled.
— Kevin Shianna
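
Shianna's duplicate-concordance practice, where a no-call is treated as missing data rather than a miscall, can be sketched as below. The helper function and genotype encoding are hypothetical, not the facility's actual code.

```python
def duplicate_concordance(calls_a, calls_b, no_call="NC"):
    """Concordance between two genotyping runs of the same sample.
    SNPs that failed to call in either run are excluded from the
    comparison rather than counted as discordant."""
    compared = matches = 0
    for a, b in zip(calls_a, calls_b):
        if no_call in (a, b):
            continue  # missing data point, not a miscall
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else None

run1 = ["AA", "AG", "GG", "NC", "AG"]
run2 = ["AA", "AG", "GG", "GG", "AG"]
print(duplicate_concordance(run1, run2))  # 1.0 (the NC pair is skipped)
```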

In general, we don't have a completion spec that we shoot for. In the approximately 12,000 cattle samples that we have run, the average call rate has been 98.86 percent and only a handful have low (less than 90 percent) call rates. These samples simply get filtered out from the downstream analyses rather than being put back in the queue to be rerun.
— Jeremy Taylor
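
The call-rate filter Taylor describes, dropping rather than rerunning samples below 90 percent, can be sketched as follows. The sample names, genotype encoding, and no-call symbol are invented for illustration.

```python
def call_rate(genotypes, no_call="--"):
    """Fraction of SNPs with a successful genotype call."""
    called = sum(1 for g in genotypes if g != no_call)
    return called / len(genotypes)

def filter_low_call_rate(samples, threshold=0.90):
    """Keep only samples whose call rate meets the threshold; per the
    practice above, failures are filtered out of downstream analyses,
    not rerun."""
    return {sid: geno for sid, geno in samples.items()
            if call_rate(geno) >= threshold}

samples = {
    "animal_001": ["AA", "AB", "BB", "AA", "AB", "BB", "AA", "AB", "BB", "AA"],
    "animal_002": ["AA", "--", "--", "AA", "AB", "BB", "AA", "AB", "BB", "AA"],
}
print(sorted(filter_low_call_rate(samples)))  # ['animal_001'] (animal_002 is at 0.8)
```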

Q5: How do you track samples?

We have [a] shared resources management system, which was designed by our Hartwell Center informatics team. It's very user-friendly. We can track samples, and the clients, our PIs, can send me their samples with this system.
— Jiang Li

Samples are tracked by a laboratory information management system, which tracks sample locations on plates and also produces barcode labels. We use liquid handling robots where possible to limit pipette error and this is aided by using robust standard operating procedures and, importantly, having well-trained and motivated staff.
— Lee Murphy

The samples are tracked electronically through the dnaLIMS database. The database also serves as a data storage and access hub for our clients. The typical sample flow begins when the researcher initiates an online order. The database generates an order number for each batch of samples (up to 96 per order) that the researcher enters or uploads into dnaLIMS. The database also generates separate request numbers for each sample that is part of the respective order. Each order arrives with a hard copy of the order form and the samples are labeled with at least two identifiers. The identifiers are typically the order number, researcher's last name, and the date. The facility processes the order accordingly, analyzes and reviews the data, and uploads them into dnaLIMS.
— Jenny Pansceau

Our facility uses a Web-based platform known as the UVM BioDesktop for placing orders and data distribution. Investigators place their orders online and are notified by e-mail when their data are complete. Investigators can then log in to their BioDesktop account and download the data files to their own computer. The BioDesktop maintains a permanent backup of their data.
— Mary Lou Shane

We have set up an internal tracking system which we call a "manual LIMS" system where samples are tracked through every step of the process. This is accomplished by using step-by-step tracking sheets that we developed in Excel. This manual system allows us to capture all necessary information on the tracking sheet simply by using a barcode reader.
— Kevin Shianna

The sender's identification and the concomitant information provided with the sample (breed, gender, etc.) are recorded in a database, and each sample is assigned our own unique six-digit internal ID, which becomes the primary sample identifier. When we generate the genotyping sample sheet, we append four additional digits to generate a 10-digit ID that also contains breed and replicate information. Thus we are able to store sample ID, breed, and replicate within a single integer that only requires 4 bytes of storage in the database.
— Jeremy Taylor
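
Taylor's packed sample ID can be illustrated with a hypothetical digit layout; he does not say which of the four appended digits carry breed versus replicate, so the split below (three digits of breed code, one of replicate) is an assumption.

```python
def pack_id(sample_id, breed, replicate):
    """Fold sample ID, breed, and replicate into one 10-digit integer
    (hypothetical layout: 6 digits ID, 3 digits breed, 1 digit replicate)."""
    assert 0 <= sample_id <= 999_999
    assert 0 <= breed <= 999 and 0 <= replicate <= 9
    return sample_id * 10_000 + breed * 10 + replicate

def unpack_id(packed):
    """Recover the three fields from a packed 10-digit ID."""
    sample_id, rest = divmod(packed, 10_000)
    breed, replicate = divmod(rest, 10)
    return sample_id, breed, replicate

packed = pack_id(123456, 42, 7)
print(packed)             # 1234560427
print(unpack_id(packed))  # (123456, 42, 7)
```

One caveat on the 4-byte claim: the largest 10-digit value (9,999,999,999) exceeds the unsigned 32-bit maximum of 4,294,967,295, so whether every packed ID fits in 4 bytes depends on the sample-ID range in use; a 64-bit column is the safe choice if it does not.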

Q6: What internal and external controls do you use and what error rates do you consider acceptable?

Internal and external controls are very important. Even though we are not CLIA-certified, we do follow very stringent rules, at least I do, especially because there are a lot of multiplexing assays, and those assays require higher accuracy. If you don't verify them, you could just pick the wrong thing. For each individual 96-well plate, we tell our customers, we use two wells as our internal controls, and we also ask the customer to provide their own control. It can be blinded, so they know the genotypes and they don't tell us; we give them the results and let them decide whether it's accurate or not. Our internal control is also a genetic control, because for most new SNPs, even the clinicians don't know the genotypes. We cannot provide any positive controls, so what we do is check Hardy-Weinberg equilibrium. Sometimes we can do sequencing, if we're really worried about it. Error rate actually is not a big issue, but fail rate is the most concerning for genotyping, because a lot of samples may be of bad quality, or the assay may not work for some samples. For error rates, I would rather target a zero error rate, because if you tell them you provide 99 percent accuracy, the clinicians don't like it. They want 100 percent accuracy.
— Jiang Li
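
Li's Hardy-Weinberg check, used when no positive control is available, amounts to comparing observed genotype counts against the counts expected from the allele frequencies. A minimal sketch (a chi-square test with one degree of freedom; values above 3.84 correspond to p < 0.05):

```python
def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic for departure from Hardy-Weinberg
    equilibrium, given observed genotype counts for one SNP."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1 - p
    observed = (n_AA, n_Aa, n_aa)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(hwe_chi_square(25, 50, 25))  # 0.0 (exactly the expected proportions)
print(hwe_chi_square(50, 0, 50))   # 100.0 (no heterozygotes: a red flag)
```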

The controls we run are dependent upon the platform and are used to confirm the protocol is working correctly. Researchers are able to select their own controls depending on the study design, such as duplicated samples to assess reproducibility and assay error rates. Family members can also be used to check for Mendelian consistency. In addition, researchers can select SNPs that have previously been genotyped on the same samples using a different platform.

For TaqMan assays, we run no-template controls at the start of the plate to help determine the clusters and check for contamination. This also confirms the plate was run in the correct orientation. For TaqMan custom assays, we first test on a plate of European Collection of Cell Cultures DNA and the researcher's DNA to check the clustering.

Illumina assays have internal quality control standards built into every array to assess how well each stage of the genotyping workflow has performed. Gender estimates can also be used to check that plates have been run in the correct order and orientation.
— Lee Murphy

The researchers include quality controls with each batch of ready-to-run samples submitted for fragment analysis. Our facility runs in-house quality control samples for DNA sequencing, but this is seldom necessary for fragment analysis due to the limited scope of our facility's involvement in the majority of fragment analysis projects. Because we have processed so many samples requiring different types of analyses (microsatellites, SNaPshot, etc.), we use this data as a reference point to assess the relative quality and reproducibility of each run or batch of samples. We have representative data from different species, tissues, and nucleic acid isolation techniques, which may influence expected genotype quality and success rates on a subjective basis. In addition, the statistics and appearance of the size standards within the electropherograms are often sufficient internal quality control measures.
— Jenny Pansceau

The types of controls used will depend on the type of genotyping being performed, but would generally include a sample of each genotype, a no-template control, and a no-amplification control. Samples that fall outside of the clusters can be sequenced to try to determine the genotype. All samples run on the 3100 Avant have an internal size standard added. Samples are automatically repeated if there is any problem with the size standard. Microsatellites are generally run with a control DNA that has known fragment sizes. An investigator using the facility looks for error rates around one to two percent.
— Mary Lou Shane

When we first started high-throughput processing, we added one Centre d'Etude du Polymorphisme Humain trio control along with five percent duplicates per 96-well plate. However, we quickly learned that the Illumina genotyping assays were very robust and reproducible, so controls weren't needed to validate the genotyping assay.

To confirm that no samples have been mixed up during processing, we run at least three TaqMan assays using the identical sample plate from the original experiment. These TaqMan assays match SNPs that exist on the genotyping BeadChip. If the concordance isn't 100 percent, then we add more assays (up to 10) to track down any issues.
— Kevin Shianna

We do not have any internal controls as far as the genotyping workflow is concerned. The primary issue that we watch for is when more than one sample per chip, or an entire chip, fails during genotyping in BeadStudio. This is usually indicative of sample handling issues or, in the case of an entire chip failure, faulty chips (which are very rare). Missing data rates are quite small. For example, in a genome-wide association analysis performed on 1,720 individuals from the Angus breed, we filtered SNPs with minor allele frequencies less than 5 percent, leaving 41,028 loci; the resulting data set had less than 1.3 percent missing data. On the other hand, genotyping error rates are difficult to estimate because they require that you know the true genotypes. However, we have several layers of data filtering that we apply on data extraction, the first usually being a sample call rate of at least 90 percent. After that we generally check for Mendelian inheritance when we have genotyped parents and/or offspring of the sampled individual. With the BovineSNP50 BeadChip we find that a correct parent-child relationship will have less than 0.5 percent discordant genotypes, while an incorrect relationship will typically have a 2 to 7 percent discordance rate. When you have an incorrect pedigree relationship, it sticks out like a sore thumb.
— Jeremy Taylor
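
Taylor's Mendelian check can be approximated by counting opposite-homozygote conflicts between a putative parent and child, which are impossible under correct inheritance barring genotyping error. The genotype encoding below is invented for illustration; per the figures above, rates under 0.5 percent support the pedigree while 2 to 7 percent contradict it.

```python
def parent_child_discordance(parent, child, no_call="--"):
    """Fraction of compared SNPs where parent and child carry
    opposite homozygous genotypes (e.g. AA vs BB), which violates
    Mendelian inheritance. No-calls are skipped."""
    compared = conflicts = 0
    for p, c in zip(parent, child):
        if no_call in (p, c):
            continue
        compared += 1
        # Both homozygous, for different alleles: a Mendelian conflict.
        if p[0] == p[1] and c[0] == c[1] and p[0] != c[0]:
            conflicts += 1
    return conflicts / compared if compared else None

parent = ["AA", "AB", "BB", "AA", "--"]
child  = ["AB", "AA", "BB", "BB", "AA"]
print(parent_child_discordance(parent, child))  # 0.25 (one conflict in four)
```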

List of Resources

Publications

de Bakker PIW, Yelensky R, Pe'er I, Gabriel SB, Daly MJ, Altshuler D. (2005). Efficiency and power in genetic association studies. Nature Genetics. 37: 1217-1223.

Chanock SJ, Manolio T, Boehnke M, Boerwinkle E, Hunter DJ, Thomas G, Hirschhorn JM, Abecasis G, Altshuler D, Bailey-Wilson JE, Brooks LD, Cardon LR, Daly M, Donnelly P, Fraumeni Jr JF, Freimer NB, Gerhard DS, Gunter C, Guttmacher AE, Guyer MS, Harris EL, Hoh J, Hoover R, Kong CA, Merikangas KR, Morton CC, Palmer LJ, Phimister EG, Rice JP, Roberts J, Rotimi C, Tucker MA, Vogan KJ, Wacholder S, Wijsman EM, Winn DM, Collins FS, for the NCI-NHGRI Working Group on Replication in Association Studies. (2007). Replicating genotype–phenotype associations. Nature. 447: 655-660.

McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN. (2008). Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics. 9: 356-369.

Packer BR, Yeager M, Burdett L, Welch R, Beerman M, Qi L, Sicotte H, Staats B, Acharya M, Crenshaw A, Eckert A, Puri V, Gerhard DS, Chanock SJ. (2006). SNP500Cancer: a public resource for sequence validation, assay development, and frequency analysis for genetic variation in candidate genes. Nucleic Acids Research. 34: D617-D621.

Walley DC, Tripp BE, Song YC, Walley KR, Tebbutt SJ. (2006). MACGT: multi-dimensional automated clustering genotyping tool for analysis of microarray-based mini-sequencing data. Bioinformatics. 22(9):1147-1149.

Yang Y, Li SS, Chien J, Andriesen J, Zhao LP. (2008). A systematic search for SNPs/haplotypes associated with disease phenotypes using a haplotype-based stepwise procedure. BMC Genetics. 9:90.

Websites

GeneWindow:
http://genewindow.nci.nih.gov/Welcome

SNP500Cancer:
http://snp500cancer.nci.nih.gov/

TagZilla:
http://tagzilla.nci.nih.gov/