Genetic Variation Technical Guide

Table of Contents

Letter from the Editor
Index of Experts
Genotyping and Copy Number Variation: Greg Khitrov and Daniel Mirel
Array CGH: Wei Wang
Biomarkers: Wolfgang Sadee
List of Resources


Letter from the Editor

Variation is what makes us all different. No matter how much you are told you are like your parents or siblings, there is always something that makes you different, even down at the genetic level. Those changes may just mean that you are a sliver taller or fatter or have darker eyes, or increase your odds of cancer, or they may mean nothing at all.

Biologists have been studying variation for centuries. Gregor Mendel puttered around in the garden of his monastery, and wondered why some pea plants had wrinkled seeds and others smooth, and Charles Darwin marveled at the species he saw while traveling aboard the HMS Beagle.

As you use more modern technologies to search for genetic differences in peas, tortoises, humans, Arabidopsis, or whatever your organism of choice is, questions revolving around copy number variation, array comparative genomic hybridization, and biomarkers are bound to come up. Our experts in this technical guide tackle questions regarding calling CNVs, obtaining high-resolution data from array CGH, and just how to go about uncovering rare genetic biomarkers.

— Ciara Curtin

Index of Experts

Many thanks to our experts for taking the time to contribute to this technical guide, which would not be possible without them.

Greg Khitrov
Genomics Core Lab
Mount Sinai School of Medicine

Daniel Mirel
The Broad Institute Center for Genotyping And Analysis
Broad Institute of MIT & Harvard

Wolfgang Sadee
Program in Pharmacogenomics
The Ohio State University Medical Center

Wei Wang
Microarray Core Lab
Cornell University

Genotyping & Copy Number Variation: Greg Khitrov and Daniel Mirel

Greg Khitrov

Genome Technology: What is your genotyping platform of choice, and why?

Greg Khitrov: Affymetrix, since that's the technology that's available to us.

GT: How do you maximize the resolution of your CNVs?

GK: For sensitivity, we select minimal regions of five consecutive probes.

GT: What criteria do you use to call your CNVs? How do you minimize your false positive and negative calls?

GK: Different algorithms use different criteria. For instance, a hidden Markov model uses a probability of at least 80 percent. We use a segmentation approach, and the criteria we use are a p value of 0.01 and that the segment contains at least five consecutive probes. To minimize false positive calls, we apply multiple criteria to determine the CNV, such as p value, segment length, and magnitude of change. Since small regions tend to be easily identified compared to large regions, we weight the changes by the segment length in order to identify regions of various lengths, which minimizes false negative calls.
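
The criteria described above can be sketched as a simple filter over segments. This is a minimal illustration, not the core lab's actual pipeline: only the p value threshold of 0.01 and the five-probe minimum come from the answer above. The `Segment` fields, the 0.2 log2-ratio magnitude cutoff, and the square-root length weighting are assumptions made for the example.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List


@dataclass
class Segment:
    chrom: str
    start: int
    end: int
    n_probes: int           # consecutive probes in the segment
    mean_log2_ratio: float  # mean log2(sample / reference) over the segment
    p_value: float          # significance reported by the segmentation step


P_VALUE_MAX = 0.01   # threshold quoted in the interview
MIN_PROBES = 5       # minimum consecutive probes quoted in the interview
MIN_ABS_LOG2 = 0.2   # assumed magnitude cutoff, for illustration only


def weighted_score(seg: Segment) -> float:
    """Assumed weighting: magnitude of change scaled by sqrt(probe count),
    so long, modest changes are not outranked by short, sharp ones."""
    return abs(seg.mean_log2_ratio) * sqrt(seg.n_probes)


def call_cnvs(segments: List[Segment]) -> List[Segment]:
    """Keep segments that pass all three criteria, strongest score first."""
    calls = [
        seg for seg in segments
        if seg.p_value <= P_VALUE_MAX
        and seg.n_probes >= MIN_PROBES
        and abs(seg.mean_log2_ratio) >= MIN_ABS_LOG2
    ]
    return sorted(calls, key=weighted_score, reverse=True)
```

Requiring all three criteria at once is what limits false positives here; the length weighting only affects how the surviving calls are ranked.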

Daniel Mirel

Genome Technology: What is your genotyping platform of choice, and why?

Daniel Mirel: Here at the Broad Institute we do not have any one platform of choice. In fact, it is quite the opposite. We manage and run genotyping projects of a wide range of SNP number and a wide range of sample number. Perhaps this could be thought of as two dimensions of genotyping project space. Moreover, the SNP content can be a whole-genome (fixed) panel, or a custom SNP set, and the selection of platform depends on all these factors, as well as others. In general, we offer and stand behind one or two platform options at any particular point in genotyping project space. Indeed, the primary platform vendors we work with (Affymetrix, Illumina, Sequenom) each span a wide range of this space with their products.

GT: How do you maximize the resolution of your CNVs?

DM: It is important when talking about CNVs to make a distinction between the common, high-frequency polymorphisms that are well-tagged by SNPs (e.g., "CNPs") and rare or de novo insertions and deletions. For the former, where the locations on the genome are well defined (see for example Conrad et al., Nature, 2009), a small number of probes specific to each region are all that is needed. To maximize the resolution of the rare or de novo copy number loci, where the break points are not known a priori, it is necessary to have a high density of probes, as are present on the Affymetrix 6.0 or Infinium 660W or Omni1 arrays. I do not believe that one can state with great accuracy that a copy number event is of a particular size and location based on array results alone, but that one must further validate these findings with other technology, such as long-range PCR or re-sequencing.

GT: What criteria do you use to call your CNVs? How do you minimize your false positive and negative calls?

DM: In my opinion there is not yet enough "ground truth" upon which the various extant CNV calling algorithms can be validated and compared. I believe this type of data is now becoming available as a result of the 1000 Genomes Project, but even there, there is a circularity, in that the 1KG calling uses SNP and CNP data for calibration. A head-to-head comparison of CNV calling algorithms needs to be done, and I know that people in some of the GWAS consortia are organizing and planning for this evaluation. In the recently completed "bake off" for cross-comparing SNP imputation algorithms, the "test material" was analytical/categorical, e.g. called SNPs perhaps with call confidences associated with them. Here, any evaluation of CNV-calling algorithms that use whole-genome array data as input must acknowledge that the "test material" is experimental, e.g. actual scans that derive from DNA samples of varying quality and provenance. In that regard, measures of CNV calling replicability are required, as well as comparisons to "gold standard" results from which false-positive and false-negative performance metrics can be derived. To my knowledge, these types of metrics about test performance are not yet generally available.
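
To make the kind of metrics discussed above concrete, here is one way they could be computed once a reference call set exists. This is a hedged sketch, not a method from the interview: matching calls by 50 percent reciprocal overlap is a common convention assumed here, and the function names and interval representation are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), end exclusive


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Fraction of overlap relative to the longer of the two intervals."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def matches(call: Interval, others: List[Interval], min_ro: float = 0.5) -> bool:
    """True if the call reciprocally overlaps any interval in `others`."""
    return any(reciprocal_overlap(call, o) >= min_ro for o in others)


def evaluate(calls: List[Interval], gold: List[Interval],
             min_ro: float = 0.5) -> Dict[str, float]:
    """False-positive and false-negative style rates against a gold standard."""
    confirmed = sum(matches(c, gold, min_ro) for c in calls)
    recovered = sum(matches(g, calls, min_ro) for g in gold)
    return {
        # share of calls with no gold-standard counterpart
        "false_discovery_rate": 1 - confirmed / len(calls) if calls else 0.0,
        # share of gold-standard events that were missed
        "false_negative_rate": 1 - recovered / len(gold) if gold else 0.0,
    }


def replicate_concordance(rep_a: List[Interval], rep_b: List[Interval],
                          min_ro: float = 0.5) -> float:
    """Share of calls from two replicate runs that are seen in both."""
    total = len(rep_a) + len(rep_b)
    if total == 0:
        return 1.0
    shared = (sum(matches(c, rep_b, min_ro) for c in rep_a)
              + sum(matches(c, rep_a, min_ro) for c in rep_b))
    return shared / total
```

The replicate concordance function addresses replicability on its own, without any gold standard, which matters when the input scans vary in quality.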

Array CGH: Wei Wang

Genome Technology: How do you account for low-copy shared sequences?

Wei Wang: In array probe design, an effort can be made to search for probes of unique sequence in the vicinity of low-copy shared sequences. Sufficient mismatch is needed between the probe sequence and its match in the whole genome to specifically measure the copy number of this locus. In case this is not feasible, it is still possible to use computational methods to estimate the copy number change in low-copy shared sequences. Higher probe density in such regions can increase the sensitivity in detecting subtle changes in copy number, as more probes provide more accurate measurement of fluorescent signal in the regions and higher statistical power. Calibration of copy number change versus log ratio by titration (either internal chromosome copy number or external spike-in control) will be helpful in determining the magnitude of the copy number change.
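
The titration calibration mentioned above can be sketched as a straight-line fit between the log ratio expected for a known copy number and the log ratio actually observed, which is then inverted for unknown loci. This is a minimal sketch under stated assumptions: a diploid reference, a linear response on the log2 scale, and hypothetical function names.

```python
import math
from typing import Sequence, Tuple


def fit_calibration(known_copy_numbers: Sequence[float],
                    observed_log2_ratios: Sequence[float],
                    reference_copies: float = 2.0) -> Tuple[float, float]:
    """Least-squares line: observed = slope * log2(CN / reference) + intercept.

    Arrays typically compress ratios, so the fitted slope is often below 1.
    """
    xs = [math.log2(cn / reference_copies) for cn in known_copy_numbers]
    ys = list(observed_log2_ratios)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct copy numbers to calibrate")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_copy_number(log2_ratio: float, slope: float, intercept: float,
                         reference_copies: float = 2.0) -> float:
    """Invert the calibration line to turn a log ratio into a copy number."""
    corrected = (log2_ratio - intercept) / slope
    return reference_copies * 2 ** corrected
```

For example, a titration series with one to four copies of a chromosome gives four calibration points; the fitted slope and intercept are then passed to `estimate_copy_number` for each segment of interest.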

GT: How do you obtain high resolution?

WW: Resolution in defining the boundary of CNVs obviously depends on the probe density of microarray, which is the characteristic of the platform. On the other hand, resolution in quantifying the magnitude of change of CNV regions can be improved at each step of the CGH study. In DNA preparation, reducing the amount of unintended contaminating cells can better maintain the CNV fold change in the intended cell population, such as tumor samples with varying extent of normal cells. Depending on the experimental design, choosing a suitable control DNA can also improve CNV resolution: using common control DNA from one individual versus many individuals, or matched control DNA. Microarray scanners with high resolution, low noise, and auto-focusing can obtain better fluorescent signal from [an] array. Better balancing of the two fluorescent channels, achievable from DNA labeling as well as array scanning, can reduce the distortion of CNV fold change due to dye bias. The final array image analysis and CNV calling algorithm may be the most important in getting high CNV resolution. There are many options; the criteria to evaluate them should include reproducibility, magnitude of CNV fold change, and robustness to microarray technical variation.
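
The effect of contaminating normal cells on CNV fold change can be illustrated with a simple mixture model. This is a sketch under assumptions not stated in the answer above: normal cells are diploid, the reference is diploid, and the array responds ideally. The function name and parameters are hypothetical.

```python
import math


def expected_log2_ratio(tumor_copies: float, tumor_fraction: float,
                        normal_copies: float = 2.0) -> float:
    """Expected log2 ratio for a tumor/normal cell mixture vs. a diploid reference.

    tumor_fraction is the share of cells in the sample that are tumor (0 to 1).
    """
    if not 0.0 <= tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be between 0 and 1")
    mixed = tumor_fraction * tumor_copies + (1 - tumor_fraction) * normal_copies
    if mixed <= 0:
        raise ValueError("mixture has no copies of the locus; log ratio undefined")
    return math.log2(mixed / normal_copies)


# A single-copy deletion in a pure tumor sample gives log2(1/2) = -1,
# while the same deletion at 50 percent tumor content gives log2(1.5/2).
pure = expected_log2_ratio(tumor_copies=1, tumor_fraction=1.0)
half = expected_log2_ratio(tumor_copies=1, tumor_fraction=0.5)
```

Under this model the signal from a heterozygous deletion shrinks by more than half when only half the cells carry it, which is why reducing contamination at the DNA preparation step helps.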

Biomarkers: Wolfgang Sadee

Genome Technology: What is the best way to identify a common genetic biomarker, and why?

Wolfgang Sadee: For a genetic biomarker to be useful, one would typically like to see a large effect size (odds ratios at least 2 and preferably greater than 3). Any frequent genetic variant that changes the amino acid sequence in an obvious way (e.g., nonsynonymous SNPs) and has strong penetrance is very likely already discovered, at least in the most obvious candidate genes. On the other hand, regulatory polymorphisms represent a vast reservoir of functional genetic variation that has yet to be tapped systematically. Our results have revealed frequent regulatory variants in obvious candidate genes, even though these had already been under intense investigation for many years (e.g., DRD2, TPH2, DAT, CYP3A4, NAT1, ACE, CETP). It is important to understand that there is a wide spectrum of regulatory mechanisms that impinge upon functional gene expression (protein coding and non-coding), all of which are tissue selective: transcription, hnRNA/mRNA processing, splicing, turnover, cellular trafficking, effect of non-coding RNAs, and translation. A near universal approach to discovery of regulatory variants is the use of allelic expression imbalance (AEI) in target tissues of heterozygous carriers (using marker SNPs in the transcribed region). AEI analysis can be applied to mature mRNA, hnRNA, splice variants, and RNAs sequestered in various cellular compartments. For AEI analysis of proteins (less frequently applied as yet), one needs non-synonymous marker SNPs, gaining access to the analysis of multiple steps in translation and protein processing. Coupled with SNP-scanning of the gene locus to find the responsible regulatory polymorphisms, this comprehensive AEI analysis enables one to address the question [of] how the variant affects gene expression in target tissues, and to what extent the newly discovered polymorphism accounts for genetic variation in a given population. Further study of the underlying molecular genetic mechanism yields a solid ground for testing clinical applications, with high success potential.
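
The core AEI measurement can be sketched as a ratio of allele signals in RNA, normalized by the same ratio in genomic DNA from the same heterozygous sample. This is an illustrative sketch only: the normalization by genomic DNA is a common practice assumed here, and the function names and the 0.3 log2 cutoff are hypothetical, not values from the answer above.

```python
import math


def allelic_expression_ratio(cdna_ref: float, cdna_alt: float,
                             gdna_ref: float, gdna_alt: float) -> float:
    """Reference/alternate allele ratio in cDNA, normalized to genomic DNA.

    In a heterozygote the genomic ratio should be 1:1, so dividing by the
    measured gDNA ratio corrects for allele-specific assay bias.
    """
    signals = (cdna_ref, cdna_alt, gdna_ref, gdna_alt)
    if min(signals) <= 0:
        raise ValueError("all allele signals must be positive")
    return (cdna_ref / cdna_alt) / (gdna_ref / gdna_alt)


def has_aei(ratio: float, min_abs_log2: float = 0.3) -> bool:
    """Flag imbalance when the normalized ratio departs from 1 in either direction.

    The cutoff is an assumed value; in practice it would be set from the
    variability of the gDNA measurements.
    """
    return abs(math.log2(ratio)) >= min_abs_log2
```

A sample with cDNA signals of 300 and 100 for the two alleles, and equal gDNA signals, has a normalized ratio of 3 and would be flagged; samples showing imbalance then point to candidate regulatory polymorphisms in the gene locus.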

GT: How does, or would, your approach differ for a rare biomarker?

WS: Rare genetic biomarkers must have strong penetrance to be of clinical value. Before deciding on a search strategy, one would first ask whether the variant is likely to be recessive or dominant, and what frequency would be clinically relevant. Given a recessive mutation with 1 percent allele frequency, the occurrence of the trait would only be apparent in 1/10,000 subjects. However, a candidate gene may harbor frequent functional polymorphisms that modulate the level of functional expression. One is then confronted with the question [of] what effect one would expect in the much more frequent compound heterozygotes. These considerations direct the strategy. Linkage analysis has been extensively used to detect the genes underlying Mendelian disorders. More recently, very large-scale genome-wide association studies have subject cohorts of sufficient numbers to detect rare variants, even those with moderate penetrance. Here, the temptation is to add up all the significant markers (representing functional variants), rare and frequent, intermediate and high penetrance, to arrive at an individual risk assessment for complex disorders. In my view this is an approach fraught with inbuilt errors, and potentially deceptive. Without even knowing the underlying mechanisms, one compounds uncertainties introduced with each additional biomarker; the predictive power could deteriorate with inclusion of more and more markers. Many newly discovered variants are not located in protein coding regions, again highlighting the importance of regulatory mechanisms. Hence, AEI analysis can be applied to human autopsy tissue collections large enough to detect regulatory variants with allele frequencies at or below 1 percent, as one analyzes the heterozygous carriers. Knowing the frequency and effect size of such relatively rare regulatory variants is essential to select suitable targets for biomarker development. But then, of course, countless other strategies can be deployed, each with distinct advantages and drawbacks.
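
The 1/10,000 figure above follows from Hardy-Weinberg proportions: with an allele frequency q of 0.01, homozygotes occur at q squared. The short sketch below reproduces that arithmetic and also shows how many heterozygous carriers a tissue collection would be expected to contain. It assumes Hardy-Weinberg equilibrium; the function names and the collection size of 500 are hypothetical.

```python
from typing import Dict


def genotype_frequencies(q: float) -> Dict[str, float]:
    """Expected genotype frequencies under Hardy-Weinberg equilibrium."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    p = 1.0 - q
    return {
        "homozygous_variant": q * q,
        "heterozygous": 2 * p * q,
        "homozygous_reference": p * p,
    }


def expected_heterozygotes(n_subjects: int, q: float) -> float:
    """Expected number of heterozygous carriers in a collection of n subjects."""
    return n_subjects * genotype_frequencies(q)["heterozygous"]


freqs = genotype_frequencies(0.01)
# Recessive trait: about 1 in 10,000 subjects is homozygous for the variant.
one_in = 1 / freqs["homozygous_variant"]
# Heterozygous carriers are far more common, close to 2 percent of subjects.
carriers_in_500 = expected_heterozygotes(500, 0.01)
```

Because AEI is measured in heterozygotes, a collection of a few hundred tissues is expected to contain several carriers of a 1 percent variant, even though homozygotes would almost never appear.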

List of Resources

Sometimes you need more information. Here are more sources that may help you answer your genetic variation questions.


Almagro-Garcia J, Manske M, Carret C, Campino S, Auburn S, Macinnis BL, Maslen G, Pain A, Newbold CI, Kwiatkowski DP, Clark TG. (2009). SnoopCGH: Software for visualizing comparative genomic hybridization data. Bioinformatics. 25 (20): 2732-3.

Carpaij N, Fluit AC, Lindsay JA, Bonten MJ, Willems RJ. (2009). New methods to analyse microarray data that partially lack a reference signal. BMC Genomics. 10: 522.

Chen J, Wang YP. (2009). A statistical change point model approach for the detection of DNA copy number variations in array CGH data. IEEE/ACM transactions on Computational Biology and Bioinformatics. 6 (4): 529-41.

Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P, Fitzgerald T, Hu M, Ihm CH, Kristiansson K, MacArthur DG, Macdonald JR, Onyiah I, Pang AW, Robson S, Stirrups K, Valsesia A, Walter K, Wei J; the Wellcome Trust Case Control Consortium, Tyler-Smith C, Carter NP, Lee C, Scherer SW, Hurles ME. (2009). Origins and functional impact of copy number variation in the human genome. Nature. E-pub, Oct 7.

Curtis C, Lynch AG, Dunning MJ, Spiteri I, Marioni JC, Hadfield J, Chin SF, Brenton JD, Tavaré S, Caldas C. (2009). The pitfalls of platform comparison: DNA copy number array technologies assessed. BMC Genomics. 10: 588.

Dai Z, Papp AC, Wang D, Hampel H, Sadee W. (2008). Genotyping panel for assessing response to cancer chemotherapy. BMC Medical Genomics. 1: 24.

Fraser HB, Schadt EE. (2010). The quantitative genetics of phenotypic robustness. PLoS One. 5(1):e8635.

Hester SD, Reid L, Nowak N, Jones WD, Parker JS, Knudtson K, Ward W, Tiesman J, Denslow ND. (2009). Comparison of comparative genomic hybridization technologies across microarray platforms. The Journal of Biomolecular Techniques. 20 (2): 135-51.

Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C. (2004). Detection of large-scale variation in the human genome. Nature Genetics. 36 (9): 949-51.

Jaillard S, Drunat S, Bendavid C, Aboura A, Etcheverry A, Journel H, Delahaye A, Pasquier L, Bonneau D, Toutain A, Burglen L, Guichet A, Pipiras E, Gilbert-Dussardier B, Benzacken B, Martin-Coignard D, Henry C, David A, Lucas J, Mosser J, David V, Odent S, Verloes A, Dubourg C. (2009). Identification of gene copy number variations in patients with mental retardation using array-CGH: novel syndromes in a large French series. European Journal of Medical Genetics. E-pub Oct 28.

Kathiresan S, Voight BF, Purcell S, Musunuru K, Ardissino D, Mannucci PM, Anand S, et al. (2009). Genome-wide association of early-onset myocardial infarction with single nucleotide polymorphisms and copy number variants. Nature Genetics. 41 (3): 234-41.

McCarroll SA, Kuruvilla FG, Korn JM, Cawley S, Nemesh J, Wysoker A, Shapero MH, de Bakker PI, Maller JB, Kirby A, Elliott AL, Parkin M, Hubbell E, Webster T, Mei R, Veitch J, Collins PJ, Handsaker R, Lincoln S, Nizzari M, Blume J, Jones KW, Rava R, Daly MJ, Gabriel SB, Altshuler D. (2008). Integrated detection and population-genetic analysis of SNPs and copy number variation. Nature Genetics. 40 (10): 1166-74.

Morris AP, Zeggini E, Lindgren CM. (2009). Identification of novel putative rheumatoid arthritis susceptibility genes via analysis of rare variants. BMC Proceedings. Suppl 7: S131.

Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, González JR, Gratacòs M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad DF, Estivill X, Tyler-Smith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME. (2006). Global variation in copy number in the human genome. Nature. 444 (7118): 444-54.

Shen F, Huang J, Fitch KR, Truong VB, Kirby A, Chen W, Zhang J, Liu G, McCarroll SA, Jones KW, Shapero MH. (2008). Improved detection of global copy number variation using high density, non-polymorphic oligonucleotide probes. BMC Genetics. 9: 27.

Simpson JT, McIntyre RE, Adams DJ, Durbin R. (2009). Copy number variant detection in inbred strains from short read sequence data. Bioinformatics. E-pub Dec 18.

Steenweg ME, Jakobs C, Errami A, van Dooren SJ, Adeva Bartolomé MT, Aerssens P, Augoustides-Savvapoulou P, et al. (2010). An overview of L-2-hydroxyglutarate dehydrogenase gene (L2HGDH) variants: a genotype-phenotype study. Human Mutation. E-pub Jan 5.

Talseth-Palmer BA, Bowden NA, Hill A, Meldrum C, Scott RJ. (2008). Whole genome amplification and its impact on CGH array profiles. BMC Research Notes. 1:56.

Welch RA, Lazaruk K, Haque KA, Hyland F, Xiao N, Wronka L, Burdett L, Chanock SJ, Ingber D, De La Vega FM, Yeager M. (2008). Validation of the performance of a comprehensive genotyping assay panel of single nucleotide polymorphisms in drug metabolism enzyme genes. Human Mutation. 29 (5): 750-6.

Xing C, Xing G. (2009). Power of selective genotyping in genome-wide association studies of quantitative traits. BMC Proceedings. Suppl 7: S23.



Entrez SNP



Human Variation: Cause and Consequence
June 20-23 / Heidelberg, Germany

Microarray World Congress
Oct. 28-29 / San Diego, CA
Select Biosciences

European Biomarkers Summit
Nov. 9-10 / Florence, Italy
Select Biosciences