Structural Variation Technical Guide

Table of Contents

Letter from the Editor
Index of Experts
Q1: What are the considerations that determine whether you use SNP arrays, array CGH, sequencing, or other technologies when looking for structural variants?
Q2: How do you account for biases in your approach of choice?
Q3: What are the best algorithms for calling structural variants in sequencing data?
Q4: How do you characterize breakpoints for structural variants?
Q5: How do you perform clinical or functional interpretation of structural variants?
Q6: What methods do you use to validate structural variants identified in array or sequencing studies?
List of Resources

Letter from the Editor

It’s estimated that structural variants — comprising copy number variants, deletions, duplications, insertions, inversions, and translocations — account for more differences between human genomes in terms of nucleotides than their more extensively studied cousins, SNPs. And in recent years, genomic tools like array comparative genomic hybridization, SNP genotyping arrays, and next-generation sequencing have uncovered numerous links between structural variants and disorders such as autism, schizophrenia, and Crohn’s disease, and have also highlighted the role such variants play in cancer.

But the technology for mapping these variants still poses challenges for researchers. Arrays have long been the go-to method for researchers in the field, but the information they can provide is limited. Next-gen sequencing, meanwhile, provides a great deal more information, but short-read technologies require complex bioinformatics methods to accurately stitch together large variants. And approaches for interpreting these findings can be as varied as those for generating the data in the first place.

To address these questions, Genome Technology assembled a panel of researchers who are leading the way in structural variant analysis. We asked these experts a number of questions about the considerations they take into account when planning a structural variation experiment. We hope you will find their insight helpful for your future work in this area.

— Bernadette Toner

Index of Experts

Many thanks to our experts for taking the time to contribute to this technical guide, which would not be possible without them.

Can Alkan
Bilkent University
Department of Computer Engineering

Charles Lee
Harvard Medical School
Molecular Genetic Research Unit

Tomas Marques-Bonet
Universitat Pompeu Fabra
Primate Genomics Lab

Tobias Rausch
European Molecular Biology Laboratory
Genomics Core Facility

Andrew Sharp
Mount Sinai School of Medicine
Department of Genetics and Genomic Sciences

Q1: What considerations determine the technology or technologies you use to look for structural variants?

The major consideration in determining which method to use for structural genomic variant detection is cost. Array methods provide the most information for the least cost. Sequencing requires significant bioinformatic analysis, but when analyzed properly it can provide breakpoint sequence information for as many as half of the structural variants identified, which is important if one wants to understand the etiology of a structural variant’s formation (i.e., non-allelic homologous recombination-based, non-homologous end joining-based, DNA replication error-based, et cetera).

The more complementary technologies that are applied to a given genome, the more accurate the resulting structural genomic variation profile. At a minimum, I believe that an array should be run for each human genome being sequenced and the information from both technologies combined in a meaningful way. That is what is currently being done for the 1000 Genomes Project.

— Charles Lee

Compared to conventional cytogenetic methods such as karyotyping, array-based methods have greatly expanded the size range of detectable structural variants at fairly high sensitivity. Nevertheless, the lower bound on structural variant size has remained somewhere in the low-kilobase range, depending on the array type used, at least for widely used array platforms. In addition, array-based methods are limited to what is probed on the array; hence, for example, insertions by definition cannot be discovered. Similarly, only unbalanced rearrangements (e.g., deletions and duplications) can be ascertained, whereas arrays are blind to balanced rearrangements, including inversions and balanced translocations.

Due to these limitations, we do favor and almost exclusively use sequencing, which allows us to discover the full spectrum of structural variants at single base-pair resolution. We now tend to multiplex samples in a first pass, by using low coverage sequencing (about 2x coverage), to immediately screen a larger sample cohort at fairly adequate resolution. Samples of further interest are then selected and sequenced deeper.

— Tobias Rausch

We use a mix of sequencing and arrayCGH. Both approaches are in fact complementary, as we have previously demonstrated. There are variants out there that arrayCGH just wouldn’t be able to find, such as balanced rearrangements (inversions, translocations) and absolute copy numbers for duplications.

Sequencing is very powerful, yet it is very hard to find CNVs in regions with high repeat content; for those arrayCGH can give better results.

— Can Alkan/Tomas Marques

We’re using NanoString, which is probably a lot cheaper than doing whole-genome sequencing and enables you to look at high-copy sequences, so highly repetitive regions of the genome. We’re looking at some genes that have up to 1,000 copies in some primates, for example. So [these are] bits of the genome that never worked on arrays and people kind of ignore, and even sequencing-based approaches, unless you use the right methods, won’t work very well on them. We’re doing a few other basic things, like qPCR, but mostly focusing on NanoString and looking at weird, repeated regions.

We could use high-throughput sequencing, but the cost of the sample then works out to be expensive. And we’re doing more population-scale studies. So we’re doing some association studies looking at highly variable genes as risk factors for common diseases. And we’re looking at cohorts of hundreds of thousands of individuals and the prospect of whole-genome sequencing is not really viable unless you have enormous grants on thousands of individuals.

— Andrew Sharp

Q2: How do you account for biases in your approach of choice?

An integral part of our structural variant analysis pipeline is sequencing quality control, including the removal of duplicate reads and adapters, trimming of low-quality bases and scrutinizing the insert size distribution — crucial preliminary steps to achieve high accuracy and sensitivity in structural variant calling.

We also frequently employ mate-pair libraries with larger spanning coverage to detect SVs in highly repetitive regions because these regions are known to cause an ascertainment bias in SV calling. Despite these efforts we still face a severe reference bias that we hope to overcome in the future by gradually switching to a hybrid of mapping- and assembly-based SV calling methods, a strategy that is almost dictated by the steadily increasing read lengths.

— Tobias Rausch
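
As an illustration of the insert-size check mentioned above, here is a minimal Python sketch. It is not the panelist's pipeline: the function names, the toy insert sizes, and the cutoff of five median absolute deviations are all assumptions chosen for the example. It summarizes a library's insert sizes with robust statistics and flags read pairs that fall far outside the bulk of the distribution.

```python
from statistics import median

def insert_size_summary(insert_sizes):
    """Return (median, MAD) of observed insert sizes; both are robust to outlier pairs."""
    med = median(insert_sizes)
    mad = median(abs(x - med) for x in insert_sizes)
    return med, mad

def flag_discordant(insert_sizes, n_mads=5):
    """Return indices of pairs whose insert size lies more than n_mads MADs from the median.

    n_mads=5 is an illustrative cutoff, not a recommended value.
    """
    med, mad = insert_size_summary(insert_sizes)
    # Guard against a zero MAD in tiny or perfectly uniform samples.
    cutoff = n_mads * max(mad, 1)
    return [i for i, x in enumerate(insert_sizes) if abs(x - med) > cutoff]

# Toy library: nine pairs near 300 bp and one pair spanning about 3 kb.
sizes = [298, 305, 301, 295, 310, 300, 302, 2950, 299, 304]
print(insert_size_summary(sizes))
print(flag_discordant(sizes))
```

A broad or multi-modal distribution would inflate the MAD and make pairs that span real variants harder to separate from library noise, which is one reason to inspect this distribution before calling.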

Cross-validation. If two (or more) different approaches show the same thing, it is very likely that it is true. For the remainder, some experience and also sometimes visualization of the data helps to judge whether a call from one approach makes sense or not. If the call is an “interesting” one, you can always follow up with some other technique anyway.

— Can Alkan/Tomas Marques
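
One common way to decide whether two approaches "show the same thing" is to require reciprocal overlap between calls. The Python sketch below is an illustration under stated assumptions, not the panelists' method: the tuple layout, the function names, and the 50 percent threshold are hypothetical choices for the example.

```python
def reciprocal_overlap(a, b):
    """Return the smaller of the two overlap fractions for intervals a and b.

    Each interval is (start, end) with end > start.
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))

def concordant_calls(calls_a, calls_b, min_ro=0.5):
    """Return calls from calls_a supported by at least one call in calls_b.

    Each call is (chrom, start, end, sv_type). A call is supported when a call
    of the same type on the same chromosome meets the reciprocal overlap cutoff.
    """
    supported = []
    for chrom, start, end, sv_type in calls_a:
        for chrom_b, start_b, end_b, type_b in calls_b:
            same_locus = chrom == chrom_b and sv_type == type_b
            if same_locus and reciprocal_overlap((start, end), (start_b, end_b)) >= min_ro:
                supported.append((chrom, start, end, sv_type))
                break
    return supported

# Toy call sets, e.g. one from an array and one from sequencing.
array_calls = [("chr1", 1000, 5000, "DEL"), ("chr2", 200, 900, "DUP")]
seq_calls = [("chr1", 1200, 5100, "DEL"), ("chr2", 5000, 6000, "DUP")]
print(concordant_calls(array_calls, seq_calls))
```

Calls left unsupported by this comparison are the "remainder" that would need visual inspection or follow-up with another technique.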

NanoString is a digital counting technology, so it’s unique in terms of most CNV techniques. Arrays work by looking at intensity over a spot on a glass slide, and sequencing has various different methods — paired end mapping, split reads, read depth. NanoString actually has probes that bind in solution to your DNA. And they have essentially a fluorescent barcode on them. And that probe/target complex is then laid onto a cassette and is scanned and the computer physically counts the number of probes that you have. So it’s unique in CNV technologies in that it looks and says how many copies of this sequence are there in this genome by actually counting physically the number of molecules.

The weakness is that it’s using a probe to hybridize to a sequence, so it’s partly dependent on how good that probe is. And whether there is cross-hybridization that you may get with some probe sequences.

We’ve found that to be a fairly small problem. We’ve done a lot of cross-platform validations where we have samples that have been assayed by other technologies: array CGH, paralog ratio test, qPCR, sequencing read depth. And when we can look at the same samples, we can take some HapMap individuals who have been assayed many times by different methods, assay them with our technique using NanoString, and then see how well these two techniques match up.

Basically for everything we’ve looked at, the NanoString in most cases seems to outperform the other techniques. It’s probably most analogous to read depth estimates. But obviously if you’re only interested in, say, 10, 20, or 50 loci you don’t have to sequence the whole genome to get that. You can run a quick $20 to $30 NanoString assay and get your answer.

— Andrew Sharp
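
The idea of turning digital probe counts into a copy number can be sketched in a few lines of Python. This is a simplified illustration, not the vendor's analysis procedure: it assumes a set of invariant reference probes that are each present at two copies, and it assumes every probe hybridizes equally well, which is exactly the weakness described above. The function name and the counts are hypothetical.

```python
def estimate_copy_number(target_count, reference_counts, reference_copies=2):
    """Estimate copies of a target sequence from digital probe counts.

    reference_counts holds counts for invariant probes, each assumed to be
    present at reference_copies per genome. Probe efficiency is assumed equal
    across probes, so cross-hybridization or a weak probe would bias the result.
    """
    mean_reference = sum(reference_counts) / len(reference_counts)
    counts_per_copy = mean_reference / reference_copies
    return target_count / counts_per_copy

# Made-up counts: three invariant probes near 1,000 and a target at 2,510.
estimate = estimate_copy_number(2510, [980, 1020, 1000])
print(round(estimate, 2), round(estimate))
```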

Q3: What are the best algorithms for calling structural variants in sequencing data?

The considerations depend on what your experiment is and what technology you’re using. And it depends on the kinds of things you want to look for. Most of these will require special pipelines. For example, if you want to do split-read mapping, you need a read mapper that is tolerant of large insertions or deletions within a read, and specifically look for those as your method of identifying CNVs. A lot of standard pipelines, if there’s a read that has a 100-base-pair insertion in it, will say, ‘that’s a mapping error,’ and throw that in the bin. But that may be exactly what you’re looking for. You need to tailor your analysis pipelines to whatever methods you’re using.

You need to think about your experiment and your biological question and then maybe decide what approach may be optimal.

— Andrew Sharp
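
To make the point about large gaps within a read concrete, the Python sketch below scans a SAM-style CIGAR string and reports large insertions and deletions instead of discarding the alignment. It is a minimal illustration: the function name and the 50 bp size cutoff are assumptions, and a real pipeline would also have to consider mapping quality and soft-clipped reads.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def large_indels(cigar, ref_start, min_size=50):
    """Return (kind, ref_pos, length) for insertions/deletions of at least min_size.

    ref_start is the 0-based reference position of the first aligned base.
    min_size=50 is an illustrative cutoff.
    """
    events = []
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "ID" and length >= min_size:
            events.append(("INS" if op == "I" else "DEL", pos, length))
        # Only these operations advance the position on the reference.
        if op in "MDN=X":
            pos += length
    return events

print(large_indels("60M100I40M", 1000))       # read carrying a 100 bp insertion
print(large_indels("50M2D30M120D20M", 500))   # small 2 bp gap ignored, 120 bp deletion kept
```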

There appear to be advantages and disadvantages for each computer algorithm. For example, some algorithms are designed to more accurately identify duplications while others are better at detecting deletions. There also appears to be an optimal size range for structural variant detection for many of the algorithms. Hence, for the 1000 Genomes Project, a combination of five different algorithms seemed to provide the most structural genomic variation data with the least amount of false positives. The program GenomeSTRiP [Genome STRucture in Populations] is very good at detecting common copy number variants. With respect to de novo assembly-based algorithms, I am still a bit skeptical about these as I haven’t seen enough validation information that convinces me that the false positive rates are minimal.

— Charles Lee

From the first results of the 1000 Genomes Project it has become clear that the different classes of structural variants such as deletions, duplications, inversions, and translocations are not equally likely to be found with current sequencing and analysis methods. In addition, cancer sequencing projects have shown that germline and somatic structural variants can have quite different properties, with some somatic rearrangements creating highly complex genomic signatures that might fool current methods.

Overall, deletion signatures are generally the easiest to discover, partly because multiple approaches, including read-depth, paired-end mapping, split-read alignment, and local assembly can be used for their discovery, and also since deletions typically cause a clear separation of copy-number states (1 or 0, compared to the normal diploid copy-number state 2). This is a great help for ascertaining deletions, and for germline variants, GenomeSTRiP is an excellent discovery tool.

High-level duplications are more difficult because the copy-number states are less well-separated and for dispersed duplications paired-end mapping and split-read analysis have not yet been shown to work efficiently. Read-depth-based methods are, however, unable to ascertain balanced rearrangements such as inversions and translocations. For these types of variants paired-end and split-read alignment methods appear to be the best choice; these two approaches can also ascertain complex somatic rearrangements in cancer. This rationale led us to develop Delly, an integrated paired-end and split-read analysis method, which we recently applied in the 1000 Genomes Project and in cancer resequencing initiatives, as part of our general efforts to comprehensively characterize germline and somatic structural variation.

— Tobias Rausch
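
The separation of copy-number states described above can be illustrated with a small read-depth calculation in Python. This is a sketch under simplifying assumptions, not a published caller: it assumes most windows are diploid, so the median depth corresponds to two copies, and it ignores GC content and mappability corrections. The function name and depths are hypothetical.

```python
from statistics import median

def copy_number_states(window_depths, ploidy=2):
    """Convert per-window read depths into integer copy-number estimates.

    The median depth is taken as the depth of a normal window at `ploidy` copies.
    """
    baseline = median(window_depths)
    return [round(ploidy * depth / baseline) for depth in window_depths]

# Toy depths: a heterozygous deletion, a homozygous deletion, and a gain.
depths = [30, 31, 29, 15, 14, 0, 1, 30, 32, 46, 30]
print(copy_number_states(depths))
```

Going from two copies to one halves the expected depth, whereas going from four copies to five raises it by only a quarter, which is why high-level duplications are harder to resolve with this kind of signal.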

There is no all-inclusive Swiss army knife algorithm that I would classify as “best.” All algorithms have different strengths and biases that depend on the class of SV (deletion, insertion, inversion, duplication, et cetera), the size (large versus small), and also the location (unique versus repetitive sequence).

In the pilot phase of the 1000 Genomes Project, we used 19 different algorithms on the same sequence data and there are substantial differences among all of them, even if we look at only the validated calls. If you want higher detection sensitivity, especially in repetitive regions, VariationHunter would be a good choice, but the false discovery rate would also be higher. If you need to keep your FDR low and if you have many genomes at low sequence coverage, GenomeSTRiP would be the one to use. For segmental duplications your only choice for now is a read-depth-based method we developed earlier. This list can be expanded for all other algorithms like Cortex, CNVnator, Delly, Pindel, BreakDancer, RDXplorer, et cetera. What “the best algorithm” is depends on your needs and the data properties.

— Can Alkan/Tomas Marques

Q4: How do you characterize breakpoints for structural variants?

Further inspection of the region with split-read alignments. Some algorithms have this built in, such as SPLITREAD, Pindel, Cortex, and Delly. For other algorithms we follow up with a local assembly approach like TIGRA.

— Can Alkan/Tomas Marques

That partly depends on what it is you’re looking for. A lot of the stuff we’ve done has been looking at CNVs and recurrent microdeletions that happen in complicated bits of the genome, and that means that the rearrangements are mediated by non-unique homologous sequences at both breakpoints, and that makes the breakpoints really hard to pin down.

With newer sequencing-based approaches the simpler CNVs will almost fall out. For example, if you’re doing split-read mapping, if you’ve found something with split reads you know the exact start and end points based on where your single read is now split.

A limitation of paired-end mapping is if you’re using, say, a library of 1 kb fragments, you’ll never be able to see an insertion that’s bigger than 1 kb because now both ends of it will maybe map to something that doesn’t exist in the genome.

— Andrew Sharp

One of the key reasons for characterizing structural variant breakpoints is to understand the mechanism of formation for a given structural variant as well as to garner insights into the ancestral origin of the variant. For example, when one looks at the breakpoints of a given structural variant among several individuals, one can infer independent origins if the breakpoints differ between individuals. Structural variants can recurrently occur by non-allelic homologous recombination events, as evidenced by the presence of repeats at the ends of a structural variant. Non-homologous end joining will leave either no pattern at the breakpoints or in some cases will leave what some refer to as a “molecular scar,” which is the addition of one base to three bases of DNA at the breakpoint ends.

— Charles Lee

Based on our split-read approach in Delly, we can directly ascertain any micro-homologies and microinsertions occurring at the breakpoint as long as these are smaller than half the read length. More complex breakpoint signatures with larger genomic shards inserted at the breakpoint still present a problem, and we believe local assembly may represent the best choice here. To infer SV ancestral states and formation mechanisms, including non-allelic homologous recombination and non-homologous end joining, we make use of our BreakSeq SV classification pipeline.

— Tobias Rausch
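
For a simple deletion, breakpoint microhomology can be measured directly from the reference sequence, as the Python sketch below illustrates. This is not Delly or BreakSeq code: the function name, coordinates, and toy sequences are hypothetical, and the sketch handles only a clean deletion with no inserted bases.

```python
def deletion_microhomology(ref, start, end):
    """Return the microhomology length around a deletion of ref[start:end].

    Coordinates are 0-based with end exclusive. Bases are compared between the
    start of the deleted segment and the sequence just after it (forward), and
    between the sequence just before the deletion and the end of the deleted
    segment (backward). The result is capped at the deletion length.
    """
    forward = 0
    while (end + forward < len(ref) and start + forward < end
           and ref[start + forward] == ref[end + forward]):
        forward += 1
    backward = 0
    while (start - 1 - backward >= 0 and end - 1 - backward >= start
           and ref[start - 1 - backward] == ref[end - 1 - backward]):
        backward += 1
    return min(forward + backward, end - start)

# Deleting GTCCCCAG leaves a GT that could belong to either side: 2 bp of microhomology.
print(deletion_microhomology("TTTACGTCCCCAGGTAAT", 5, 13))
# No shared bases at either junction: a blunt join.
print(deletion_microhomology("TTTACGGCCCCAGTAAAT", 5, 13))
```

The same ambiguity means the deletion can be placed at several equivalent positions, so breakpoints reported by different tools may differ by the microhomology length without actually disagreeing.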

Q5: How do you perform clinical or functional interpretation of structural variants?

The key step is having parental DNA where, if the two parents are unaffected and the child is affected, we can simply ask, ‘Do we see the same thing in the parents or other family members?’ The problem with that (and that’s the main approach used in clinical testing) is that it basically presumes that you have 100 percent penetrance, and that is not always true.

Another method is using CNV data from normal controls. So if you have a CNV that you see in an affected child, if you then have tested 1,000 unaffected individuals, and it’s absent in all those unaffected individuals, that again gives you evidence that it’s not a common CNV, that it’s not seen in the normal population. But, again, that presumes that there’s perfect correlation between genotype and phenotype [and] it also presumes that your screening of normal controls is perfect; that is, you’ve looked at enough individuals to have confidence that if it was there you would have seen it.

— Andrew Sharp
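
The question of whether enough controls have been screened can be put in rough numbers. The Python sketch below is a back-of-the-envelope illustration, not a clinical standard: it assumes controls are independent, that the assay never misses a variant that is present, and that `carrier_freq` (a hypothetical parameter) is the fraction of healthy individuals who carry the variant.

```python
import math

def prob_absent_in_controls(carrier_freq, n_controls):
    """Probability that a variant carried by carrier_freq of people is seen in none of n_controls."""
    return (1 - carrier_freq) ** n_controls

def controls_needed(carrier_freq, confidence=0.95):
    """Smallest number of controls that would reveal the variant at least once
    with the given confidence, under the same assumptions."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - carrier_freq))

# A variant carried by 1 in 1,000 healthy people is still missed by
# 1,000 controls more than a third of the time.
print(prob_absent_in_controls(0.001, 1000))
print(controls_needed(0.001))
```

Under these assumptions, absence from 1,000 controls argues against a common variant but says little about one carried by one person in a thousand.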

For prenatal and postnatal testing, several criteria are used to assess pathogenicity of a copy number variant, including whether the CNV is present in an affected/unaffected parent, the gene content of the genomic imbalance, and to some degree the size of the imbalance.

Several databases are also used to help with the assessment of pathogenicity of a CNV. A lot of the information in these databases is not absolute in nature, as issues such as incomplete penetrance … and variable expressivity need to be weighed in.

For sequencing-based tests, the data from the 1000 Genomes Project is also useful for knowing what structural variants have been identified in healthy individuals.

— Charles Lee

A basic method that we use to relate somatic structural variants to phenotypes is gene set enrichment analysis. For instance, in our recent study on genomic rearrangements in a pediatric brain cancer we observed a more than 2-fold enrichment of known oncogenes in the highly amplified segments of the tumor, which we inferred to be caused by chromothripsis.

Another strategy to elucidate genotype-phenotype relationships is to overlay all structural variants of patients sharing the same phenotype to identify a critical genomic region disrupted in all patients.

Beyond these gene-centered analyses we also try to integrate the somatic structural variants and the somatic point mutations on a gene-network level to identify a subset of mutated genes that jointly alter a critical biological pathway.

— Tobias Rausch
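
A fold enrichment like the one described above is often paired with a significance test. The Python sketch below uses a hypergeometric over-representation test, which is a simple stand-in and not necessarily the analysis used in the study mentioned. All gene counts in the example are made up for illustration, and the function names are hypothetical.

```python
from math import comb

def fold_enrichment(hits, drawn, category_total, population):
    """Observed hits divided by the number expected by chance."""
    expected = drawn * category_total / population
    return hits / expected

def hypergeom_pvalue(hits, drawn, category_total, population):
    """P(X >= hits) when `drawn` genes are sampled without replacement from a
    population containing `category_total` genes of the category of interest."""
    upper = min(drawn, category_total)
    tail = sum(
        comb(category_total, k) * comb(population - category_total, drawn - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, drawn)

# Made-up numbers: 20,000 genes, 500 known oncogenes, 200 genes in amplified
# segments, 12 of which are oncogenes (5 would be expected by chance).
print(fold_enrichment(12, 200, 500, 20000))
print(hypergeom_pvalue(12, 200, 500, 20000))
```

The test treats genes as independent draws, which is optimistic for amplified segments where neighboring genes are gained together.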

Q6: What methods do you use to validate structural variants identified in array or sequencing studies?

I’m of the opinion that a lot of the time these things don’t require validation because the technology is becoming robust enough that I think they stand in their own right.

Now, that’s not always true because some things can get screwed up. So if you believe something really stands out — if it’s a barn door CNV that you’re really confident is true — then I don’t think you need validation. But if it’s something small and it’s kind of borderline, you may want to go in and be sure that it’s a real event with some second technique. So that could be FISH, it could be qPCR, it could be simply repeating the hybridization.

— Andrew Sharp

We often use sequencing (of breakpoints and using read-depth information) to validate structural variants identified by arrays, and array-based methods to validate structural variants (specifically CNVs) identified solely from sequence analyses.

qPCR/multiplex ligation-dependent probe amplification is also commonly used for validation of copy number variants identified by either method. Another method that can be used is fluorescence in situ hybridization, which is excellent for visualizing duplications, deletions, translocations, and insertions of greater than 10 kb in size.

If the breakpoints of inversions are known, two-color break-apart assays will yield a “combinatorial” color when no inversion has taken place and separate colors when an inversion has occurred.

— Charles Lee

We mostly use PCR and capillary sequencing, both to validate interesting candidate structural variants and also to determine a false discovery rate of our SV calling algorithm. Selectively, we also employ FISH. In the future, we plan to investigate technologies such as the NanoString nCounter Analysis System and orthogonal sequencing approaches such as Pacific Biosciences for validations.

— Tobias Rausch

Usually, for unbalanced events (such as duplications and deletions) array CGH has been the standard for validating sequence-based techniques. However, sometimes a simple PCR can do the job if you are exploring non-repetitive regions of the genome, or qPCR to check for simple copy number differences. FISH might serve as well for large events, including large inversions, and when cell lines are available, but FISH is not a high-throughput method, as you cannot perform validations of hundreds of potential SVs. Still, the most difficult SVs to validate are inversions. Because they are balanced events, no high-throughput technique has so far been systematically explored for them.

— Can Alkan/Tomas Marques

List of Resources

A compendium of papers and online resources to help address your structural variant analysis questions.

Publications

Albers CA, Paul DS, Schulze H, et al. (2012). Compound inheritance of a low-frequency regulatory SNP and a rare null mutation in exon-junction complex subunit RBM8A causes TAR syndrome. Nature Genetics. Feb 26;44(4):435-9, S1-2.

Conrad DF, Pinto D, Redon R, et al. (2010). Origins and functional impact of copy number variation in the human genome. Nature. Apr 1;464(7289):704-12.

Hillmer AM, Yao F, Inaki K, et al. (2011). Comprehensive long-span paired-end-tag mapping reveals characteristic patterns of structural variations in epithelial cancer genomes. Genome Res. May;21(5):665-75.

Handsaker RE, Korn JM, Nemesh J, McCarroll SA. (2011). Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nature Genetics. 43:269-276.

Lee E, Iskow R, Yang L, et al. (2012). Landscape of somatic retrotransposition in human cancers. Science. Aug 24;337(6097):967-71.

Northcott PA, Shih DJ, Peacock J, et al. (2012). Subgroup-specific structural variation across 1,000 medulloblastoma genomes. Nature. Aug 2;488(7409):49-56.

Park H, Kim JI, Ju YS, et al. (2010). Discovery of common Asian copy number variants using integrated high-resolution array CGH and massively parallel DNA sequencing. Nature Genetics. May;42(5):400-5.

Rausch T, Jones DT, Zapatka M, et al. (2012). Genome sequencing of pediatric medulloblastoma links catastrophic DNA rearrangements with TP53 mutations. Cell. Jan 20;148(1-2):59-71.

Rausch T, Zichner T, Schlattl A, Stütz AM, Benes V, Korbel JO. (2012). DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. Sep 15;28(18):i333-i339.

Talkowski ME, Ernst C, Heilbut A, et al. (2011). Next-generation sequencing strategies enable routine detection of balanced chromosome rearrangements for clinical diagnostics and genetic research. Am J Hum Genet. Apr 8;88(4):469-81.

Teague B, Waterman MS, Goldstein S, et al. (2010). High-resolution human genome structure by single-molecule analysis. Proc Natl Acad Sci USA. Jun 15;107(24):10848-53.

Software

BreakDancer: http://gmt.genome.wustl.edu/breakdancer/current/
BreakSeq: http://sv.gersteinlab.org/breakseq/
CNVnator: http://sv.gersteinlab.org/
Cortex_var: http://cortexassembler.sourceforge.net/index_cortex_var.html
DELLY: http://www.embl.de/~rausch/delly.html
GenomeSTRiP (Genome STRucture in Populations): http://www.broadinstitute.org/software/genomestrip/
Pindel: http://www.ebi.ac.uk/~kye/pindel/
RDXplorer: http://rdxplorer.sourceforge.net/
SPLITREAD: http://splitread.sourceforge.net/
VariationHunter-Sc: http://compbio.cs.sfu.ca/strvar.htm

Databases

Database of Genomic Variants: http://projects.tcag.ca/variation/
Database of genomic Structural Variation (dbVar): http://www.ncbi.nlm.nih.gov/dbvar/
DECIPHER: http://decipher.sanger.ac.uk/
International Standards for Cytogenomic Arrays Consortium Database: http://www.iscaconsortium.org/