With the cost of sequencing coming down, looking for structural variation genome-wide is getting easier. To find structural variants (everything that isn't a SNP, including indels, copy number variants, inversions, and translocations), researchers have typically relied on array CGH and other array-based tools. Now, however, high-throughput sequencing makes it possible to predict structural variants more accurately. Challenges to data analysis remain, though, including how to incorporate short inserts into current mapping algorithms and how to account for coverage, insert size, and read length that vary across next-gen platforms. To this end, work out of Elaine Mardis' lab led by Ken Chen has resulted in a new set of algorithms that improve detection of common variants and of somatic variants in tumor versus normal samples.
Collectively called BreakDancer, the software package consists of two complementary algorithms. Evaluating paired-end reads from an Illumina sequencer, the scientists showed that BreakDancerMax detects five types of structural variants in pooled or individual DNA samples: deletions, insertions, inversions, and intrachromosomal and interchromosomal translocations. BreakDancerMini detects small indels, typically 10 to 100 base pairs, which BreakDancerMax misses. The software was described in a paper published in August in Nature Methods.
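To make the paired-end idea concrete, here is a minimal Python sketch of how a single mapped read pair might be labeled by chromosome, strand orientation, and distance between mates. It illustrates the general approach and is not BreakDancer's actual code: the ReadPair type, the classify_pair and tally functions, the three-standard-deviation cutoff, and the rule that treats outward-facing pairs as intrachromosomal translocations are all assumptions made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ReadPair:
    """One mapped read pair (hypothetical record, not a BreakDancer type)."""
    chrom1: str
    pos1: int      # leftmost mapped coordinate of the first read
    strand1: str   # '+' or '-'
    chrom2: str
    pos2: int      # leftmost mapped coordinate of the second read
    strand2: str


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Label a pair with the kind of variant it hints at.

    Assumes a library whose normal pairs face each other
    ('+' read to the left of the '-' read) with a roughly normal
    insert-size distribution. The n_sd cutoff is an assumed default.
    """
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal translocation"
    if pair.strand1 == pair.strand2:
        return "inversion"

    # Order the two reads by position so we can check which one is on the left.
    (left_pos, left_strand), (right_pos, _) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    if left_strand == "-":
        # Reads point away from each other; treated here (by assumption)
        # as evidence of an intrachromosomal rearrangement.
        return "intrachromosomal translocation"

    span = right_pos - left_pos
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"    # mates map farther apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"   # mates map closer together than expected
    return "normal"


def tally(pairs, mean_insert, sd_insert):
    """Count how many pairs fall into each category."""
    return Counter(classify_pair(p, mean_insert, sd_insert) for p in pairs)
```

A single anomalous pair proves little on its own; in practice a caller would look for several such pairs clustered at the same locus before reporting a variant, and would estimate the insert-size mean and standard deviation separately for each library.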
In the study, they compared their algorithm with similar ones, specifically Evan Eichler's VariationHunter and Michael Brudno's MoDIL, on MAQ map files of the Yoruban genome. In general, Chen says, BreakDancerMini showed higher sensitivity and specificity than either of the other algorithms. They also applied BreakDancer to detect somatic variations in an AML tumor sample and variants in the 1,000 Genomes dataset.
What sets BreakDancer apart, Chen says, is its range: the other algorithms target indels longer than 100 base pairs. "What we have intended to do is to cover the entire range, from 10 base pairs to virtually no limit," he says, which significantly widens the range of indels that can be detected.
Both Max and Mini support pooled samples, or multiple samples and libraries. To look at variation in the AML samples, for example, they analyzed a pair of genomes. "We were interested in finding somatic variations, so basically we compared tumor genome versus the normal genome," Chen says. Because most cancer is caused by somatic alterations in the tumor genome, it's important to be able to compare tumor to normal. Here, BreakDancer improved the specificity of somatic variant prediction by eliminating germline, or inherited, variants. "This is a novel application," Chen says, because it can actually perform a "head-to-head" comparison of tumor and normal samples.
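One simple way to picture the tumor-versus-normal comparison is as a subtraction: any variant predicted in the tumor that also shows up in the matched normal is presumed germline and set aside. The Python sketch below shows that idea only. It is not a description of how BreakDancer itself handles paired samples, and the Call record, the overlaps and somatic_candidates functions, and the 100-base-pair slop are hypothetical choices made for the example.

```python
from typing import NamedTuple


class Call(NamedTuple):
    """A predicted structural variant (hypothetical record for this sketch)."""
    chrom: str
    start: int
    end: int
    sv_type: str


def overlaps(a, b, slop=0):
    """True if two calls have the same type and chromosome and their
    intervals overlap once each is padded by `slop` base pairs."""
    return (
        a.chrom == b.chrom
        and a.sv_type == b.sv_type
        and a.start <= b.end + slop
        and b.start <= a.end + slop
    )


def somatic_candidates(tumor_calls, normal_calls, slop=100):
    """Keep tumor calls with no matching call in the normal sample.

    The slop (an assumed value) allows for imprecise breakpoint estimates.
    """
    return [
        t for t in tumor_calls
        if not any(overlaps(t, n, slop) for n in normal_calls)
    ]
```

The quadratic scan is fine for a handful of calls; with genome-scale call sets, sorting the calls or using an interval index would be the more practical choice.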
After running BreakDancer on the AML genomes to predict approximate size and location of structural variants, the researchers used assembly programs Velvet and Phrap to refine their predictions. By mapping all the reads back to their predicted variant loci, they were able to confirm that the variants do exist. "We found by applying this BreakDancer detection algorithm and assembly afterwards [to tumor samples], we can very efficiently discover and secure those predictions," Chen says. "[The] combined approach is [more] efficient and accurate than using either approach in isolation."
Development of the tools began more than a year ago, and the algorithm has already been applied to many large-scale sequencing data sets. Chen says it will continue to evolve alongside big sequencing projects, especially the 1,000 Genomes Project.