The ability to understand genomic structural variation through next-generation sequencing promises to deliver crucial diagnostic tools that could make personalized medicine a reality. But according to Simon Fraser University's Cenk Sahinalp and his colleagues, the structural variant data that next-generation sequencing produces will be useful, both for drug design and diagnostics, only once recurrent biomarkers in patient subgroups can be identified. Unfortunately, studies on specific cancer types have so far been unable to uncover recurrent structural biomarkers because current bioinformatics tools cannot accurately pinpoint or prioritize key structural variants.
"Even if you sequence the same genome twice and find the structural variants with respect to the reference genome with all these available mapping tools, you will make some mistakes," Sahinalp says. "Then, you are comparing one experiment next to another on the same genome because the number of structural variants between a typical genome and a reference genome are on the order of several thousand. You make hundreds of errors, and the error rate can just keep accumulating. So, you are talking about several hundred false positives in structural variation discovery."
The conventional approach to structural variation discovery relies primarily on paired-end sequencing. Inserts from a donor genome are sequenced at both ends, and the reads are then aligned to a reference genome. Deviations in the distance between the end reads indicate whether insertions or deletions are present, as long as the mapping loci are identified correctly. Popular variation discovery software tools like PEMer, Pindel, and BreakDancer all take this type of approach, reporting only the best mapping of each read. While these tools are effective in unique regions of a genome, they ignore possible multiple alignments in repeat regions, which has been shown to considerably reduce the accuracy of variation discovery.
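The insert-size logic behind this approach can be illustrated with a minimal sketch. This is not the actual method of any of the tools named above; the expected fragment length and tolerance below are assumed values for illustration only.

```python
# Minimal sketch of insert-size-based indel signaling from paired-end reads.
# EXPECTED_INSERT and TOLERANCE are assumed library parameters, not values
# taken from any published tool.

EXPECTED_INSERT = 400   # assumed mean library fragment length (bp)
TOLERANCE = 50          # assumed cutoff around the mean

def classify_pair(left_pos, right_pos, read_len=100,
                  expected=EXPECTED_INSERT, tol=TOLERANCE):
    """Classify one read pair by the span of its two mapped ends.

    A span much longer than the expected insert suggests a deletion in
    the donor (reference sequence absent from the donor); a much shorter
    span suggests an insertion in the donor.
    """
    span = (right_pos + read_len) - left_pos
    if span > expected + tol:
        return "deletion"      # donor is missing reference sequence
    if span < expected - tol:
        return "insertion"     # donor carries extra sequence
    return "concordant"

print(classify_pair(1000, 1800))  # span 900 vs expected 400 -> "deletion"
print(classify_pair(1000, 1150))  # span 250 vs expected 400 -> "insertion"
print(classify_pair(1000, 1300))  # span 400 -> "concordant"
```

As the article notes, this classification is only as good as the mapping loci: if a read pair maps to the wrong repeat copy, the span, and hence the variant call, is wrong.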
During the last three years, researchers have developed clustering techniques to address the limitations of these older methods. Newer tools like VariationHunter, which analyzes mobile-element insertions, produce multiple mappings of paired-end reads to a reference genome. This soft clustering technique is optimized to detect insertion and deletion polymorphisms in repetitive regions of the genome. However, if investigators then wish to identify structural variations common to many human genomes, they must take the additional, time-consuming step of merging the predicted structural variations to determine whether or not two or more donor genomes confirm them.
Sahinalp and his main collaborator, the University of Washington's Evan Eichler, are attempting to reinvent the current approach to structural variant discovery. In a Genome Research paper published in December, Sahinalp and Eichler presented an innovative package of algorithms, collectively called CommonLAW (Common Loci structural Alteration discovery Widget), that enables researchers to simultaneously predict structural variation in multiple donor genomes.
CommonLAW differs from other structural variant analysis tools in that it assigns high priority to clusters supported by reads from two or more donor genomes, and low priority to clusters supported by reads from just one. "If a potential cluster that has reads from two donor genomes becomes an actual cluster, this cluster will imply a structural variant between the two donors and the reference. However, it will imply no variant between the two donors themselves," Sahinalp says. "In short, CommonLAW is unique in the mathematical formulation it uses for the second phase of soft clustering, which assesses each cluster to make some of the 'soft' clusters 'hard' clusters."
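The idea of promoting soft clusters to hard clusters based on multi-donor support can be sketched roughly as follows. This is an illustration only, not CommonLAW's actual mathematical formulation; the greedy ranking, the support threshold, and the data layout are all assumptions.

```python
# Illustrative sketch (not CommonLAW's actual formulation) of hardening
# soft clusters while favoring loci supported by multiple donor genomes.
# Each soft cluster lists supporting reads as (donor, read_id) pairs; a
# read may appear in several soft clusters (its multiple mappings) but is
# assigned to at most one hard cluster.

def harden_clusters(soft_clusters, min_support=2):
    # Rank clusters first by number of distinct supporting donors,
    # then by total read support.
    ranked = sorted(
        soft_clusters,
        key=lambda c: (len({donor for donor, _ in c["reads"]}),
                       len(c["reads"])),
        reverse=True)
    assigned = set()   # reads already committed to a hard cluster
    hard = []
    for cluster in ranked:
        reads = [r for r in cluster["reads"] if r not in assigned]
        if len(reads) >= min_support:   # assumed support threshold
            hard.append({"locus": cluster["locus"], "reads": reads})
            assigned.update(reads)
    return hard

soft = [
    {"locus": "chr1:10500",
     "reads": [("donorA", "r1"), ("donorB", "r7"), ("donorA", "r2")]},
    {"locus": "chr1:99000",
     "reads": [("donorA", "r1"), ("donorA", "r3")]},
]
# The two-donor locus wins; the single-donor locus loses its shared read
# and falls below the threshold.
print([c["locus"] for c in harden_clusters(soft)])  # ['chr1:10500']
```

The point of the sketch is the ranking step: by resolving all donors jointly, a locus confirmed by two genomes outcompetes an ambiguous single-donor mapping, which is exactly the kind of prioritization the two-step merge-after-the-fact approach cannot perform.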
In their paper, the team compared CommonLAW's performance to its predecessors' in analyzing the genomes of a Yoruban family living in Nigeria. The whole-genome shotgun sequences of the Yoruban genomes were downloaded from the US National Center for Biotechnology Information database and compared to the human reference genome. After a careful inspection of VariationHunter's clusters on the trio, all 410 novel Alu inserts turned out to be false positives.
Using CommonLAW's algorithms, the number of de novo Alu insertions was reduced to zero among the top 3,000 insertions predicted by VariationHunter. When it came to predicting deletions, the numbers produced by the older, two-step VariationHunter approach were similarly inflated among the top 15,000 predicted deletion loci. Whereas the two-step approach predicted that 111 of the 15,000 deletions were de novo events, the CommonLAW method predicted an average of roughly 37.
Moving forward, Sahinalp says he intends to continue working with Eichler to scale CommonLAW, adding more algorithms to handle larger data sets and more tumor samples.
"CommonLAW is good for trios, but for large cohorts there are a lot of technical issues and it's not as successful as we would want it to be," he says. "We are also dealing with tissue samples from the same individual over time through single-cell sequencing and we'd like to know if there are differences, such as how a tumor evolved not only spatially, but over time. These are huge challenges, and, to be honest, there are no available algorithms to find all those structural differences accurately."