Researchers at St. Jude Children’s Research Hospital have developed an algorithm to detect structural variations in next-generation sequence data that they say improves on current methods used to find these aberrations.
Clipping Reveals Structure, or CREST, was developed by investigators involved in the St. Jude Children’s Research Hospital – Washington University Pediatric Cancer Genome Project and is being used to identify chromosomal rearrangements and DNA insertions or deletions that are unique to cancer.
In a recent paper published in Nature Methods, the researchers presented the results of a study in which they applied the algorithm to datasets from the bone marrow of cancer patients at St. Jude, as well as a skin cancer cell line.
The algorithm is able to pinpoint the exact breakpoints of structural variants at the nucleotide level — a capability that the developers claim is currently not possible with other structural variant-detection tools, such as BreakDancer and Geometric Analysis of Structural Variants.
The researchers explained in the paper that current methods for identifying structural variants with NGS data map reads from two ends of a sequence fragment to a reference human genome and use “discordances in distance, orientation, and/or mapping status” to pinpoint these abnormalities.
However, they noted, these methods “infer the approximate genomic locations of [a structural variation] but do not pinpoint their exact breakpoint at the nucleotide level.”
Furthermore, these methods “tend to generate a high frequency of false positives when applied to experimental data because of the presence of PCR and/or sequencing artifacts and the inherent difficulty of accurately mapping sequences in repetitive regions,” they wrote.
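To make the paired-end approach concrete, here is a minimal Python sketch of the kind of discordance check the paper describes. The function name, the tuple layout, and the 200 to 600 bp expected span are illustrative assumptions, not values taken from the paper or from any specific tool.

```python
def classify_pair(mate1, mate2, min_span=200, max_span=600):
    """Label a read pair as discordant by mapping status, distance, or orientation.

    Each mate is a hypothetical tuple: (chrom, pos, is_forward, is_mapped).
    The span thresholds are assumed values for a typical short-insert library.
    Returns a set of labels; an empty set means the pair looks concordant.
    """
    labels = set()
    # Mapping status: one or both mates failed to align.
    if not (mate1[3] and mate2[3]):
        labels.add("mapping_status")
        return labels
    # Mates on different chromosomes cannot have a sensible span.
    if mate1[0] != mate2[0]:
        labels.add("distance")
        return labels
    left, right = sorted([mate1, mate2], key=lambda m: m[1])
    # Simplification: span is measured between the two start positions.
    span = right[1] - left[1]
    if not (min_span <= span <= max_span):
        labels.add("distance")
    # Expected layout here: left mate forward, right mate reverse.
    if not (left[2] and not right[2]):
        labels.add("orientation")
    return labels
```

A pair flagged this way only suggests that a rearrangement lies somewhere between or near the two mates, which is why such methods yield approximate rather than exact breakpoints.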
CREST provides an alternative approach that is based on “directly mapping SV breakpoints at the nucleotide level” without “relying on the discordant mapping of paired-end reads,” they wrote.
The key to CREST is “soft clips,” the portions of a sequencing read that don’t align to the reference human genome during the mapping process and are therefore masked from the alignment. CREST’s developers determined that as long as the read length from an NGS instrument is greater than 75 base pairs, “these soft-clipped subsequences can be of sufficient length to map unambiguously to a different genomic location, thus identifying the second breakpoint for a putative SV.”
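In alignment files that follow the SAM format, soft clipping is recorded with the `S` operation in a read's CIGAR string, and the clipped bases remain in the read sequence. The following Python sketch shows one way to pull those bases out; the function name and the `min_len` parameter are assumptions made for illustration, and this is not CREST's own code.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clips(cigar, seq, min_len=1):
    """Return (side, clipped_bases) for each soft-clipped end of a read.

    cigar and seq are the CIGAR string and read sequence of one alignment.
    min_len is a hypothetical cutoff for ignoring very short clips.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard-clipped bases (H) are absent from seq, so skip those operations.
    ops = [(n, op) for n, op in ops if op != "H"]
    clips = []
    if ops and ops[0][1] == "S" and ops[0][0] >= min_len:
        clips.append(("left", seq[:ops[0][0]]))
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_len:
        clips.append(("right", seq[-ops[-1][0]:]))
    return clips
```

For a 100-base read with CIGAR `70M30S`, the function returns the final 30 bases as a right-side clip. Those are the bases that could then be mapped elsewhere in the genome to find the second breakpoint.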
CREST uses the “soft-clipping signature present in reads that straddle a structural variation breakpoint while other algorithms use discordant mapping for structural variation detection,” Jinghui Zhang, an associate member of St. Jude’s department of computational biology and the study’s senior author, explained to BioInform in an e-mail.
The soft-clipping signature, coupled with the longer read lengths generated by current next-generation sequencing technologies, “allow us to directly map the exact SV breakpoints,” Zhang said. “By contrast, paired-end mapping can only infer approximate SV breakpoints and the false positive rate can be high due to artifacts in library construction or repetitive regions.”
CREST has also been optimized to identify “somatic structural variations in cancer genomes in comparisons between a tumor genome with its matching normal,” she said.
CREST is the first tool to use soft clips to identify fusion proteins, which are hybrid proteins that are made when genomic rearrangements fuse pieces of two genes, Zhang said in a statement. Fusion proteins can disrupt normal cellular controls and lead to the unchecked cell division that marks cancer.
She explained that the need for new ways to identify the genomic variations that lead to cancer became clear shortly after the St. Jude – Washington University genome project began in 2010.
During the project, which was launched to sequence and compare the complete normal and cancer genomes of 600 young patients, her team manually detected a chromosomal rearrangement involving a known cancer gene that existing tools failed to detect.
Zhang further noted that existing methods miss 60 percent to 70 percent of structural rearrangements in tumors.
A Good Clip
As a first step, CREST collects all soft-clipped reads from an NGS dataset and uses them to define possible structural variation breakpoints.
Each of these is considered to be the first breakpoint of a potential SV. The corresponding second breakpoint is identified by applying an “assembly-mapping-searching-assembly-alignment procedure,” the research team wrote. “If the alignment shows high identity between the second contig and the first breakpoint, then the two breakpoints are considered to form a putative SV.”
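A rough Python sketch of that first step follows: it converts each soft-clipped alignment into the reference coordinate where the clip begins and counts how many reads support each site. The input layout, the function names, and the `min_support` threshold are assumptions for illustration; the article does not say how many supporting reads CREST requires.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference

def clip_positions(chrom, pos, cigar):
    """Return (chrom, position, side) for each soft-clipped end of one alignment.

    pos is the 1-based leftmost aligned reference position, as in SAM.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    sites = []
    if ops and ops[0][1] == "S":
        # Left clip: the breakpoint sits at the first aligned base.
        sites.append((chrom, pos, "left"))
    if len(ops) > 1 and ops[-1][1] == "S":
        # Right clip: the breakpoint sits at the last aligned base.
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        sites.append((chrom, pos + ref_len - 1, "right"))
    return sites

def candidate_breakpoints(alignments, min_support=2):
    """Count soft-clipped reads per site and keep sites with enough support.

    alignments is an iterable of (chrom, pos, cigar) tuples.
    min_support is a hypothetical threshold, not a CREST default.
    """
    support = Counter()
    for chrom, pos, cigar in alignments:
        support.update(clip_positions(chrom, pos, cigar))
    return {site: n for site, n in support.items() if n >= min_support}
```

Two reads with CIGARs `60M40S` at position 1000 and `50M50S` at position 1010 both end their aligned portion at base 1059, so together they would nominate that base as one candidate breakpoint.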
Next, these SVs are filtered to remove false positives. The investigators explained in the paper that for each variant, “the distance between the second contig to the first breakpoint is required to be within a short distance” (a default of 15 base pairs) “to ensure it maps back to the first breakpoint.”
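The distance filter can be sketched in a few lines of Python. Only the 15 base pair default comes from the paper; the function names and the dictionary keys `bp1` and `contig2_hit` are hypothetical, and the real tool applies this check as part of a larger pipeline.

```python
def maps_back(first_breakpoint, contig_hit, max_distance=15):
    """Check that the second contig aligns close to the first breakpoint.

    Both arguments are (chrom, pos) tuples. max_distance defaults to the
    15 bp value quoted from the paper.
    """
    same_chrom = first_breakpoint[0] == contig_hit[0]
    return same_chrom and abs(first_breakpoint[1] - contig_hit[1]) <= max_distance

def filter_putative_svs(candidates, max_distance=15):
    """Keep candidates whose second contig maps back to the first breakpoint.

    Each candidate is a dict with assumed keys "bp1" and "contig2_hit".
    """
    return [c for c in candidates
            if maps_back(c["bp1"], c["contig2_hit"], max_distance)]
```

A candidate whose second contig lands 9 bp from the first breakpoint would pass, while one landing 250 bp away or on another chromosome would be discarded as a likely false positive.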
CREST exports three output files: a report file that records the breakpoints of SVs at base-pair resolution, the number of soft-clipped reads, and the genes located across the breakpoints; a template file for experimental validation that contains the 1,000 nucleotides flanking each breakpoint; and an Extensible Markup Language (XML) file that displays the assembled contigs.
According to the Nature Methods paper, when CREST was applied to whole-genome sequencing data from five pediatric T-lineage acute lymphoblastic leukemias and a human melanoma cell line, it was able to identify 160 somatic structural variations.
Specifically, the team used CREST to identify 110 new structural differences in the cancer genomes of five St. Jude patients with T-ALL, of which Sanger sequencing confirmed 89. The remaining 21 were false positives.
CREST was also tested on a published whole-genome sequencing dataset from a melanoma cell line that reported 37 validated SVs. It identified 76 variations, including 26 of the previously reported 37.
The software also identified 50 new variations. Out of 20 of those selected for Sanger sequencing, 18 were validated.
The researchers also compared CREST’s performance on the T-ALL dataset to BreakDancer, which relies on a paired-end discordance mapping algorithm. They reported that BreakDancer identified only 27 of the 89 validated variations that CREST identified. They further noted that the additional 1,037 variations BreakDancer identified were later shown to be false positives.
The team also ran the dataset through a second mapping algorithm that successfully detected 76 of the validated structural variations. A third program, dubbed Pindel, which uses unmapped reads across insertion-deletion breakpoints, detected only five of the 89 validated SVs found by CREST.
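The counts reported above translate into rates with simple arithmetic. The short Python sketch below does nothing more than divide the figures quoted in this article; the variable names are chosen here for illustration, and the percentages are derived rather than taken from the paper.

```python
def pct(part, whole):
    """Percentage of whole represented by part, rounded to one decimal."""
    return round(100.0 * part / whole, 1)

# T-ALL dataset: 110 CREST calls, 89 confirmed by Sanger sequencing.
t_all_confirmed = pct(89, 110)       # 80.9
t_all_false_positives = 110 - 89     # 21, matching the article

# Share of the 89 validated SVs recovered by the other tools.
breakdancer_recovered = pct(27, 89)  # 30.3
second_tool_recovered = pct(76, 89)  # 85.4
pindel_recovered = pct(5, 89)        # 5.6

# Melanoma cell line: 76 CREST calls, 26 of 37 published SVs among them.
melanoma_published_found = pct(26, 37)  # 70.3
melanoma_new_calls = 76 - 26            # 50, matching the article
melanoma_new_confirmed = pct(18, 20)    # 90.0 of the 20 tested

print(t_all_confirmed, breakdancer_recovered, second_tool_recovered, pindel_recovered)
```

Note that the 90 percent figure applies only to the 20 new melanoma calls selected for Sanger sequencing, not to all 50.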
As a next step, Zhang’s team is working to improve CREST’s performance for RNA-seq analysis.
In addition, the investigators are working on “incorporating unmapped reads along with soft-clipped reads for structural variation analysis” as well as on an automated annotation tool to assess the effect of these variants, she said.