NEW YORK (GenomeWeb News) – Researchers involved with the St. Jude Children's Research Hospital - Washington University Pediatric Cancer Genome Project have developed a new computational tool for finding structural variation down to the nucleotide level in cancer genomes.
The team described the algorithm, dubbed Clipping Reveals Structure (CREST), in the early online edition of Nature Methods this weekend. Using CREST, they tracked down nearly 90 structural changes in tumor samples from five children with a type of acute lymphoblastic leukemia known as T-lineage ALL. By recovering 26 known structural variants and uncovering 50 previously undetected ones in a previously sequenced melanoma cell line, the researchers demonstrated that their method could detect structural variants in adult cancers as well.
The team recognized the need for the tool while analyzing data generated for the pediatric cancer genome effort, which launched early last year, senior author Jinghui Zhang, a computational biologist at St. Jude, told GenomeWeb Daily News.
At the time, she explained, researchers were using several strategies to look for structural changes in the genomes and working to come up with programs to try to improve the accuracy of existing methods. In the process, though, they found clues that these approaches might overlook some key variants in the genomes.
"When we were analyzing one of the T-ALL samples, we actually stumbled upon a very important cancer gene," Zhang said. "Just by accident we found that there's structural variation there that's missed by existing programs.
"I realized that we need to develop our own algorithm," she added, "in order to make sure that we don't miss a lot of real hits in cancer genomes."
The resulting CREST algorithm represents the first time that so-called soft clipping signatures have been used to find structural variants, Zhang said. These soft clips are sections of some sequence reads that don't align to the human reference genome. Rather than masking these bits of DNA, she explained, the CREST algorithm uses soft clip information to help find structural changes.
"Soft clips can arise for various reasons," Zhang said. "It can be poor quality reads and it could be just some artifacts … But we found out that structural variations also have this signal."
Depending on where these soft clips turn up in the genome, and on the orientation of the breakpoints they contain, the study authors explained, soft clips provide information on everything from small insertions and deletions to inter- and intra-chromosomal translocations and sequence inversions.
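To make the idea of a soft-clip signature concrete, here is a minimal Python sketch of how soft-clipped reads might be turned into candidate breakpoint positions. It is not the CREST implementation. The function names, the (chromosome, position, CIGAR) read format, and the `min_clip` and `min_support` thresholds are illustrative assumptions. It relies only on the standard SAM convention that an `S` operation in a CIGAR string marks read bases left out of the alignment.

```python
import re
from collections import Counter

# SAM CIGAR operations; M, D, N, = and X consume reference bases.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")


def soft_clip_breakpoints(pos, cigar, min_clip=10):
    """Return (reference_position, clipped_side) pairs for one alignment.

    pos is the 1-based leftmost aligned reference position, as in SAM.
    min_clip is a hypothetical threshold for ignoring very short clips.
    """
    # Hard clips (H) carry no read bases, so drop them before inspecting the ends.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    found = []
    if ops and ops[0][1] == "S" and ops[0][0] >= min_clip:
        # Clip at the start of the read: breakpoint at the first aligned base.
        found.append((pos, "left"))
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_clip:
        # Clip at the end of the read: breakpoint at the last aligned base.
        found.append((pos + ref_len - 1, "right"))
    return found


def candidate_sites(reads, min_support=2, min_clip=10):
    """Count soft-clipped reads per (chrom, position, side) and keep supported sites."""
    support = Counter()
    for chrom, pos, cigar in reads:
        for bp, side in soft_clip_breakpoints(pos, cigar, min_clip):
            support[(chrom, bp, side)] += 1
    return {site: n for site, n in support.items() if n >= min_support}


# Toy alignments (made up for illustration): two reads clipped at the same base.
reads = [
    ("chr1", 1000, "60M40S"),
    ("chr1", 1010, "50M50S"),
    ("chr1", 1060, "30S70M"),
    ("chr1", 500, "100M"),
    ("chr1", 980, "95M5S"),
]
sites = candidate_sites(reads)
print(sites)
```

In this toy input, only the position where two reads share the same clipped end passes the support threshold. A real caller would still need to work out where the clipped bases belong in order to classify the event as an insertion, deletion, translocation, or inversion.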
For instance, when the researchers used CREST to assess genome sequence data for five T-ALL tumor-normal sample pairs, sequenced for the Pediatric Cancer Genome Project using paired-end sequencing on the Illumina GAIIx, they homed in on 110 apparent structural variations in the tumor genomes.
The team's subsequent PCR amplification and Sanger sequencing of 107 of these regions suggested that 89 of them — 82 percent of sites tested — were authentic structural variants. These included 31 inter-chromosomal translocations, 19 intra-chromosomal translocations, 22 insertions, 16 deletions, and a single inversion.
By contrast, the researchers reported, analysis of the same sequences with the BreakDancer algorithm, which relies on discordant mapping information, identified more than a thousand potential structural variants overall but only 27 of the 89 CREST-detected and validated structural changes, or about 30 percent.
Meanwhile, CREST assessment of a melanoma cell line known as COLO-829, which had previously been sequenced and analyzed by a Wellcome Trust Sanger Institute-led team, uncovered 26 of 37 known structural variations in the melanoma genome, along with 50 structural variants not found before.
"The reason that we incorporated the melanoma data is to make sure that [CREST] does not only run … on the pediatric cancer, but also works for adult cancer," Zhang said.
The strategy is compatible with any sequencing platform, she added, provided relatively long reads — at least 75 base pairs — are available.
Simulated whole-genome sequence data based on information from the 1000 Genomes Project suggests the new algorithm has a false negative rate between 22 and 27 percent and a false positive rate of around three percent.
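For readers curious how such rates are typically derived from a simulation, the following Python sketch compares a set of simulated ("truth") breakpoints against a set of called breakpoints. The article does not describe the authors' matching procedure or exact definitions. The 5-base tolerance, the greedy one-to-one matching, and the choice to report spurious calls as a share of all calls are therefore assumptions made purely for illustration, and the data are invented.

```python
def match_calls(truth, calls, tolerance=5):
    """Greedily match each true breakpoint to at most one call on the same
    chromosome within `tolerance` bases. Returns (hits, unmatched_calls)."""
    unmatched = list(calls)
    hits = 0
    for chrom, pos in truth:
        for call in unmatched:
            if call[0] == chrom and abs(call[1] - pos) <= tolerance:
                unmatched.remove(call)  # each call may support only one truth site
                hits += 1
                break
    return hits, len(unmatched)


def error_rates(truth, calls, tolerance=5):
    """Return (false_negative_rate, spurious_call_share).

    false_negative_rate: fraction of simulated variants that were not called.
    spurious_call_share: fraction of calls matching no simulated variant
    (one common reading of "false positive rate" for variant callers).
    """
    if not truth or not calls:
        raise ValueError("truth and calls must both be non-empty")
    hits, spurious = match_calls(truth, calls, tolerance)
    return (len(truth) - hits) / len(truth), spurious / len(calls)


# Invented toy data: four simulated breakpoints, four calls.
truth = [("chr1", 1000), ("chr1", 5000), ("chr2", 300), ("chr3", 42)]
calls = [("chr1", 1002), ("chr2", 300), ("chr3", 40), ("chr5", 9999)]
fn_rate, spurious_share = error_rates(truth, calls)
print(f"false negative rate: {fn_rate:.0%}, spurious calls: {spurious_share:.0%}")
```

The two numbers answer different questions: the first measures how many real variants a caller misses, and the second measures how much of its output is noise.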
Such estimates, particularly the false negative rate, may be on the high side, Zhang said, since the researchers have found clues that germline tissue actually houses more variants in segmentally duplicated regions than somatic tissue does.
Consequently, she said, researchers are now doing follow-up studies exploring the use of CREST for finding structural variation in germline genomes. Researchers at Washington University and elsewhere are also using CREST to look for structural variants in both pediatric and adult cancer genomes.