Researchers at Life Technologies and the Indiana University School of Medicine have published an informatics approach for detecting cancer gene fusions in RNA sequencing data from SOLiD sequencers.
The approach includes a new algorithm, dubbed Suffix Array Spliced Read, or SASR, which was designed to detect reads that span fusion junctions. The new algorithm has been included in Life Tech's LifeScope software for the 5500 Series SOLiD system (BI 5/27/2011).
The method and its applications are discussed in an article published in a recent issue of PLoS Computational Biology.
The SASR algorithm provides "an unbiased approach to gene fusion and splicing discovery," which "greatly increase[s] the sensitivity and specificity of junction detection," Milan Radovich, an assistant research professor of surgery at IU's School of Medicine and a co-author on the study, said in a statement.
The method also helps researchers "harness more data from their RNA-sequencing projects" than current methods can, he added.
Radovich, whose research focuses on using next-generation sequencing to help identify new drug targets that can treat triple-negative breast cancer, inflammatory breast cancer, and thymic malignancies, told BioInform this week that the IU researchers tested SASR on their breast cancer datasets and provided technical feedback to Life Tech's researchers as part of the algorithm's development process.
Radovich said that he and his colleagues have used the algorithm to locate splice junctions and fusion genes in triple-negative breast cancer samples. In addition to developing better and more personalized disease therapies, the group hopes to identify biomarkers for early detection, he added.
Although this particular pipeline is meant for SOLiD's color-space reads, Life Tech's informatics team is working on a version of the SASR algorithm that will call gene fusions in other kinds of reads, including those from Life Tech's Ion Torrent sequencers, Onur Sakarya, a bioinformatics scientist at Ion Torrent and a co-author on the paper, told BioInform this week.
"It will be a challenge for us to do RNA-seq with longer reads and also the higher throughput," generated by Ion Torrent sequencers, so "we might need some new algorithms," he said.
Under the Hood
Life Tech's Sakarya explained that LifeScope's approach independently analyzes spliced single reads, which span exon junctions, and paired-end reads, which bridge exon junctions, making it possible to detect fusions from either single-fragment or paired-end RNA-seq experiments.
The paper compares this approach to the one adopted by tools like FusionSeq and deFuse, which use only paired-end alignments as "initial evidence" and then "apply spliced read mapping on the candidate regions."
This approach generates "a lot of false positives," which make it difficult to locate the true gene fusions, Sakarya said.
"There are about 200,000 exons in the human genome and you can have any combination of the two exons in principle to give you a fusion area," he explained. Since "you cannot try all the possible combinations ... we wanted to make something more specific."
In the PLoS Computational Biology paper, Sakarya et al. explain that they prepared paired-end RNA libraries from the Universal Human Reference, the Human Brain Reference, and the MCF-7 breast cancer cell line, and then mapped these reads to reference sequences using an approach in which each pair of reads was mapped to "genome, junction, exon and filter references and paired with a pairing quality value."
Next, the team used a "bridge and span" approach to locate splice junctions in their data. Specifically, they used "bridge evidence found by paired-end reads in which the forward [and the reverse] read[s] map on [separate] exon[s]; Span evidence found by single reads of paired-end reads in which the read alignment spans the breakpoint of known and putative splice junctions; and fusion span evidence found by fusion alignments spanning [the] hypothetical breakpoints of two exons discovered using the SASR aligner, which assesses all exon-exon combinations in the genome," the paper explains.
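As a rough illustration of the first two evidence types, the Python sketch below labels a read pair as bridge or span evidence based on the exons each mate aligns to. The `Read` class, the `classify_pair` function, and the exon identifiers are hypothetical, and the logic is a deliberate simplification rather than LifeScope's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Read:
    # Exon IDs that the read's alignment touches, in transcript order.
    # One ID means the read sits inside a single exon; two or more mean
    # the alignment crosses an exon-exon junction. (Hypothetical model.)
    exons: Tuple[str, ...]


def classify_pair(fwd: Read, rev: Read) -> List[Tuple[str, str, str]]:
    """Return (kind, exon_a, exon_b) evidence tuples for one read pair."""
    evidence = []
    # Span evidence: a single read whose alignment crosses a junction.
    for read in (fwd, rev):
        for a, b in zip(read.exons, read.exons[1:]):
            evidence.append(("span", a, b))
    # Bridge evidence: each mate lies within one exon, and the two exons
    # differ, so the unsequenced insert must cross the junction.
    if len(fwd.exons) == 1 and len(rev.exons) == 1 and fwd.exons[0] != rev.exons[0]:
        evidence.append(("bridge", fwd.exons[0], rev.exons[0]))
    return evidence


bridge_example = classify_pair(Read(("exon1",)), Read(("exon2",)))
span_example = classify_pair(Read(("exon1", "exon2")), Read(("exon2",)))
print(bridge_example)
print(span_example)
```

In this toy model a junction supported by both kinds of evidence would simply accumulate tuples of each kind across many read pairs; how the paper weighs the two is not described here.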
The combination of these three tactics helped the investigators identify 133,000 known and 15,315 putative splice junctions and between 5 and 56 candidate fusion breakpoints.
By way of comparison, the team contrasted its candidate splice junction calls for the MCF-7 cell line with those provided by TopHat. Among other results, LifeScope detected 123,423 known RefSeq junctions, as well as 15,074 potentially novel junctions that weren't called by TopHat. The team determined that "more than half" of these LifeScope-only calls were true positives based on a metric called the Junction Confidence Value. TopHat, meanwhile, detected 106,692 known RefSeq junctions and 49,586 novel junctions, some of which could be false positives, the researchers note in the paper.
These results suggest that "both of these tools may be used at the same time if you want to do a complete study of your sample because each of them might be missing some calls," Sakarya said.
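A comparison of this kind comes down to set operations on junction calls. The Python sketch below shows one way to split two tools' calls into shared and tool-specific groups. The `compare_calls` function, the tuple encoding of a junction, and the toy coordinates are all assumptions made for illustration and are not drawn from the paper or from either tool's output format.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical junction encoding: (chromosome, donor position, acceptor position).
Junction = Tuple[str, int, int]


def compare_calls(
    calls_a: Iterable[Junction],
    calls_b: Iterable[Junction],
    known: Iterable[Junction],
) -> Dict[str, Set[Junction]]:
    """Partition two call sets into shared, tool-specific, and novel groups."""
    a, b, k = set(calls_a), set(calls_b), set(known)
    return {
        "shared": a & b,
        "only_a": a - b,
        "only_b": b - a,
        # Tool-specific calls that are absent from the known annotation.
        "only_a_novel": (a - b) - k,
        "only_b_novel": (b - a) - k,
    }


# Toy data, not real junction coordinates.
known = {("chr1", 100, 200), ("chr1", 300, 400)}
tool_a = {("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 90)}
tool_b = {("chr1", 100, 200), ("chr3", 10, 70)}

result = compare_calls(tool_a, tool_b, known)
for group, junctions in result.items():
    print(group, sorted(junctions))
```

Taking the union of the two call sets, as Sakarya suggests, would recover junctions that either tool misses on its own, at the cost of inheriting both tools' false positives.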
LifeScope's approach identified 40 gene fusions in the MCF-7 breast cancer cell line, and the researchers reported that they were able to validate 36 of those gene fusions using TaqMan assays.
By comparison, FusionSeq identified only six of the 40 gene fusions identified by LifeScope, although the authors note that FusionSeq does not handle color-space reads or "data with different read length pairs."
Sakarya told BioInform that the team should be able to do a fairer comparison with tools like FusionSeq once it develops the base-space version of the SASR algorithm.
The team also reported that two MCF-7 gene fusions — an intra-chromosomal gene fusion involving the estrogen receptor alpha gene ESR1 and another one involving ribosomal protein S6 kinase beta-1, or RPS6KB1 — were recurrently expressed in several breast tumor cell lines and a clinical tumor sample.
The presence of the same gene fusion in multiple samples could indicate that it’s a causative mutation for the disease, Sakarya said, although in this case, he noted, the researchers suspect that these particular mutations may be unique to the cell lines used in the analysis and not indicative of breast tumors.
LifeScope's approach provides "a good tool" for detecting fusions in SOLiD data, Huanying “Gary” Ge, a scientist in Amgen's Genome Analysis Unit, commented in an e-mail to BioInform this week.
Ge was part of a team that published a method not discussed in Life Tech's paper, dubbed FusionMap, which does not rely on paired-end reads to call gene fusions.
FusionMap, which also does not handle color-space reads, "aligns junction-spanning reads directly to the genome without prior knowledge of potential fusion regions and detects fusion events at base-pair resolution," Ge explained.
He suggested that both methods appear to handle paired-end and single-fragment reads separately; however, rather than using suffix arrays, FusionMap detects fusion junctions by "[splitting] each candidate fusion junction-spanning 'seed' read and [aligning] them to the reference genome directly with [or] without the help of gene model annotations ... it then identifies possible fusion junction(s) based on the consensus of mapped fusion positions from seed reads," he explained.
Further commenting on LifeScope's approach, Ge noted that because SASR's "search space contains suffixes from all exons rather than exons in potential fusion regions that can be inferred from discordant read pairs ... I am not sure how fast and simple the search is."
In addition, because the approach only identifies "fusion junctions connecting two known exons," it is unable to "detect fusion event[s] [that occur] in the middle of exons," he pointed out.
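Ge's two observations can be illustrated with a toy suffix-array search over exon sequences. The Python sketch below is not the published SASR algorithm, whose internals are not described in this article. It works in base space rather than color space, stores suffixes explicitly (which would not scale to a genome), and uses hypothetical names throughout (`build_suffix_array`, `find_prefix_range`, `fusion_span_hits`, `min_anchor`) along with made-up exon sequences.

```python
import bisect
from typing import Dict, List, Tuple

Entry = Tuple[str, str, int]  # (suffix, exon ID, offset of suffix within exon)


def build_suffix_array(exons: Dict[str, str]) -> List[Entry]:
    """Sorted list of every suffix of every exon (toy version, O(n^2) memory)."""
    entries = [(seq[i:], exon_id, i)
               for exon_id, seq in exons.items()
               for i in range(len(seq))]
    entries.sort()
    return entries


def find_prefix_range(sa: List[Entry], pattern: str) -> List[Entry]:
    """Binary-search all entries whose suffix starts with `pattern`."""
    lo = bisect.bisect_left(sa, (pattern,))
    # "\uffff" sorts after any nucleotide character, closing the range.
    hi = bisect.bisect_left(sa, (pattern + "\uffff",))
    return sa[lo:hi]


def fusion_span_hits(sa: List[Entry], read: str, min_anchor: int = 4):
    """Find splits where the read's left part ends one exon and its
    right part begins a different exon. Returns (left, right, split)."""
    hits = []
    for k in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:k], read[k:]
        # Exons that END with `left`: a suffix exactly equal to `left`.
        ends_with = [e for s, e, _ in find_prefix_range(sa, left) if s == left]
        # Exons that START with `right`: matching suffix at offset 0.
        starts_with = [e for _, e, off in find_prefix_range(sa, right) if off == 0]
        hits.extend((a, b, k) for a in ends_with for b in starts_with if a != b)
    return hits


# Made-up exon sequences for illustration only.
exons = {"A1": "GGCATTACGT", "B1": "TTGACCAGGA"}
sa = build_suffix_array(exons)

boundary_read = "TACGT" + "TTGAC"   # end of A1 joined to start of B1
mid_exon_read = "CATTA" + "GACCA"   # breakpoints inside both exons

print(fusion_span_hits(sa, boundary_read))  # [('A1', 'B1', 5)]
print(fusion_span_hits(sa, mid_exon_read))  # []
```

The second read returns no hits because neither half aligns to an exon boundary, which mirrors the limitation Ge describes for any method that only joins known exon ends. The loop over every split point against suffixes from all exons also hints at why he questions the cost of the search.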