NEW YORK (GenomeWeb) – Researchers from Simon Fraser University in British Columbia and Indiana University in the US have developed DeeZ, software for compressing SAM and BAM files that they claim offers better compression ratios, requires less memory, and has a shorter run time than existing solutions for the task.
DeeZ, which was developed under the auspices of the Canadian Cancer Genome Collaboratory, helps reduce time and bandwidth bottlenecks associated with storing and transferring large files of NGS data, Cenk Sahinalp, one of the developers of the method and a professor at both Simon Fraser and Indiana Universities, told BioInform.
In a correspondence piece describing DeeZ published this week in Nature Methods, Sahinalp and colleagues wrote that their approach outperforms commonly used solutions like SAMtools, returning compression ratios that are "on par" with arithmetic coding (AC)-based compression programs such as Quip and Samcomp. Moreover, it requires less memory and has a shorter run time than the AC-based programs because, with DeeZ, users don't have to decompress the entire file in order to locate regions of interest.
DeeZ works by "lower[ing] the cost" of representing common differences between raw reads that map to a specific genomic locus and the reference genome, according to its developers. It does so by "obtaining the consensus of the reads mapped to a specific locus (implicitly assembling the donor genome by the use of mapping information) and encoding the differences between the consensus … contigs and the reference genome once," they explained in their correspondence.
"As there is no difference between the consensus contigs and the reads with the exception of mapping errors or highly allelic regions, DeeZ encodes the positional information of each read within only the relevant contig," they said. Furthermore, DeeZ uses "a unique compression method for each field of the SAM record in order to exploit its specific properties," they added.
The technical details of the approach are provided in the supplementary note that accompanies the Nature Methods piece. Summarizing these details, Sahinalp told BioInform that some methods encode reads based on differences between the individual reads and their corresponding location in the genome, but DeeZ tries to assemble all the reads that correspond to a particular genomic locus into contigs of relatively short length. The method then encodes the actual location of the assembled contig in the reference genome. Any differences between the contigs and the constituent reads — the result of mapping errors, for example — are encoded and stored separately.
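To make the idea concrete, here is a minimal Python sketch of consensus-based encoding as described above. It is not DeeZ's implementation: the function names (build_consensus, diff_against_reference, encode_reads, decode_reads), the majority-vote consensus, and the restriction to ungapped reads with 0-based start positions are all illustrative assumptions.

```python
from collections import Counter


def build_consensus(reads):
    """Majority base at every covered position.

    reads: list of (start, sequence) pairs; ungapped alignments and
    0-based starts are assumptions made for this sketch.
    """
    columns = {}
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns.setdefault(start + offset, Counter())[base] += 1
    return {pos: counts.most_common(1)[0][0] for pos, counts in columns.items()}


def diff_against_reference(reference, consensus):
    """Differences between consensus and reference, stored once per locus."""
    return {pos: base for pos, base in sorted(consensus.items())
            if reference[pos] != base}


def encode_reads(reads, consensus):
    """Store each read as (start, length, mismatches against the consensus)."""
    encoded = []
    for start, seq in reads:
        mismatches = [(i, base) for i, base in enumerate(seq)
                      if consensus[start + i] != base]
        encoded.append((start, len(seq), mismatches))
    return encoded


def decode_reads(reference, consensus_diffs, encoded):
    """Rebuild the reads from the reference, the shared diffs, and per-read mismatches."""
    reads = []
    for start, length, mismatches in encoded:
        bases = [consensus_diffs.get(pos, reference[pos])
                 for pos in range(start, start + length)]
        for i, base in mismatches:
            bases[i] = base
        reads.append((start, "".join(bases)))
    return reads
```

In this toy version, a variant shared by every read at a locus is recorded a single time in the consensus diff, and only reads that disagree with the consensus (a sequencing or mapping error, say) carry mismatch entries of their own.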
A second benefit of DeeZ is its "random-access capability," which involves "encoding the input SAM or BAM file in a block-by-block manner via AC for the quality scores and mapping locations and via Gzip for the other fields." The main advantage of this feature is that users don't need to decompress the entire SAM or BAM file before searching for parts of the genome they are interested in, according to the developers.
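The block-by-block idea can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not DeeZ's file format: zlib from the Python standard library stands in for the AC and Gzip codecs, each record is reduced to a (position, text) pair, and the names compress_blocks and query, along with the index layout, are hypothetical.

```python
import zlib


def compress_blocks(records, block_size=2):
    """Compress position-sorted (position, text) records in independent blocks.

    Returns the concatenated compressed payload and an index of
    (first_pos, last_pos, byte_offset, byte_length) tuples, one per block.
    """
    index = []
    payload = bytearray()
    for i in range(0, len(records), block_size):
        block = records[i:i + block_size]
        raw = "\n".join(f"{pos}\t{text}" for pos, text in block).encode()
        packed = zlib.compress(raw)
        index.append((block[0][0], block[-1][0], len(payload), len(packed)))
        payload += packed
    return bytes(payload), index


def query(payload, index, lo, hi):
    """Return records with lo <= position <= hi, decompressing only overlapping blocks."""
    hits = []
    for first, last, offset, length in index:
        if last < lo or first > hi:
            continue  # block lies outside the region, so it is never decompressed
        raw = zlib.decompress(payload[offset:offset + length]).decode()
        for line in raw.split("\n"):
            pos_str, text = line.split("\t", 1)
            pos = int(pos_str)
            if lo <= pos <= hi:
                hits.append((pos, text))
    return hits
```

Because every block is compressed on its own, a query for one region touches only the blocks whose position range overlaps it, which is why a viewer would not need to unpack the whole file.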
"It allows one to, for example, check out a compressed genome on a genome browser without … decompressing the whole file," Sahinalp said. "You can look at one genome segment at a time [or] check out specific mutations — whatever the mapping file provides." The upshot is that DeeZ needs less memory and has a shorter run time than some existing solutions, the researchers wrote.
The paper includes a table that shows the results of comparison tests that pitted DeeZ, run with various settings, against Gzip, SAMtools, Scramble, and reference and non-reference versions of both Quip and Samcomp on the basis of compression ratio and compression/decompression run time. The researchers used the tools to compress and decompress files of bacterial RNA-seq as well as human HiSeq and RNA-seq library data.
The results show that DeeZ mostly performs better than or on par with existing solutions. For example, considering a human genome sequenced at 40-50x coverage, DeeZ "can get more than 10x improvement in the memory footprint … in a few hours" compared to SAMtools, depending on the compute infrastructure used to run the solution, Sahinalp said.
The results also show that DeeZ had the fastest compression speed on the bacterial RNA-seq and human HiSeq data, but the slowest compression speed on the human RNA-seq data "because many reads of eukaryotic RNA-seq originated from splice junctions," according to the paper.
DeeZ also had better or comparable compression ratios on all datasets than every tool except Samcomp, which performed slightly better than DeeZ in this category. However, "Samcomp does not provide random-access ability and does not compress all fields in the SAM format," both of which are features DeeZ has. Finally, DeeZ's decompression times were comparable across the board with those of most programs, except for SAMtools, Gzip, and Scramble, which decompressed data faster.
Moving forward, the developers plan, among other things, to enable the software to work with other file formats, such as the one used by Pacific Biosciences for its sequencers, Sahinalp said.