This article has been updated from a version posted July 6 to include additional information from the developers.
The Translational Genomics Research Institute has launched a data-compression technique that can condense genomic data by as much as 80 percent, allowing researchers to store, analyze, and share large quantities of sequence data using less space and at lower costs than current methods, according to its developers.
Developed specifically for genomic sequence data, Genomic SQueeZ, or G-SQZ, is based on the Huffman coding algorithm, a method originally developed in the 1950s that uses shorter codes for the most frequently occurring pieces of information. The G-SQZ encoding method and software are described in a paper published today in Bioinformatics.
“There is so much interest in next-gen sequencing for obvious reasons,” Waibhav Tembe, the paper’s lead author and TGen’s senior computational scientist, told BioInform. “When you get hundreds of gigabytes of input files per run, how do you analyze that much data? How do you manage, transfer, and store all this data?”
Tembe said that the concept behind the method "is that you look at the frequency of the symbols," which, in this context, "are basically quality scores and bases." By assessing the frequency of each base-quality pair, one can then "come up with a bit level of binary representation" for these symbols, he said.
As an example, he explained that in a file containing 3 billion letters, a quality score of 20 might occur a hundred thousand times and a quality score of 30 could occur two hundred thousand times. “If it’s a high-frequency data, you would use a short binary code to represent the [data] and if it is a low-frequency data, you would use a longer binary code.”
He continued: “Instead of having a fixed number of bytes per base and quality pair, you would use a variable number of bytes. Effectively it would reduce the total number of bytes needed to encode the entire file.”
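The frequency-to-code idea Tembe describes is classic Huffman coding. As a rough illustration only (this is not TGen's implementation, and the reads and quality scores below are made up), a table of variable-length codes for (base, quality) pairs can be built like this:

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Return {symbol: bitstring}, giving frequent symbols shorter codes."""
    # Each heap entry is [total_frequency, [symbol, code], [symbol, code], ...]
    heap = [[f, [sym, ""]] for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)   # the two lightest subtrees
        hi = heapq.heappop(heap)
        for pair in lo[1:]:        # prepend a bit: "0" for one branch...
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:        # ...and "1" for the other
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {sym: code for sym, code in heap[0][1:]}

# Count (base, quality) pairs across a toy set of reads, then encode them.
reads = [("ACGT", [30, 30, 20, 30]), ("ACGA", [30, 30, 20, 30])]
freqs = Counter((b, q) for seq, quals in reads for b, q in zip(seq, quals))
codes = huffman_codes(freqs)
bitstream = "".join(codes[(b, q)]
                    for seq, quals in reads
                    for b, q in zip(seq, quals))
```

Because the code table is prefix-free, the concatenated bitstream decodes unambiguously back to the original (base, quality) sequence, and no symbol ever receives a longer code than a rarer one.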
According to Tembe, an additional feature of the method is that, unlike some other compression algorithms such as gzip and bzip2, it does not change the order in which the data appears.
In the paper, "G-SQZ outperformed gzip in almost all the cases and it came close to outperforming bzip,” Tembe said.
For example, G-SQZ was able to compress a 4.8 gigabyte Illumina file to 1.54 gigabytes, while gzip compressed the same file to 1.71 gigabytes and bzip2 compressed it to 1.44 gigabytes. For SOLiD data, G-SQZ condensed a 40.9 gigabyte file to 9.64 gigabytes, while gzip and bzip2 compressed the same file to 11.5 gigabytes and 9.35 gigabytes, respectively.
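As a quick sanity check, those figures work out to size reductions in the 64 to 77 percent range, consistent with the "as much as 80 percent" claim above. A short script using only the numbers quoted in this article:

```python
# (original GB, compressed GB) for each tool on the two files cited above
cases = {
    "Illumina / gzip":  (4.8, 1.71),
    "Illumina / G-SQZ": (4.8, 1.54),
    "Illumina / bzip2": (4.8, 1.44),
    "SOLiD / gzip":     (40.9, 11.5),
    "SOLiD / G-SQZ":    (40.9, 9.64),
    "SOLiD / bzip2":    (40.9, 9.35),
}
# Percent reduction = 100 * (1 - compressed / original), rounded to one decimal
reductions = {name: round(100 * (1 - after / before), 1)
              for name, (before, after) in cases.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct}% smaller")
```

On these inputs G-SQZ achieves roughly a 67.9 percent reduction for the Illumina file and 76.4 percent for the SOLiD file.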
Tembe noted that even though G-SQZ didn't outperform bzip2 in overall compression, “the thing to remember is that G-SQZ keeps the data order and allows selective content access, which is not something other algorithms might readily give you.”
Other algorithms "are based on optimizing compression only. G-SQZ sort of takes a suboptimal approach but at the same time makes the analysis piece easier,” he said.
In addition to analyzing the frequency of DNA nucleotides, TGen said G-SQZ can encode annotation information, including the data's quality, as well as erroneous entries, such as unidentified bases.
Since G-SQZ maintains the relative order of the data it compresses, the condensed data can be split into smaller "chunks" so that multiple processors can decode and analyze different parts of the same file simultaneously.
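The article doesn't detail G-SQZ's on-disk layout, but one common way an order-preserving format can support this kind of chunked decoding is to length-prefix each encoded record and keep a byte-offset index, so that any slice of records can be decoded without scanning the data before it. A hypothetical sketch, with toy byte strings standing in for encoded reads:

```python
import struct

def pack_records(records):
    """Length-prefix each encoded record and build a byte-offset index."""
    blob = bytearray()
    index = []                      # byte offset of each record in blob
    for rec in records:
        index.append(len(blob))
        blob += struct.pack(">I", len(rec)) + rec
    return bytes(blob), index

def decode_chunk(blob, index, start, stop):
    """Read records [start, stop) using only the index, independently of the rest."""
    out = []
    for off in index[start:stop]:
        (n,) = struct.unpack_from(">I", blob, off)
        out.append(blob[off + 4 : off + 4 + n])
    return out

# Toy "compressed" records standing in for encoded reads.
records = [f"read{i}".encode() for i in range(6)]
blob, index = pack_records(records)
```

Because each chunk depends only on the index, separate workers could each take a disjoint `[start, stop)` range of the same file, and concatenating their outputs in chunk order reproduces the original record order.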
Xiaohui Xie, a professor in the school of information and computer sciences at the University of California, Irvine, said the “method seems to be a straightforward application of Huffman encoding” and noted that it “is certainly a welcome addition to the compression tools currently available.” He added, however, that the authors "correctly point out the method is suboptimal compared to several other existing algorithms.”
Last year, Xie was part of a team of scientists that developed a method for compressing genomic data that used a series of techniques to reduce a human genome from 3 gigabytes to 4 megabytes — small enough to be sent as an e-mail attachment (BI 01/30/2009).
TGen has made the program and source code freely available for researchers and academics, while for-profit entities must contact the institute for a license.
Tembe said that TGen will continue to develop the method further and has plans to market the product in the future. The institute has filed a patent application for the technology.
“At present we are not following a completely open source model because we have some ideas that are already in the testing phase and not released yet,” he said. “But we are making all the binaries available for all not-for-profit use and for academics and researchers.”
Tembe said that TGen is currently encouraging third-party developers to take advantage of the method.
“To that end we are working towards making a library available,” he said. For example, “a sequence alignment algorithm can include the G-SQZ library as one of the steps. The library will decompress the data on-the-fly in memory and send it to the software. This is one way [G-SQZ] can be used and this is where there are opportunities, in addition to obvious opportunities in reducing storage costs.”
The software is available for download here.