Matthew Dublin is a senior writer at Genome Technology.
Novel Algorithm Teaches Old Dog New Compression Tricks
A team of bioinformatics researchers at Illumina Cambridge in the UK has developed a new compression algorithm for next-gen sequencing data that improves upon an old compression standby, the Burrows-Wheeler transform, or BWT. The 18-year-old BWT serves as the basis for numerous compression and data-indexing methods. However, because of its design, the technique cannot be applied successfully to the very large datasets that large genome sequencing runs typically produce.
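For readers unfamiliar with the transform, here is a minimal sketch of the textbook BWT in Python. It is not the Illumina team's algorithm; the function names `bwt` and `inverse_bwt` and the `$` end-of-text sentinel are illustrative choices. This naive version builds every rotation of the input in memory, which gives a rough sense of why a straightforward implementation struggles as inputs grow.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column.

    Assumes `sentinel` does not occur in `text` and sorts before every
    other character (true for "$" versus letters in ASCII).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # Materializing n rotations of length n costs quadratic memory,
    # which is fine for a demo but hopeless for genome-scale input.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    # The original text is the rotation that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)               # annb$aa
    print(inverse_bwt(transformed))  # banana
```

The transform is reversible and tends to group identical characters into runs, which is what downstream compressors exploit.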
The team, led by Illumina's Anthony Cox, describes a novel algorithm that allows the BWT of genome data to be analyzed using only "moderate" hardware, such as a workstation or a small cluster.
Applied to 45x-coverage human genome sequence data that takes up roughly 135.3 GB of space, their technique can squash the data down to 8.2 GB. That result is more than four times smaller than what a standard BWT-based compressor, such as bzip2, can achieve.
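As a quick back-of-the-envelope check on those figures, the short Python sketch below works out the implied compression ratio and the lower bound on the bzip2 output size. The helper names are hypothetical, and the bzip2 figure is only what "more than four times" implies, not a number reported in the article.

```python
def compression_ratio(raw_gb: float, compressed_gb: float) -> float:
    """Return how many times smaller the compressed data is than the raw data."""
    return raw_gb / compressed_gb


def implied_baseline_floor(compressed_gb: float, factor: float) -> float:
    """Smallest baseline size consistent with being `factor` times larger."""
    return compressed_gb * factor


if __name__ == "__main__":
    raw_gb = 135.3        # size quoted in the article
    compressed_gb = 8.2   # size quoted in the article

    ratio = compression_ratio(raw_gb, compressed_gb)
    print(f"compression ratio: about {ratio:.1f}x")

    # "More than four times smaller" implies the bzip2 output exceeds this size.
    floor_gb = implied_baseline_floor(compressed_gb, 4)
    print(f"implied bzip2 output: more than {floor_gb:.1f} GB")
```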
In addition to saving space, and therefore money, the Illumina team's approach can help facilitate the construction of compressed full-text indexes on large sequence collections.