Compression Genomics

A collaborative effort between researchers at MIT and Harvard University has produced a new, high-speed genome search algorithm described in the latest issue of Nature Biotechnology.

The new algorithm combines the power of data compression algorithms with genome alignment search tools.

Most newly sequenced genomes are very similar to previously collected ones, and the team exploited this redundancy to allow computation directly on compressed genome data. With this approach, the time needed to analyze a set of highly similar genomes approaches the time it takes to operate on a single genome.

“You have all this data, and clearly, if you want to store it, what people would naturally do is compress it,” says Bonnie Berger, a professor at MIT and senior author on the paper. “The problem is that eventually you have to look at it, so you have to decompress it to look at it. But our insight is that if you compress the data in the right way, then you can do your analysis directly on the compressed data. And that increases the speed while maintaining the accuracy of the analyses.”
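
To make the idea concrete, here is a minimal Python sketch of searching compressed genome data without fully decompressing it. It is a toy illustration and not the researchers' implementation: it assumes all genomes have the same length, that they differ from a reference only by single-letter substitutions, and that queries must match exactly. The function names (compress, find_all, search_compressed) are hypothetical and chosen for this example.

```python
def compress(reference, genome):
    """Store a genome as only the positions where it differs from the reference.

    Assumption for this sketch: equal lengths, substitutions only.
    """
    if len(genome) != len(reference):
        raise ValueError("this toy sketch only handles equal-length genomes")
    return {i: b for i, (a, b) in enumerate(zip(reference, genome)) if a != b}


def find_all(text, query):
    """Return every start position of an exact match of query in text."""
    hits = []
    i = text.find(query)
    while i != -1:
        hits.append(i)
        i = text.find(query, i + 1)
    return hits


def search_compressed(reference, diffs_by_name, query):
    """Search the reference once, then re-check only windows touched by a difference."""
    m = len(query)
    ref_hits = set(find_all(reference, query))
    results = {"reference": sorted(ref_hits)}
    for name, diffs in diffs_by_name.items():
        # A window starting at s covers position p when p - m + 1 <= s <= p.
        affected = set()
        for p in diffs:
            lo = max(0, p - m + 1)
            hi = min(p, len(reference) - m)
            affected.update(range(lo, hi + 1))
        # Windows that touch no difference are identical to the reference,
        # so their search results can be reused as-is.
        hits = ref_hits - affected
        for s in affected:
            # Rebuild only this small window, not the whole genome.
            window = "".join(diffs.get(i, reference[i]) for i in range(s, s + m))
            if window == query:
                hits.add(s)
        results[name] = sorted(hits)
    return results


reference = "ACGTACGTAC"
sample_b = "ACGTACCTAC"  # differs from the reference at one position
compressed = {"sample_b": compress(reference, sample_b)}
print(search_compressed(reference, compressed, "ACGT"))
# prints {'reference': [0, 4], 'sample_b': [0]}
```

In this sketch the expensive full scan runs once, on the reference, and each additional genome costs work proportional to the number of places where it differs. That is the sense in which redundancy between genomes can be turned into search speed.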


As described in their Nature Biotechnology paper, the researchers have implemented accelerated versions of both BLAST and BLAT, and they underscore the importance of compression as a way to cope with ever-increasing amounts of genome data.

One obvious drawback of an approach like this is that, as more genomes are added to a database, the speed gained by analyzing compressed genomes decreases.

The source code for the prototype implementations is available for download.

Matthew Dublin is a senior writer at Genome Technology.