Researchers from MIT and Harvard University have developed a data-compression approach that they said accelerates queries for specific gene sequences in large genomic databases.
In a commentary published in this month's Nature Biotechnology, the scientists explained that they have used the method to create "compression accelerated" versions of Blast and BLAT called CaBlast and CaBLAT, respectively that are at least twice as fast as the original algorithms.
According to the developers, these accelerated algorithms serve as proof of concept that the so-called compressive genomics approach not only reduces storage space but also accelerates sequence searches.
“You have all this data, and clearly, if you want to store it, what people would naturally do is compress it,” Bonnie Berger, a professor of applied math and computer science at MIT and one of the authors on the paper, explained in a statement. “The problem is that eventually you have to look at it, so you have to decompress it to look at it. But our insight is that if you compress the data in the right way, then you can do your analysis directly on the compressed data. And that increases the speed while maintaining the accuracy of the analyses.”
According to the authors, the approach takes advantage of redundancy in the data — the fact that there is a lot of overlap in the genomes of closely related species, and some overlap even in the genomes of distantly related species — and mathematically represents these different genomes such that the overlapping data is stored only once.
“This redundancy among genomes can be translated into computational acceleration by storing genomes in a compressed format that respects the structure of similarities and differences,” such as insertions and deletions and rearrangements that are “important for analysis,” the paper explains.
Once the compressed library has been created, “it can be analyzed in an amount of time proportional to its compressed size, rather than having to reconstruct the full data set every time one wishes to query it,” the authors said.
The approach “reduces the task of computing on many similar genomes to only slightly more than that of operating on just one,” the authors wrote.
“If you can make use of that redundancy in your compression …then you don’t have to search all 30 genomes [for example]. You can basically search the equivalent of one genome, find the place where the redundancy occurs and then you can do a finer search in that particular region,” Berger told BioInform.
She explained that the approach involves two phases. In the first, “we preprocess the original stream of data using a BLAST-like approach to look for sequences that are roughly similar, and that determines the non-redundant databank.” Redundant sequences are compressed using the non-redundant data as a reference, and the compressed information is stored in an associated table of links.
Once this database has been created, every time “a new [sequence] comes along to search for, we basically look in the non-redundant database … [and] when we decide that there is a decent match between the [sequence] we are searching for and that non-redundant database, we follow the associated linked table to the compressed version of the rest of the data.”
At that point, “we actually do decompress the few linked sequences that it is a potential match to so that we can do a deeper search on the few things that are similar,” she said. However, “it’s relatively nothing compared to the huge amount of data.”
The paper notes that this approach differs from existing compression methods that have been developed for genomic data such asDNAzip. According to the authors, these tools are used primarily to reduce the space required for storing and transmitting data and require users to decompress and reconstruct the data before they can analyze it.
The researchers also note that although there have been efforts to speed up exact searches using indexing, the compressive genomics approach “will be extremely useful in the case of matching reads of unknown origin to a large database.”
When the team compared CaBlast with the non-compressive version of the software to search for a particular genetic sequence in 10 yeast genomes, the new algorithm was twice as fast as Blast. Furthermore in a search of 36 genomes, the compressive version of the algorithm was four times as fast, the researchers said.
The compression method can be incorporated into any software program, Berger told BioInform. Her group is currently working on incorporating the technique into search algorithms used for protein and RNA sequences, though she declined to disclose which specific algorithms the team plans to work with.
Meanwhile, CaBlast and CaBLAT should be useful for any applications that require the use of sequence data, the developers said.