Researchers from the Massachusetts Institute of Technology, Tufts University, and Harvard Medical School have developed a method of compressing protein data that they claim speeds up tasks such as protein structure prediction and orthology mapping while providing results as accurate as those generated by existing tools.
According to a recent Bioinformatics paper detailing theat approach, the compressively accelerated protein Blast, or CaBlastP, provides a framework for compressing protein sequence databases and performing "a variety of homology search techniques in compressed space." Since it allows researchers to run queries on much smaller datasets, CaBlastP is faster than tools such as HHblits and standard Blast algorithms. Its runtimes that scale "almost linearly in the amount of unique data" compared to "current BLASTP variants, which scale linearly in the size of the full protein database being searched," the researchers wrote.
The CaBlastP approach, which the researchers also presented at the Intelligent Systems for Molecular Biology conference last month, is a version of a technique developed for compressing and searching genomic databases. That technique, which was published in Nature Biotechnology last year, uses "compression accelerated" versions of Blast and BLAT called CaBlast and CaBLAT, respectively, that are at least twice as fast as the original algorithms (BI 7/13/2012).
According to the current Bioinformatics paper, the original motivation for developing a compressive approach for searching protein sequences "was the growing running time of BlastP searches" on the National Center for Biotechnology Information's non-redundant database; however, the researchers say that the method could be used for other time-consuming applications such as "orthology mapping across organisms [and] performing an all-against-all search between a query proteome and a set of well-studied proteomes."
The CaBlastP framework is based on the same idea as its nucleotide search counterpart. It first compresses protein databases and then applies three Blast-based protein search algorithms: BlastP, DELTA-Blast, and PSI-Blast. Noah Daniels, a post doctoral student at MIT and a co-author on the paper, explained to BioInform that like the nucleotide-based method, CaBlastP takes advantage of similarities between protein sequences to reduce the quantity of data that needs to be searched and by extension shorten the time needed to run queries.
It does this, he said, by representing identical protein strings with "canonical" sequences and then storing these in a "link index" while unique sequences are stored in a "coarse" database. This is the most computationally intensive part of the process, the researchers note, however, it only needs to be done once for a given database — the compressed resource can be reused for future searches and "incrementally maintained to keep current with new proteomic sequence data."
Next during the search phase, query sequences are searched against the coarse database and candidate hits are compared to the representative sequences stored in the link index, the paper explains. In every case where there is a match, CaBlastP reconstructs the original sequences that comprised the canonical sequence. The candidate hits are then compared to the original sequences to obtain the final result. It’s different from other methods that use compression because in those cases, researchers often have to decompress the data before they can search it, Daniels said. "We only decompress results, which is a small fraction of the data." Plus "we are not throwing out information," he said. "We keep the original data."
The result is a much faster search approach that does not compromise on accuracy, according to Daniels et al. In a homology search test that compared the compression-accelerated versions of the BlastP, PSI-Blast, and DELTA-Blast algorithms with the standard BlastP to search data from the NCBI's NR database, the researchers reported over 99 percent and 100 percent overlap in terms of sequence hits and alignments respectively. "In other words, when a hit is found, the alignment perfectly matches the standard BLASTP alignment."
Meanwhile, in a test of speed between CaBlastP and the standard BlastP using a version of the NCBI database that contained 11.6 million protein sequences, CaBlastP took 50 seconds to complete its search compared to 120 seconds needed by BlastP. On a larger version of the data comprising 22 million sequences, CaBlastP took 75 seconds to search while BlastP needed 240 seconds.