Researchers from University of Maryland's computer science department and Johns Hopkins University's Center for Computational Biology have developed software called Kraken whose approach to classifying sequences in metagenomic samples by taxonomy works much faster than current methods used by programs such as Megablast, MetaPhlAn, and PhymmBL without compromising accuracy.
According to the results of comparison tests between Kraken and similar sequence classification programs using simulated metagenomics data with high error rates — 2.1 percent SNP rate and 1.1 percent indel rate — in its standard mode, Kraken processed over 1.3 million simulated reads per minute, classifying the sequences by genus with more than 99 percent precision and 91 percent sensitivity. By way of contrast, Megablast processed just over 4,500 reads per minute with greater than 96 percent and 93 percent precision and sensitivity, respectively; while PhymmBL processed about 76 reads per minute with around 96 percent precision and sensitivity.
Kraken also does well with real genomes. At the Genome Informatics meeting last week at Cold Spring Harbor Laboratory, Derrick Wood, a doctoral student at UMD and one of Kraken's developers, presented the results of software comparisons using HiSeq data which showed that Kraken processed around 1.2 million reads per minute with high precision and sensitivity compared to tools like MetaPhlAn, which processed more than 440,000 reads per minute.
Kraken owes its speed to its use of exact alignments of k-mers — as opposed to programs like MegaBlast that look for the best match between sequences — and a hash table that is designed to facilitate fast queries of overlapping k-mers. Wood, who developed Kraken with Steven Salzberg, a professor in JHU's departments of medicine, biostatistics, and computer science, explained to BioInform that the hash table — which they created using data from the complete bacterial, archaeal, and viral genomes in the RefSeq database — maps each k-mer to the lowest point in the taxonomic tree that covers all genomes containing that k-mer. Each sample sequence to be classified has all of its overlapping k-mers queried against this hash table.
"Because two overlapping k-mers will have a common subsequence between them, Kraken structures its hash table such that k-mers with common subsequences are stored close to each other in memory" so that "after one k-mer is queried, a very similar one will be in the next query due to the overlapping," he explained. "This allows Kraken to have many of its queries access data in the CPU caches, rather than the random access memory, which makes Kraken's search process much faster."
Kraken actually stores the full reference database in the RAM; however, because the database is structured to keep similar k-mers close to each other, the CPU's cache will often have the data that Kraken needs to compare sequences. As a result, the matching step occurs about 10 times faster than would be the case if the CPU had to get the data directly from RAM each time, according to Wood.
"If Kraken didn't [or] couldn't store the database in RAM, it would be much slower, as the data would often be located on a hard drive" and accessing the data there would be "nearly a million times slower than RAM access times," he said. The software would still be able to run, but its performance would be on par with slower classification programs.
Also, Kraken is able to obtain precise matches because it uses long k values. Each k-mer is 31 nucleotides long, which, in addition to enabling more rapid comparison, is also long enough to contain information that is specific to the genome of origin. Smaller strings, on the other hand, aren't going to be as unique and the chance of making erroneous alignments goes up.
Researchers who want to use the software can take advantage of the pre-built database of RefSeq genomes or they can develop their own. Those who choose the first option have their pick of two versions of the database. They can use the full database, which is 70 gigabytes and requires a significant amount of RAM in order to get a decent amount of speed. For those who don't have that much memory at their disposal or who want to run the software on their laptops, there's MiniKraken, which uses the same approach to classify sequences. The only difference is it compares input k-mers to a pared down version of the default database that contains a subset of the reads found in larger database but is only 4GB in size.
With this smaller dataset, "your coverage of all the genomes drops quite a bit, but there's still enough in there that as long as you have high accuracy sequences … you'll still be able to classify the majority of reads," Wood said. In tests using the simulated dataset, MiniKraken was able classify k-mers with greater than 99 percent precision and 65 percent sensitivity which is "sufficient" for many use cases, Wood said.
Both the standard and mini versions of Kraken also have quick operation modes that classify input sequences based on a single hit — the first match that it finds in the reference database. Wood noted during his talk that although this approach is a little less accurate than the standard version of Kraken, which looks for exact alignments, it processes around 4 million reads per minute — still much faster than current methods.
Wood and Salzberg plan to publish a paper that will describe in detail Kraken's exact alignment-based approach to sequence classification. They also plan to improve the software to make it work even faster.
Other modifications to the tool will depend on users' feedback. "If there is a particular feature that’s requested … and if it's compatible with the way Kraken works, then I'm happy to put it in," Wood said.