NEW YORK (GenomeWeb) – A newly released iteration of the Classifier based on Reduced K-mers (CLARK) metagenomics software is able to classify a larger fraction of sequence reads in metagenomics samples than existing versions of the solution, according to its developers.
As described in a recent Bioinformatics paper, the so-called CLARK-S software tolerates mismatches between sample reads and the reference database, which lets it map a larger proportion of input sequences to microbial references. It is one of three solutions that make up the CLARK framework (CLARK, CLARK-L, and CLARK-S), which was developed by researchers at the University of California, Riverside to analyze data from metagenomics samples.
Rachid Ounit, a doctoral candidate at the University of California, Riverside and one of the method's developers, told GenomeWeb that he and his colleagues in UC Riverside's Computer Science and Engineering department developed the framework in collaboration with researchers in the school's botany and plant sciences department who were studying the barley genome. "You have a lot of repeats [in] an incredibly huge genome," so "one sequence may go to many different chromosomes," he explained.
Standard methods like BLAST, which some researchers use to align ambiguous sequences to reference genomes, take a lot of time to run and may not be as accurate, he said. CLARK was designed to help researchers classify those sequences accurately and with greater sensitivity than existing methods.
As explained in a BMC Genomics paper last year that describes the first two solutions — CLARK and CLARK-L — CLARK is a k-mer-based method that works by matching sequence reads from metagenomics samples to a database of reference sequences from viruses, bacteria, and other microorganisms. It only accepts exact matches between the reads from the sample and the reference database and discards any hits that map to multiple organisms.
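The matching scheme described above can be illustrated with a toy sketch. It is not CLARK's actual code: the function names, the tiny reference sequences, and the majority-vote rule for assigning a read are assumptions made for illustration only. The sketch keeps only k-mers that occur in exactly one reference organism, then assigns a read to the organism with the most exact k-mer hits.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_discriminative_index(references, k):
    """Map each k-mer to its organism, keeping only k-mers found in one organism."""
    owners = {}
    for organism, genome in references.items():
        for kmer in set(kmers(genome, k)):
            owners.setdefault(kmer, set()).add(organism)
    # k-mers shared by several organisms are dropped as non-discriminative
    return {kmer: next(iter(orgs)) for kmer, orgs in owners.items() if len(orgs) == 1}


def classify(read, index, k):
    """Return the organism with the most exact k-mer hits, or None if there are none."""
    hits = Counter(index[km] for km in kmers(read, k) if km in index)
    if not hits:
        return None
    return hits.most_common(1)[0][0]


# Hypothetical toy references; real databases hold whole genomes.
references = {"orgA": "ACGTACGTTTGA", "orgB": "ACGTGGGCCCAA"}
index = build_discriminative_index(references, k=4)
```

In this toy data the k-mer ACGT occurs in both references, so it is excluded from the index and a read consisting only of ACGT stays unclassified.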
That paper also compares CLARK to existing methods such as the Naïve Bayes Classification (NBC) tool, developed by researchers at Drexel University, and Kraken, which was developed by researchers at Johns Hopkins University. According to the findings reported in the BMC Genomics paper, CLARK is faster and more accurate at classifying metagenomics and genomic sequences than some competing methods and comparable to others. Specifically, "CLARK is the first method able to perform classification of short metagenomics reads at the genus/species level with a sensitivity comparable to that of NBC, while achieving a comparable speed to Kraken," the researchers wrote. In some situations, CLARK was faster and more precise than Kraken at classifying sequences at the genus and species level, according to its developers.
The second iteration of the software, CLARK-L, works in the same way as CLARK but is designed for researchers with limited access to compute power and memory. It is also described in the BMC Genomics paper. According to its developers, this iteration of the software classifies small metagenomes quickly and precisely. It works by building discriminative k-mers using non-overlapping and distant k-mers in the targets. It then searches for exact matches to the k-mers in metagenomics databases.
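One way to picture the "non-overlapping and distant" sampling is the sketch below. The function name and the gap parameter are hypothetical, and the article does not say how CLARK-L actually spaces its k-mers. The sketch only shows why such sampling shrinks the index: far fewer k-mers are stored per target sequence.

```python
def sparse_kmers(seq, k, gap):
    """Yield non-overlapping k-mers, skipping `gap` bases between consecutive ones."""
    step = k + gap  # a step of at least k guarantees the k-mers never overlap
    for start in range(0, len(seq) - k + 1, step):
        yield seq[start:start + k]


sequence = "ACGTACGTACGT"  # hypothetical target sequence
sampled = list(sparse_kmers(sequence, k=4, gap=2))
all_overlapping = [sequence[i:i + 4] for i in range(len(sequence) - 4 + 1)]
```

For this 12-base sequence, the sampling keeps two k-mers where the fully overlapping scheme would store nine.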
This iteration of the software requires less than four gigabytes of random access memory and can easily run on a laptop, Ounit said. CLARK-L is less sensitive than CLARK, so it will not be able to match all of the input reads, but it will run almost as quickly as CLARK and is as precise.
The decision to add CLARK-S to the suite grew out of an observed need for a solution that could map a much larger proportion of the reads found in metagenomics samples, according to Ounit. Because CLARK accepts only exact matches, it can classify only about 10 or 20 percent of the reads in a sample, and the rest remain unknown.
"So we've been working on it, and the way to improve the tool to identify more reads is to relax the constraint of exact matching," he said — and that's what CLARK-S is designed to do. It allows mismatches between sample reads and reference databases, but it still requires that they are discriminative hits, he explained. What that means is that if a sample read maps to an organism in the reference database, then the software allows it, but if it matches multiple species then the hit is not discriminative enough and is discarded.
This flexibility in read matching is important for analyzing samples where a lot of the species present are unknown, such as those found in ocean water samples. "There is a lot of diversity in the ocean that current databases do not contain ... but some of the species known in the database may be close enough to the species in the ocean," he said. "By allowing mismatches, you may not identify the organism at the strain or species level but you may identify them at a higher rank for example, genus or phylum."
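The idea of relaxed but still discriminative matching can be sketched as follows. This is a simplified illustration rather than the CLARK-S algorithm: it assumes that tolerance means a bounded number of substitutions (Hamming distance) between a read k-mer and a reference k-mer, and the function names and toy data are hypothetical. A hit is kept only when every reference k-mer within the mismatch budget belongs to a single organism.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def tolerant_hit(read_kmer, ref_kmers, max_mismatches=1):
    """Return the one organism matched within max_mismatches, else None.

    ref_kmers is a list of (kmer, organism) pairs. A k-mer that matches
    several organisms is treated as non-discriminative and discarded.
    """
    organisms = {
        organism
        for kmer, organism in ref_kmers
        if len(kmer) == len(read_kmer) and hamming(read_kmer, kmer) <= max_mismatches
    }
    return organisms.pop() if len(organisms) == 1 else None


# Hypothetical reference k-mers
refs = [("ACGTA", "orgA"), ("TTGCA", "orgB"), ("ACGTT", "orgC")]
```

With these toy references, ACGAA is one substitution away from orgA only, so it counts as a hit under a one-mismatch budget but not under exact matching. ACGTA is within one mismatch of both orgA and orgC, so it is discarded.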
Since it was published, CLARK-S has been used in at least one study of seawater samples that was published in July in the International Society for Microbial Ecology journal.
Compared to other members of the software family, the CLARK-S classifier has the highest random access memory requirements — it currently needs at least 100 gigabytes of RAM to run, although the developers hope to be able to bring that requirement down, Ounit told GenomeWeb. It is also slower than other methods in the family. According to the paper, CLARK-S classifies approximately 200,000 short reads per minute. In contrast, CLARK classifies about 3.5 million reads per minute.
However, if the user has access to more cores, they can get CLARK-S to run faster. For example, using eight cores, CLARK-S can classify about one million reads per minute, according to its developers.
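To put the reported rates in perspective, the short calculation below estimates run times for a hypothetical sample of 10 million reads. The sample size is an assumption, as is the reading that the 200,000 reads-per-minute figure refers to a single core. The rates themselves are the ones quoted above.

```python
# Reads classified per minute, as reported by the developers
RATES = {
    "CLARK": 3_500_000,
    "CLARK-S": 200_000,
    "CLARK-S (8 cores)": 1_000_000,
}


def minutes_to_classify(num_reads, reads_per_minute):
    """Estimate run time in minutes at a constant classification rate."""
    return num_reads / reads_per_minute


sample_size = 10_000_000  # hypothetical sample
estimates = {name: minutes_to_classify(sample_size, rate) for name, rate in RATES.items()}
slowdown_vs_clark = RATES["CLARK"] / RATES["CLARK-S"]
speedup_on_8_cores = RATES["CLARK-S (8 cores)"] / RATES["CLARK-S"]
```

At these rates, the hypothetical sample would take CLARK-S about 50 minutes, or about 10 minutes on eight cores, against roughly three minutes for CLARK. The eight-core figure works out to a fivefold speedup.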
Despite the slower speed, CLARK-S is more sensitive at classifying sequences at the species level than other versions of the software, as well as at least one other solution. According to the results of one experiment reported in the Bioinformatics paper using 14 simulated metagenomics datasets, CLARK-S returned results that were more sensitive and precise than those from CLARK and JHU's Kraken. Furthermore, when the researchers used the software to analyze real datasets, CLARK-S, on average, classified 10 percent more reads than Kraken and 27 percent more reads than CLARK, according to the paper.