Researchers at the European Bioinformatics Institute have released the first version of Cram, a new software toolkit and file format for compressing and storing next-generation sequence data more efficiently.
Guy Cochrane, team leader of the EBI’s European Nucleotide Archive, explained that his group began developing Cram to help control storage costs for the ENA, “but there is a clear utility for anyone facing their own data storage challenges, for those computing on large data sets and for those transferring data around networks.”
According to its developers, Cram achieves a balance between usability and high levels of compression. It offers an alternative to existing formats that are less efficient in terms of disk space, and the toolkit features several compression models. Users can choose between the different compression models as well as set the parameters to control the amount of compression depending on what their requirements are.
Cram builds on a reference-based compression method that the development team described in a paper published last year in Genome Research.
Cram’s compression activities have “a sequence and a quality-per-call component,” Cochrane explained to BioInform this week. On the sequence side, Cram aligns raw sequence reads to a reference genome and stores only the differences; the fewer the differences between the reads and the reference, the greater the compression.
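The sequence-side idea can be illustrated with a minimal Python sketch. It is not Cram's actual encoding: the function names, the tuple layout, and the restriction to substitution-only differences (no insertions or deletions) are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical record layout: (start position, read length, substitutions).
Record = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, read: str, pos: int) -> Record:
    """Store an aligned read as its position plus the bases that differ."""
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return pos, len(read), diffs


def decode_read(reference: str, pos: int, length: int,
                diffs: List[Tuple[int, str]]) -> str:
    """Rebuild the read by copying the reference and applying the differences."""
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGTACGT"
read = "GTACCTAC"  # aligns at position 2 with a single mismatch
record = encode_read(reference, read, 2)
restored = decode_read(reference, *record)
```

A read that matches the reference exactly yields an empty difference list, which is why closer agreement with the reference means less to store.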
The quality component, on the other hand, doesn’t “fit” the reference-based model, so for this part of the process Cram uses one of two modes: lossless or lossy compression. In the former, the original quality data are fully preserved. Users who choose lossy compression can apply a controlled loss of precision to achieve compressed files that are dramatically smaller.
In lossless compression, all of the quality per call information is retained, so “you get some compression” but “it’s not the greatest possible compression,” Cochrane said.
In this mode, “there are bits of information that you may associate with sequence reads … for example, individual names for the reads that you are trying to store [and] we remove those for the sake of getting compression,” he said. “You can still point to individual reads, so the function is preserved, but you can’t make an exact copy of the input data … so it’s lossless with respect to sequence and quality alone.”
In lossy compression, which is “lossy with respect to the quality information but lossless still with respect to the sequence,” users have two options, Cochrane said.
“In the first mode, it is a simple compression,” he explained. “You apply a uniform quantization of the quality scale, so you simply remove precision evenly from the quality scale and that gets you a further compression advantage.”
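As a rough illustration of uniform quantization, the Python sketch below collapses Phred-style quality scores into evenly sized bins. The bin width of 8 and the choice to represent each bin by its midpoint are arbitrary assumptions, not Cram's actual parameters.

```python
from typing import List


def quantize_qualities(quals: List[int], bin_width: int = 8) -> List[int]:
    """Replace each quality score with the midpoint of its fixed-width bin."""
    if bin_width < 1:
        raise ValueError("bin_width must be at least 1")
    return [(q // bin_width) * bin_width + bin_width // 2 for q in quals]


quals = [2, 11, 25, 30, 38, 40]
binned = quantize_qualities(quals)
```

Because every score in a bin maps to the same value, the quality track contains fewer distinct symbols and longer runs, which a downstream compressor can store in less space. A wider bin removes more precision and compresses further.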
A second, and more complex, lossy compression mode works “on the premise that … the useful bits of information in the quality track are not evenly distributed” and, consequently, “there are certain data points that have a greater impact on the analysis output,” he explained.
“That means there are certain data points that you can degrade confidently without losing the important message in the data and there are other data points that you have to preserve,” he continued. Cram “supports the application of a number of different inbuilt models for doing this and it also supports external models whereby the user defines exactly where the qualities are to be degraded and where they are to be preserved,” he said.
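The non-uniform approach can be sketched in Python as a mask that marks which quality values to keep. The article does not say which models Cram ships with, so the rule used here (keep qualities only where the read disagrees with the reference) and the placeholder value of 30 are assumptions chosen for illustration; a user-defined mask would play the role of the external model Cochrane describes.

```python
from typing import List


def mismatch_mask(ref_window: str, read: str) -> List[bool]:
    """Hypothetical model: preserve quality only where read and reference differ."""
    if len(ref_window) != len(read):
        raise ValueError("reference window and read must be the same length")
    return [ref_base != base for ref_base, base in zip(ref_window, read)]


def degrade_qualities(quals: List[int], preserve: List[bool],
                      placeholder: int = 30) -> List[int]:
    """Keep qualities where preserve is True; replace the rest with a constant."""
    if len(quals) != len(preserve):
        raise ValueError("quals and preserve must be the same length")
    return [q if keep else placeholder for q, keep in zip(quals, preserve)]


ref_window = "GTACGTAC"
read = "GTACCTAC"
quals = [38, 37, 40, 39, 12, 36, 35, 40]

mask = mismatch_mask(ref_window, read)  # built-in style model
kept = degrade_qualities(quals, mask)

# An externally supplied mask, e.g. also preserving the first two bases.
custom_mask = [True, True, False, False, True, False, False, False]
custom = degrade_qualities(quals, custom_mask)
```

Only the positions flagged in the mask carry their original scores, so the amount of quality data retained depends on how selective the model is.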
Depending on the kind of data, the way it’s stored, and other variables, Cram in its lossless mode can provide a two- to four-fold reduction in disk space compared to a typical BAM or compressed FASTQ file, Cochrane said.
In simple lossy compression mode, it provides a four- to five-fold reduction in disk space, while the more advanced lossy compression mode can reduce the amount of storage required by 10- to 20-fold or more compared to currently used formats, he said.
“It’s a sliding scale” because “it depends on how you use it,” he stressed.
Meanwhile, the jury is still out on exactly which compression mode best fits which datasets, Cochrane said. His team published a commentary in GigaScience earlier this year that proposed “a graded system in which the ease of reproduction of a sequencing-based experiment and the relative availability of a sample for re-sequencing define the level of lossy compression applied to stored data.” The group is also involved in community discussions around the application of Cram to different types of data.
“The capacity to do the compression is there, and the technology is available but we need to decide as a community exactly how to use it,” he said. In particular, the community needs “to understand which models should be applied to which types of data, in which context” and “how aggressively to compress different datasets.”
Cram’s developers are preparing a paper that will describe the toolkit in detail, which will be published at a later date, Cochrane said. That paper will also include a comparison of Cram with other tools, he said, although he could not provide additional details.
The group has also “been working really hard to make this integratable with other people’s tools so that the Cram format will slot right into existing analysis tools and pipelines,” Cochrane said.
Users can also “directly access a Cram compressed file and ask questions of it,” he said. “Unlike a more generic compression tool, you don’t have to compress everything and then decompress it every time you want to use it.”
Furthermore, the toolkit lets users transform BAM and FASTQ file formats into Cram and then back, he said, and the group has also enabled it to be built into higher-level tools. For example, Cram is interoperable with SAM-JDK, a Java implementation of SAMtools, which makes it accessible through tools like Picard, he said.
He added that the group is working to integrate the toolkit with other tools including a C implementation of SAMtools.
The ENA database now accepts submissions in the Cram file format, alongside existing formats, and will be moving towards a full Cram-based infrastructure in the future.
So far, a number of sequencing centers and vendors as well as academic groups are exploring the Cram toolkit and file format, although it isn’t clear at this point if any have incorporated it into their pipelines, Cochrane said.
One example is the University of California, Santa Cruz, which is exploring Cram as one of two compression approaches to reduce the amount of storage needed to hold data in its Cancer Genomics Hub (BI 5/4/2012).
Other efforts to reduce the storage footprint of NGS data include an approach developed by a researcher at the Wellcome Trust Sanger Institute, which nabbed top marks during the “Sequence Squeeze” competition organized by the Pistoia Alliance earlier this year. That approach used a cluster of algorithms to compress the genomic sequences prior to storage (BI 4/27/2012).