NEW YORK (GenomeWeb) – Cambridge, UK-based startup PetaGene is hoping to make its bread and butter by offering software that compresses sequence files so they can be stored in a fraction of the space they currently require.
The PetaGene Suite, which officially launched earlier this year, comprises complementary software tools that help researchers reduce Bam, Cram, Fasta, and Fastq file sizes. Specifically, it features a tool called FasterQ, which compresses Fastq files into the smaller PetaGene-developed format called Fasterq; and BayesCal, which uses a statistical approach to refine base call quality scores and thus help reduce file sizes.
The suite also includes a tool called PetaView, which lets users convert standard Bam and Fastq.gz files into more efficient compression formats like Cram. Besides the PetaGene Suite, the company also offers a product called BayesQual, which is used for improving base quality score recalibration.
PetaGene is pursuing a rather unusual business model. It will get paid based on how much money customers save on storage costs by using its software. "We only make money if they save money," Dan Greenfield, PetaGene's co-founder and director, told GenomeWeb at the Bio-IT World Conference last month in Boston. The company is still working out the exact details, but it plans to charge a per-terabyte price for the storage it frees up. "We can promise you that it will be much cheaper than going out and buying more storage," he said.
According to internal benchmarks, PetaGene's software suite compresses sequence files up to fivefold without compromising genotyping accuracy. That results in an up to fourfold reduction in storage costs in lossless mode and an up to fivefold reduction in lossy mode, the company estimates.
Smaller files are easier and quicker to move to and from external servers. According to PetaGene, its software cuts data transfer times by about a factor of five compared to transferring larger gzipped Fastq and Bam files. In one assessment, it took nearly six minutes to transfer a Bam file containing sequence from the NA12878 platinum genome over a wide area network with a speed of approximately 10 megabits per second. In contrast, it took a little more than a minute to move a PetaSuite-compressed file.
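The link between file size and transfer time is simple arithmetic, and a minimal Python sketch makes it concrete. The payload sizes below are illustrative assumptions chosen only to show a fivefold size difference; they are not figures from PetaGene's assessment, and the calculation ignores protocol overhead and latency.

```python
def transfer_seconds(size_bytes: float, link_megabits_per_s: float) -> float:
    """Idealized transfer time: payload bits divided by link speed in bits per second."""
    bits = size_bytes * 8
    return bits / (link_megabits_per_s * 1_000_000)


# Hypothetical payloads (assumptions, not benchmark data): a file and a 5x smaller version.
LINK_MBPS = 10
original_seconds = transfer_seconds(400e6, LINK_MBPS)
compressed_seconds = transfer_seconds(80e6, LINK_MBPS)

print(f"original:   {original_seconds:.0f} s")
print(f"compressed: {compressed_seconds:.0f} s")
print(f"speedup:    {original_seconds / compressed_seconds:.1f}x")
```

On a fixed-speed link the speedup equals the compression ratio, which is why a roughly fivefold smaller file moves in roughly one-fifth of the time.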
PetaGene demoed the software for potential clients at the Bio-IT conference, where it took home one of the Best of Show awards. The suite was developed by a team of researchers from the University of Cambridge as part of a collaborative study with the European Bioinformatics Institute that focused on methods of compressing and storing genomic data more efficiently.
One way to do that was to remove information in sequence files that is not crucial to analysis and to store it in a separate storage tier, Greenfield explained. The researchers also looked into mechanisms for compressing base quality scores, which can still take up as much as 60 to 80 percent of sequence file space even after the scores have been quantized, meaning that they are grouped into representative bins rather than stored individually. They wanted to reduce the amount of space that quality scores take up without reducing genotyping accuracy, which can be a problem if the sequencing coverage is too low to begin with.
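Quantization of quality scores, as described above, can be sketched in a few lines of Python. The bin boundaries and representative values here are illustrative assumptions, not the scheme used by PetaGene or by any particular sequencer.

```python
from bisect import bisect_right

# Illustrative bins (assumption): BIN_EDGES[i] is the lowest score in bin i,
# and BIN_VALUES[i] is the single value stored for every score in that bin.
BIN_EDGES = [0, 10, 20, 25, 30, 35, 40]
BIN_VALUES = [6, 15, 22, 27, 33, 37, 40]


def quantize(q: int) -> int:
    """Map one Phred quality score to the representative value of its bin."""
    if q < 0:
        raise ValueError("quality scores must be non-negative")
    return BIN_VALUES[bisect_right(BIN_EDGES, q) - 1]


def quantize_read(quals: list[int]) -> list[int]:
    """Quantize every quality score in a read."""
    return [quantize(q) for q in quals]


print(quantize_read([2, 11, 24, 31, 38, 41]))
```

Because every score collapses to one of seven values, the quality string has fewer distinct symbols and compresses better, at the cost of discarding the exact original scores.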
"We started thinking about it from the perspective of ... sequencing [being] like WiFi," he said. Both processes essentially involve transmitting noisy data from one point to the next and both have to deal with lost information and error introduction along the way. "We came up with a Bayesian approach based on this idea ... that works really well in terms of refining the quality scores." That method, BayesCal, determines the likelihood of a sequencing error at each base by looking at it in the context of the full sequence read and associated quality score as well as the full genome. It then adjusts the base's quality score based on this additional information.
Compression is a side effect of the modification process. "For most bases there is typically sufficient evidence to improve the quality scores ... that they reach a saturated [or] maximum allowed value," Greenfield explained. "This means that there is less variability in the quality scores and thus [the file] compresses better." Also, since scores are modified based on a broader corpus of information "we generally preserve or improve genotyping accuracy," he added.
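The effect Greenfield describes can be illustrated with a toy model in Python. This is not BayesCal: it is a minimal sketch that assumes the only extra evidence is whether a base matches a reference genome, with a hypothetical variant rate and score cap. It then measures how the refined scores compress.

```python
import math
import random
import zlib
from collections import Counter

MAX_Q = 41            # assumed maximum allowed quality score
VARIANT_RATE = 0.001  # assumed chance that the true base differs from the reference


def refined_quality(q: int, matches_reference: bool) -> int:
    """Toy Bayesian update of a Phred score given agreement with the reference."""
    p_err = 10 ** (-q / 10)  # Phred definition: Q = -10 * log10(P(error))
    if matches_reference:
        like_err = VARIANT_RATE / 3  # an error lands on the reference base by chance
        like_ok = 1 - VARIANT_RATE   # a correct call where the sample equals the reference
    else:
        like_err = 1 - VARIANT_RATE / 3
        like_ok = VARIANT_RATE       # a correct call at a true variant
    posterior = p_err * like_err / (p_err * like_err + (1 - p_err) * like_ok)
    return max(0, min(MAX_Q, round(-10 * math.log10(posterior))))


def entropy_bits(scores: list[int]) -> float:
    """Shannon entropy in bits per score."""
    total = len(scores)
    return -sum(c / total * math.log2(c / total) for c in Counter(scores).values())


def compressed_size(scores: list[int]) -> int:
    """Size in bytes after zlib compression, used as a stand-in for any compressor."""
    return len(zlib.compress(bytes(scores), 9))


rng = random.Random(0)
original = [rng.randint(20, 40) for _ in range(50_000)]  # simulated scores
matches = [rng.random() < 0.99 for _ in original]        # assumed 99% agreement
refined = [refined_quality(q, m) for q, m in zip(original, matches)]

print(f"entropy: {entropy_bits(original):.2f} -> {entropy_bits(refined):.2f} bits/score")
print(f"zlib:    {compressed_size(original)} -> {compressed_size(refined)} bytes")
```

In this sketch nearly every base that agrees with the reference saturates at the cap, so the score stream becomes highly repetitive. A real method would draw on far more evidence, such as the rest of the read and other reads covering the same position.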
The BayesQual component of the PetaGene suite is based on the same technology but it performs some additional operations related to the base call quality score recalibration stages of sequence analysis pipelines, Greenfield said. This tool is also able to compress data at different stages of the analysis pipeline such as after mapping or duplicate marking.
The process for compressing files in the software is quite simple. When full sequence files such as Bam or Fastq.gz files are loaded into PetaSuite, the software compresses them into the Cram file format, checks the quality of the conversion, and then discards the original. Those reduced-size Cram files are then stored on users' systems. In PetaView, users see those Cram files as "virtual" Bam files, labeled this way so that users know where the data originated, but what they actually get when they access the files is the compressed Cram version.
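The compress, verify, and discard sequence described above follows a common pattern that can be sketched in Python. The sketch uses the standard library's lzma module as a stand-in for Cram conversion, and the function name and file names are hypothetical; it does not reflect PetaSuite's actual implementation.

```python
import hashlib
import lzma
import tempfile
from pathlib import Path


def compress_verify_discard(source: Path) -> Path:
    """Compress a file, confirm it round-trips, and only then delete the original.

    Reads the whole file into memory, which is fine for a sketch but not for
    genome-scale files.
    """
    target = source.with_name(source.name + ".xz")
    original = source.read_bytes()
    target.write_bytes(lzma.compress(original))

    # Verification step: decompress what was written and compare checksums.
    restored = lzma.decompress(target.read_bytes())
    if hashlib.sha256(restored).digest() != hashlib.sha256(original).digest():
        target.unlink()
        raise ValueError(f"round-trip check failed for {source}")

    source.unlink()  # discard the original only after a successful check
    return target


# Demonstration with placeholder bytes (not a real Bam file).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample.bam"
    payload = b"ACGT" * 1000
    src.write_bytes(payload)
    out = compress_verify_discard(src)
    original_gone = not src.exists()
    round_trip_ok = lzma.decompress(out.read_bytes()) == payload
    print(out.name, original_gone, round_trip_ok)
```

Deleting the source only after the check passes is the design choice that matters: a failed conversion leaves the original file untouched.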
The file size reduction numbers that PetaGene reports are quite good. According to internal tests, BayesCal compressed a roughly 79-gigabyte Fastq file down to about 55 gigabytes, a 72-gigabyte Bam file down to about 44 gigabytes, and a 35-gigabyte Cram file down to about half its size, just under 18 gigabytes, among other benchmarks. With BayesQual, final Bam files are three to four times smaller than the originals and Cram files are seven to eight times smaller, according to the same internal benchmarks.
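For readers who want the BayesCal figures above as ratios, a short Python calculation follows. It uses only the approximate sizes reported in the text, so the results are approximate as well.

```python
def compression_ratio(before_gb: float, after_gb: float) -> float:
    """How many times smaller the compressed file is."""
    return before_gb / after_gb


def space_saved(before_gb: float, after_gb: float) -> float:
    """Fraction of the original storage that is freed."""
    return 1 - after_gb / before_gb


# Approximate before/after sizes in gigabytes, as reported for BayesCal.
benchmarks = {"Fastq": (79, 55), "Bam": (72, 44), "Cram": (35, 18)}

for name, (before, after) in benchmarks.items():
    ratio = compression_ratio(before, after)
    saved = space_saved(before, after)
    print(f"{name}: {ratio:.2f}x smaller, {saved:.0%} of space freed")
```

The three BayesCal examples work out to roughly 1.4x, 1.6x, and 1.9x, respectively.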
The compressed files can be used just like the original Bam. This is valuable for researchers who want to use PetaGene-compressed files in third-party tools that may not support some formats or newer versions of formats like Cram, Greenfield noted. For example, if a researcher running the Broad Institute's Integrative Genomics Viewer wants to visualize a portion of chromosome 22 from sequence they have compressed in the Cram file format, which IGV does not support, PetaView will convert the corresponding portion of the Cram file and present it in the Bam format within the browser, he said. "As far as [the browser] is concerned, it's a normal Bam file."
Another benefit of PetaView is that it lets users take advantage of tiered storage infrastructure. In the so-called tiered lossless mode of the software, "we allow splitting the original Bam into a reduced size BayesCal Cram file and delta file which can reside on different tiers," Greenfield explained. The delta file holds data that is not crucial to the analysis and can be stored on slower, less expensive storage resources while the smaller, critical datasets are stored on more expensive, higher-speed storage disks. Researchers can use the smaller BayesCal version but if they for some reason want to access the original file, PetaView can automatically reconstruct it, he said. This ability to take advantage of tiered storage helps reduce I/O load as well as lower overall storage costs, he added.
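The principle behind a tiered lossless mode, a reduced file plus a delta that together restore the original, can be sketched in Python. The article does not describe what PetaGene's delta file actually contains, so the split below (coarse quality scores plus per-score remainders) is purely an illustrative assumption.

```python
def split_tiers(quals: list[int], step: int = 8) -> tuple[list[int], list[int]]:
    """Split scores into a coarse 'reduced' part and a 'delta' part.

    reduced: each score rounded down to a multiple of `step` (fast tier).
    delta:   the remainder needed to restore the exact score (slow tier).
    """
    reduced = [q - q % step for q in quals]
    delta = [q % step for q in quals]
    return reduced, delta


def reconstruct(reduced: list[int], delta: list[int]) -> list[int]:
    """Recombine both tiers into the exact original scores."""
    if len(reduced) != len(delta):
        raise ValueError("tiers must describe the same number of scores")
    return [r + d for r, d in zip(reduced, delta)]


original = [37, 12, 40, 25, 8, 33]
reduced, delta = split_tiers(original)
print(reduced)
print(delta)
print(reconstruct(reduced, delta) == original)
```

Routine analysis would read only the reduced tier, while the delta is fetched from slower storage only when an exact copy of the original is needed.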
So far, PetaGene has secured one customer for its software. Researchers in the UK's National Health Service's Blood and Transplant division are using the software as part of their efforts to analyze rare disease data. The company has other customers in its pipeline but it has not finalized deals with them at this point, Greenfield said. The company is also involved in a number of collaborations with groups at places like The Genome Analysis Centre in the UK, through which it gains perspective on the different ways researchers want to use its technology and how best to meet those needs. PetaGene offers its software as Debian and RPM packages, so any researcher running Linux will be able to install it.
In the market, Greenfield expects PetaGene to compete with firms like Geneformics, an Israeli bioinformatics startup that offers both local and cloud-based options for compressing genomic data. A number of academic research groups have also developed genomic data compression methods. For example, researchers from Simon Fraser University in British Columbia and Indiana University developed software called Deez in 2014 for compressing Sam and Bam files to reduce storage requirements. Another study, published in BMC Bioinformatics in 2014 by researchers from China and the US, describes an algorithm called MMQSC that compresses Fastq files by extracting quality scores.
Greenfield said that PetaGene's software outperforms a number of these academic solutions and that the company will soon publish a paper in Bioinformatics that describes how an early version of BayesCal beat existing academic solutions.