Geneformics Sees Lossless Compression as Answer to Cloud Genomics


CHICAGO (GenomeWeb) – Earlier this month, startup IT company Geneformics Data Systems introduced Geneformics D, a distributed cloud compression system that promises to boost the speed of genomics data uploads, downloads, storage, and archiving tenfold and reduce costs by 90 percent.

The initial release is currently integrated into Amazon Web Services infrastructure. Geneformics, which is based in Sunnyvale, California, and has its research and development operations in Petach Tikvah, Israel, said it is working on versions for other cloud platforms.

The new product focuses on alleviating workflow bottlenecks associated with the migration of data to the cloud, according to CEO Rafael Feitelberg.

"Geneformics is all about providing the tools and infrastructure to make genomics data accessible through compression," Feitelberg said.

"We see the issue of the mere size of the genomic data as one of the main inhibitors for genomics to be really ubiquitous in the world," he explained. A sequenced human genome might be 200 to 300 gigabytes of raw data, while an analyzed genome could take up a full terabyte of disk space. "If you want to create gene banks, the mere size of the data is going to be very, very prohibitive," Feitelberg said.

"The uploading of that data to the cloud, the storage of that data, the archiving of that data, all of them become more and more problematic just due to the size," he contended. To Geneformics, lossless compression technology makes data management more accessible, usable, and affordable.

Feitelberg emphasized losslessness. "Our view is that researchers and bioinformaticians shouldn't ever change the analysis that they're doing because of data compression," he said.

"Data compression should be something which is really hidden from them in a lossless and transparent way. What that means from a compression and solution perspective is that we are capable of decompressing the data at high speeds and actually streaming it back to all of these applications in a lossless form," he continued. "It is equivalent, bit for bit, to the original, uncompressed file."

Geneformics grew out of the Weizmann Institute of Science in Rehovot, Israel, based on the data compression work of Weizmann computational biologist Eran Segal, who cofounded the company in 2014 with career technologist and current Geneformics Chief Technology Officer Arik Keshet.

Funding has come from investors including Geneformics Chairman Dov Moran, who created DiskOnKey, widely cited as the first USB flash drive. Moran and two private equity firms have put about $2.85 million into Geneformics, according to Crunchbase.

Geneformics D is the first purely cloud-based offering the company has released, but Geneformics has existing products for locally hosted installations. Geneformics S is a locally hosted server for lossless compressed files, while Geneformics C is a streaming-file sharing application that can reside locally or in the cloud. Another product, Geneformics U, brings similar technology to Unix installations.

In all cases, the technology compresses the size of FASTQ files by about 10 times and BAM files by 2.5 times, the company said.
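To put those ratios in concrete terms, the short Python sketch below works through the arithmetic. The 10x and 2.5x ratios are the company's figures; the example file sizes (a 250 GB FASTQ set, taken as the midpoint of the raw-data range quoted earlier, and a 100 GB BAM file) are assumptions chosen only for illustration.

```python
# Illustrative arithmetic only. The ratios are the figures quoted by the
# company; the example file sizes below are hypothetical.
FASTQ_RATIO = 10.0
BAM_RATIO = 2.5


def compressed_size_gb(original_gb: float, ratio: float) -> float:
    """Return the size in GB after compressing at the given ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return original_gb / ratio


def percent_saved(ratio: float) -> float:
    """Return the percentage of the original footprint that is removed."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return (1.0 - 1.0 / ratio) * 100.0


if __name__ == "__main__":
    examples = [
        ("FASTQ", 250.0, FASTQ_RATIO),  # hypothetical 250 GB raw read set
        ("BAM", 100.0, BAM_RATIO),      # hypothetical 100 GB alignment file
    ]
    for label, size_gb, ratio in examples:
        after = compressed_size_gb(size_gb, ratio)
        saved = percent_saved(ratio)
        print(f"{label}: {size_gb:.0f} GB -> {after:.0f} GB ({saved:.0f}% smaller)")
```

A 10x ratio corresponds to the 90 percent footprint reduction the company cites, while 2.5x on BAM files works out to a 60 percent reduction.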

Geneformics D for AWS is significant because Amazon has tiered pricing. The cost of intermediate storage packages could be 45 percent lower than AWS's basic S3 (Simple Storage Service) product. The Geneformics technology is about "being able to intelligently store the relevant genomic data on the tier which is most cost-effective at significant additional savings beyond what you can get with just compression," Feitelberg said.

"With compression, we will reduce the footprint by up to 90 percent. In addition, by having an intelligent tiering at the granular level of the genomic data, then we can even increase those savings more," he added.
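As a rough sketch of how the two savings could stack, the Python snippet below combines the quoted figures: a footprint reduction of up to 90 percent and a storage tier priced up to 45 percent below standard S3. The function name `relative_cost` and the share of data placed on the cheaper tier (`cold_fraction`) are assumptions made for illustration, and the model ignores retrieval and request fees, which real tiered storage typically charges.

```python
def relative_cost(footprint_reduction: float,
                  cold_fraction: float,
                  tier_discount: float) -> float:
    """Storage cost relative to keeping uncompressed data on the standard tier.

    footprint_reduction: fraction of bytes removed by compression (e.g. 0.90)
    cold_fraction: fraction of the compressed data kept on the cheaper tier
                   (hypothetical; not a figure from the company)
    tier_discount: how much cheaper that tier is per byte (e.g. 0.45)
    """
    for name, value in (("footprint_reduction", footprint_reduction),
                        ("cold_fraction", cold_fraction),
                        ("tier_discount", tier_discount)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    remaining = 1.0 - footprint_reduction          # bytes left after compression
    hot_cost = remaining * (1.0 - cold_fraction)   # billed at the standard rate
    cold_cost = remaining * cold_fraction * (1.0 - tier_discount)
    return hot_cost + cold_cost


if __name__ == "__main__":
    # Compression alone, compared with compression plus tiering.
    for cold in (0.0, 0.8, 1.0):
        cost = relative_cost(0.90, cold, 0.45)
        print(f"cold_fraction={cold:.1f}: {cost * 100:.1f}% of the original cost")
```

Under these assumptions, compression alone leaves about 10 percent of the original storage bill, and moving all of the compressed data to the cheaper tier would bring that to about 5.5 percent. The real figure depends on how much data must stay on the faster tier.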

This is important in genomics because researchers often need only a small subset of a given genome's data. "You don't need to keep all of the genome data in [an] expensive tier," Feitelberg explained.

"Because [our technology] compresses in blocks, we have an intelligent cache that ... manages what you are actually seeking all of the time and separates it between high-accessibility data on a more expensive tier that gives you high performance and moves the less-used data in a compressed way to a less-expensive tier."
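The general idea of block-wise compression with a cache in front of it can be sketched in a few lines of Python. This is not Geneformics' proprietary method: it uses the standard library's `zlib` as a stand-in codec, and the class name `TieredBlockStore`, the 64 KB block size, and the least-recently-used eviction policy are all assumptions made for illustration. The point is only that independently compressed blocks let a reader fetch a small region without decompressing the whole file.

```python
import zlib
from collections import OrderedDict

BLOCK_SIZE = 64 * 1024  # hypothetical block size


class TieredBlockStore:
    """Toy two-tier store: compressed blocks in a 'cold' tier, with the most
    recently read blocks kept decompressed in a small 'hot' tier."""

    def __init__(self, data, hot_capacity=2, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.length = len(data)
        self.hot_capacity = hot_capacity
        # Cold tier: every block is compressed independently of the others.
        self.cold = [zlib.compress(data[i:i + block_size])
                     for i in range(0, len(data), block_size)]
        # Hot tier: least-recently-used cache of decompressed blocks.
        self.hot = OrderedDict()

    def _block(self, index):
        if index in self.hot:
            self.hot.move_to_end(index)        # mark as recently used
            return self.hot[index]
        block = zlib.decompress(self.cold[index])
        self.hot[index] = block
        if len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)       # evict the least recently used
        return block

    def read(self, start, length):
        """Return bytes [start, start + length), touching only the blocks
        that overlap the requested range."""
        end = min(start + length, self.length)
        if start < 0 or start >= end:
            return b""
        first = start // self.block_size
        last = (end - 1) // self.block_size
        chunk = b"".join(self._block(i) for i in range(first, last + 1))
        offset = start - first * self.block_size
        return chunk[offset:offset + (end - start)]


if __name__ == "__main__":
    data = b"ACGT" * 50000                     # synthetic stand-in for sequence data
    store = TieredBlockStore(data)
    region = store.read(70000, 100)
    print("region matches original:", region == data[70000:70100])
    print("blocks currently in hot tier:", list(store.hot))
    print("full round trip is lossless:", store.read(0, len(data)) == data)
```

In this sketch a read decompresses only the blocks that overlap the requested range, and the round-trip check mirrors the bit-for-bit equivalence Feitelberg describes. A production system would also need an index, integrity checks, and a policy for moving blocks between real storage tiers.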

Customers include Cambridge, Massachusetts-based sequencing company WuXi NextCode and the Garvan Institute of Medical Research in Sydney. "We're working together to enable their platform to have ubiquitous compression," Feitelberg said of Garvan.

The Garvan Institute has one of the largest genomic datasets in the world, Feitelberg noted. "It was a very fruitful partnership in being able to build an infrastructure for them so that as much as they'll grow, they'll always grow in a compressed and efficient way," he said.

This holds true for accumulating data as well as for sharing it among researchers. "Instead of sending a full-blown dataset, being able to send a compressed version of it is really very, very important."

Geneformics does use proprietary compression technology. "This being a young industry, there are really no compression standards as of now," Keshet said. "You don't have your equivalent of JPEG or MPEG" in genomics. The company provides free decompression and reading software, however.

"Eventually, when this space matures, we expect the standards to be formed. At that point, we will have the technology and the [intellectual property] and the market presence to influence those," Keshet said.