Storage startup Ocarina Networks has identified the life-science research market in general, and second-generation sequencing labs in particular, as key customers for its compression technology for online storage, which it claims offers up to tenfold better data reduction than generic compression solutions.
This week, the company said that Cornell University's Center for Advanced Computing is testing its "content-aware" approach to data compression, which uses algorithms that optimize the compression based on specific patterns within the data itself.
The Cornell group, which will be using the company's ECOsystem to compress data from second-generation sequencers and other life-science instruments, is the first life-science user that Ocarina has disclosed. However, Carter George, Ocarina's vice president of products, said that around 20 different labs are currently evaluating the system, including several genomics centers. He did not elaborate.
The ECOsystem includes more than 100 algorithms that support more than 600 file types, including many life-science formats, such as .dat, .arr, and .cel files for the Affymetrix platform; .tiff and .txt files for Illumina; and the .srf short-read format.
Ocarina claims that a "first pass" of the ECOsystem can reduce the size of these files by up to 75 percent. That step is followed by a de-duplication step that the company claims can reduce a genomics, biomedical, or life-science data archive by up to 90 percent.
At Cornell, Ocarina is working with the Center for Advanced Computing and storage provider DataDirect Networks to test the ECOsystem on Cornell's DDN S2A9700 storage platform, which can manage up to 1.2 petabytes with throughput of up to 6 gigabytes per second.
David Lifka, director of the CAC and adjunct associate professor of computing and information science at Cornell, told BioInform that his group is "doing initial testing on it as we speak," and plans to go live within a week.
The CAC is a "research service unit" for the entire Cornell campus, with an increased need to "store massive amounts of data," Lifka said.
The center was looking for ways to get "storage at a price that is something that researchers can afford," but without compromising availability.
Lifka and his colleagues were targeting $1,000 per terabyte per year, including maintenance, for the system, which they plan to upgrade in three years. "Now with the Ocarina solution, it seems likely it won't be unusual for a researcher to get potentially up to 2 terabytes because of the compression rates for $1,000 per year," he said. "That's pretty compelling because you can be as cheap as the USB drives you can plug into your laptop with all the great performance and availability and characteristics."
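The arithmetic behind Lifka's estimate can be sketched in a few lines. This is only an illustration: it assumes the $1,000 target buys one terabyte of physical disk per year and that a "compression rate" means the fraction of space saved. The function names are hypothetical and do not come from Ocarina or Cornell.

```python
def effective_capacity_tb(physical_tb: float, space_saved: float) -> float:
    """Logical data that fits on `physical_tb` of disk when a fraction
    `space_saved` (0.5 means 50 percent) of every file is squeezed out."""
    if not 0.0 <= space_saved < 1.0:
        raise ValueError("space_saved must be in [0, 1)")
    return physical_tb / (1.0 - space_saved)


def cost_per_logical_tb(cost_per_physical_tb: float, space_saved: float) -> float:
    """Yearly cost of one terabyte as seen by the researcher."""
    if not 0.0 <= space_saved < 1.0:
        raise ValueError("space_saved must be in [0, 1)")
    return cost_per_physical_tb * (1.0 - space_saved)


if __name__ == "__main__":
    # Assumed inputs: $1,000 per physical terabyte per year, 50 percent saved.
    print(effective_capacity_tb(1.0, 0.5))    # logical TB per $1,000
    print(cost_per_logical_tb(1000.0, 0.5))   # dollars per logical TB
```

Under those assumptions, a 50 percent reduction is what turns one terabyte of disk into the "up to 2 terabytes" Lifka mentions; a 75 percent reduction would stretch the same disk to four.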
He said that he evaluated other storage systems, including BlueArc and Isilon, but found the DDN/Ocarina system had advantages in terms of price, performance, and data integrity.
Added Ocarina's George, "If you look at what's happening with devices, just buying disks isn't good enough; [it's] too expensive."
Ocarina's ECOsystem appliance sits between a storage box and users, George said. While it works well with the DDN system, it can work with any storage system, he added.
The system is made up of two parts: the Optimizer, which shrinks data and files, and the Reader, which is software that integrates with file servers to decompress the files. According to Ocarina, the Optimizer can compress as much as 5 terabytes of data daily.
One area that Ocarina views as a target market for its system is second-generation sequencing. "Illumina spits off a pretty big file," George said. As files come off the device, they are "hot," he said, with the compute cluster calculating intensity tables, base-calling, and quality scores followed by assembly.
For the first week, as data sits in primary storage, there is much input/output activity, so "we don't touch it during that time," he said. However, after that step, as access decreases, that becomes the "sweet spot for shrinking it."
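A policy like the one George describes could be approximated by selecting only files that have gone untouched for a week. The sketch below is an assumption-laden illustration, not Ocarina's scheduler: it uses each file's last-access timestamp as the "hot or cold" signal, which is unreliable on file systems mounted with access-time updates disabled, and the names `is_cold` and `cold_files` are hypothetical.

```python
import time
from pathlib import Path

# One week, matching the "hot" period George describes (assumed threshold).
COOL_DOWN_SECONDS = 7 * 24 * 60 * 60


def is_cold(last_access: float, now: float,
            cool_down: float = COOL_DOWN_SECONDS) -> bool:
    """True when the file has not been accessed for at least `cool_down` seconds."""
    return now - last_access >= cool_down


def cold_files(root, now=None):
    """Yield regular files under `root` that are candidates for compression."""
    now = time.time() if now is None else now
    for path in Path(root).rglob("*"):
        if path.is_file() and is_cold(path.stat().st_atime, now):
            yield path
```

Calling `list(cold_files("/some/data/dir"))` (a placeholder path) would return the files that have passed the one-week mark and leave everything still being read by the compute cluster alone.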
George said Ocarina has developed proprietary algorithms geared toward the life sciences. "Generic compressors don't get very good results on life-science files," he said.
For example, Ocarina has developed a compression algorithm exclusively for Illumina's intensity tables. "So I can tell you exactly what I'll get on a sequence read file on an Illumina intensity table," he said.
George said that Ocarina is "talking to" Illumina and other sequencing manufacturers but declined to offer further details. "We have a library of very specific compressors," he said. "Nobody else has put in the effort for specific compressors down to that level of detail, for specific data types."
George noted that Ocarina's technology differs from other compression systems from companies like EMC's Avamar or Rocksoft, which do "de-dupe," that is, they look through files to find duplicated chunks. "That kind of technology does not work at all on life-science data," he said. "We do both, but for the life-science market, we don't even turn on de-dupe because you are not going to find any duplicates."
A sequence read file might contain called bases, quality scores, and intensity table images. "If you want to compress an .srf file, you have to figure out what was put in it," George said. The boundaries between the different data types must be determined and different algorithms need to be called for different parts of that file, he added.
In addition, "the intensity file will have different patterns depending on what device made it." The ECOsystem algorithm is able to determine what kind of file it is and which instrument it originated from, he said. "Based on what it is, I'll call a more specific algorithm to deal with it."
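The idea of calling different algorithms for different parts of one file can be illustrated with a small dispatch table. This sketch assumes the segment boundaries and types have already been worked out, which George identifies as the hard part, and it substitutes general-purpose compressors from the Python standard library for Ocarina's proprietary ones. The segment labels and function names are hypothetical.

```python
import bz2
import lzma
import zlib

# Stand-in codecs: ordinary stdlib compressors, one per assumed segment type.
# A real content-aware system would plug specialized algorithms in here.
CODECS = {
    "bases": (bz2.compress, bz2.decompress),
    "quality": (lzma.compress, lzma.decompress),
    "intensity": (zlib.compress, zlib.decompress),
}


def compress_segments(segments):
    """Compress a list of (kind, payload) pairs, each with the codec for its kind."""
    packed = []
    for kind, payload in segments:
        compress, _ = CODECS[kind]  # KeyError signals an unknown segment type
        packed.append((kind, compress(payload)))
    return packed


def decompress_segments(packed):
    """Invert compress_segments, returning the original (kind, payload) pairs."""
    return [(kind, CODECS[kind][1](blob)) for kind, blob in packed]
```

Keeping the segment label next to each compressed chunk is what lets the reading side pick the matching decompressor without re-inspecting the data.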
At Cornell, life sciences make up the largest funded area of research, Lifka said. Over the last two weeks he and his colleagues have set up the DDN box with 42 terabytes of available storage.
"In week one, 22 of the terabytes were allocated to life sciences," he said, of which biotechnology accounted for 15 terabytes and the computational biology group accounted for the remaining 7 terabytes.
Lifka noted that he expects the center's storage needs to grow because it will have second-generation sequencing equipment coming online shortly.
Ocarina said that it can obtain 50- to 75-percent compression rates. "Fifty percent would be marvelous, and if they are getting better than that on certain data types that are important to the life sciences, then all the better," Lifka said.
One of the reasons Cornell chose Ocarina is the firm's past work with genomics-specific data types, such as read formats from the Illumina sequencing platform, said Nate Woddy, a research associate at Cornell who focuses on drug-discovery informatics.
Content-awareness plays out in several different ways, said Woddy, a former cheminformaticist at GlaxoSmithKline. When the Ocarina system compresses a file, he said, "it examines how well the different compression algorithms are actually working." This optimization process is based on the file type. For example, if it detects a file that is already compressed, Woddy said, it doesn't compress that file further.
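The behavior Woddy describes, trying several algorithms, keeping whichever works best, and leaving already-compressed files alone, might look roughly like the following. This is a minimal sketch rather than Ocarina's method: the candidates are general-purpose standard-library compressors, and the 5 percent minimum-gain threshold and the name `choose_compressor` are assumptions made for illustration.

```python
import bz2
import lzma
import zlib

# Candidate algorithms (stdlib stand-ins for a library of specialized ones).
CANDIDATES = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def choose_compressor(data: bytes, min_gain: float = 0.05):
    """Try every candidate and return (name, blob) for the smallest result.

    Returns (None, data) when no candidate saves at least `min_gain` of the
    original size, which is the typical outcome for data that is already
    compressed.
    """
    best_name, best_blob = None, data
    for name, compress in CANDIDATES.items():
        blob = compress(data)
        if len(blob) < len(best_blob):
            best_name, best_blob = name, blob
    if len(best_blob) > len(data) * (1.0 - min_gain):
        return None, data  # not worth it: store the file unchanged
    return best_name, best_blob
```

Measuring the actual output size, instead of trusting the file extension, is what lets this kind of check skip a file that is already compressed regardless of how it is named.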
"That's one of the reasons we are working with them, to make sure they do get exposed [to] the kinds of files that the people here are dealing with," he said. That will allow the company to optimize its compression algorithms, and will allow Cornell to get "the compression performance" out of the algorithms, Woddy said.
"They are going to be tuning their product, and we are going to be benefiting from getting better compression on our data, so it's a true partnership," Lifka said.
Lifka said that his team will look at Ocarina's compression rates to make sure the system compresses the data not only optimally but also correctly. "Write-once read-never data is not good," he said.
"One of the other advantages of Ocarina is that they give you immediate access to the data while they are decompressing it," Lifka said. That builds on the advantage of writing to disk over writing to tape. "Those are really the two things we are going to be really keen to see how well it works."
The next milestone will be looking at the most popular data types and "seeing what kind of compression rates we get," Lifka said. He also wants to ensure that researchers "are satisfied" in terms of data integrity and performance. "Then our plan is to scale it, grow it on demand because we think it is going to be the right solution for Cornell," he said.
As Ocarina's George explained, the Optimizer compresses a file and writes it out as a "shadow" file. "For a period of time you have the original and the compressed [file]," he said.
Once complete, the system decompresses the compressed file, takes a 256-bit cryptographic checksum, does that also on the original file, and compares the checksums. "If they match, that means we know we can recover the original file bit-for-bit lossless from the compressed file and we go ahead and replace the original file with the compressed file."
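The verify-before-replace sequence George outlines can be sketched as follows. Several details here are assumptions: SHA-256 is used because it is a common 256-bit cryptographic hash (the article does not name the one Ocarina uses), zlib stands in for the real compressor, the ".shadow" suffix and function names are hypothetical, and the whole file is read into memory for brevity.

```python
import hashlib
import os
import zlib
from pathlib import Path


def sha256_hex(data: bytes) -> str:
    """256-bit cryptographic checksum of `data`, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def compress_with_verification(original: Path) -> str:
    """Compress `original` to a shadow file, verify it, then swap it in.

    Returns the checksum of the uncompressed data so the caller can store it
    for later integrity checks. The original is left untouched on mismatch.
    """
    data = original.read_bytes()
    shadow = original.with_name(original.name + ".shadow")
    shadow.write_bytes(zlib.compress(data))  # original and shadow now coexist

    # Decompress what was actually written and compare checksums.
    restored = zlib.decompress(shadow.read_bytes())
    checksum = sha256_hex(data)
    if sha256_hex(restored) != checksum:
        shadow.unlink()
        raise ValueError("round-trip checksum mismatch; original kept")

    os.replace(shadow, original)  # the compressed file takes the original's place
    return checksum
```

Because the comparison runs on bytes read back from the shadow file, a write that went wrong is caught before the only uncompressed copy disappears.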
The checksum is stored in a database so that at any point in the future, the checksums can be compared. Although not common in the life sciences, other industries request two replicates of the compressed file to compare for so-called "silent" data corruption, which can occur when a disk is degrading, for example.
George said he spends about a third of his time working with potential life-science customers. Ocarina also has a foot in the film industry, as well as Web 2.0 photo sites and social-networking sites. The firm's customers include Kodak and Photoways, and the ECOsystem is installed "in three out of the top five" movie studios, he said.
Genomics has particular urgency, he said, "because there are a bunch of people screaming bloody murder" about their storage challenges.
Privately held Ocarina recently closed an undisclosed series B round of financing. Existing investors include VC firms Kleiner Perkins Caufield & Byers and Highland Capital Partners, as well as Stanford University.