Qumulo Targets Genomic Data Management for its File Storage Systems


CHICAGO (GenomeWeb) – This month, Qumulo announced that it had raised $93 million in Series D venture capital. BlackRock Private Equity Partners led the round, with participation from major names like Goldman Sachs and Western Digital.

Qumulo, a Seattle-based purveyor of large-scale file storage systems, isn't a life sciences company per se, but management sees genomic data management as an ideal application for its technology.

"Qumulo as a company actually got its start in media and entertainment, but we quickly realized that folks have huge data and performance requirements in life sciences," said Cofounder and Chief Technology Officer Peter Godman.

In life sciences, the six-year-old company initially served research institutes. "From there, we have moved increasingly into clinical applications, both on the imaging side as well as the genomics side," Godman said.

The company is primarily a file-storage vendor. Its flagship product, Qumulo File Fabric, or QF2, is touted as a "universal-scale file storage solution" that handles small and large pieces of data with equal efficiency, scales into the petabyte range, and can provide the security necessary for industries such as life sciences.

QF2 is also built to run either in local data centers or in public clouds.

"We're interested particularly in file storage. For unstructured data — things like genomes or little bits and pieces of genomes — unstructured data storage is typically used for this," Godman explained.

Unstructured data could be made up of either objects or files. "I'm a big believer that the future is about the convergence of those things, but the big gap for a lot of folks is scalable file storage, both on premises and in the public cloud," Godman said.

"[QF2] turns plain-old commodity hardware into giant, scalable file storage systems that scale beautifully, in terms of capacity and also performance," he said. "They do that not only on premises, but also in the public cloud, and they connect those two things together. That's why we call it the Qumulo File Fabric."

Godman called next-generation sequencing of genomes a "challenging" application for data managers.

"If you're going to use [data storage] for working storage, you have a large number of small files. But genomes are typically a relatively small number of large files," he noted. "Folks often want to process and reduce right next to the sequencer at the edge, and then later on, they want to send that data to the public cloud. Sometimes they want to send it to the cloud for processing."

Godman, who worked for Dell EMC legacy company Isilon Systems in the 2000s, contended that other technology isn't built for this kind of workflow the way QF2 is. "It doesn't do great with small files and it won't get you to the public cloud," he said.

"These [genomics] workflows require file storage. Ours allows for the mutation of data and also delivers really high performance and ... consistency. These things are really important to these kinds of workflows," Godman said.

Cloud hosting is becoming more important by the week. "Customers want to run their workflows in the cloud now for collaboration," Qumulo Principal Systems Engineer Steve Noel said in May at the annual Bio-IT World Conference in Boston. He said that the shift to the cloud has "stranded" large-scale file workloads, a problem that Qumulo is trying to address.

In life sciences, Qumulo started on the research side, but is now growing in the clinical domain. "Even just the idea of clinical genomics still kind of blows my mind," Godman admitted.

Current customers in life sciences include the Carnegie Institution for Science, CID Research, Channing Division of Network Medicine at Brigham and Women's Hospital, DarwinHealth, the Georgia Cancer Center at Augusta University, the Institute for Health Metrics and Evaluation at the University of Washington, Johns Hopkins Genomics, Progenity, and the University of Utah's Scientific Computing and Imaging Institute.

Godman said these organizations have been drawn by the need for high performance as well as a growing "tension" between file and object storage.

"In many ways, life sciences is at the vanguard of that debate," he said. "Object storage traditionally offers something that folks are really interested in in life sciences, which is the ability to store rich metadata associated with assets," such as medical images.

However, researchers and clinicians now have to access and modify plain data files from sources as diverse as genome sequencers and electronic health records.

Another growing challenge is machine learning.

"The world will figure out in a little while what's the hype and what's the reality, but we're starting to see folks use machine learning for anomaly detection in medical imaging, for example," Godman said. "Machine learning itself brings a whole host of new requirements with it that affect IT particularly."

One such requirement is the ability to build a "training set," usually with human intervention, to train algorithms. "That creates enormous demands for read throughput," Godman said, meaning the ability to read the same dataset over and over to train individual machine learning algorithms.

"As machine learning sweeps over all of these spaces [including life sciences], we are going to see the need for large storage systems that can deliver enormous amounts of read throughput," Godman said.

Genomics also brings unique security challenges. Godman noted that it is impossible to anonymize patients when genome sequences are present.

"A lot of folks are asking a lot of questions around how we maximize the amount of security we can bring to bear on these clinical applications when even having custody of the asset, which is a genome itself, reveals the identity of the individual from which it comes," he said. "In the end, you have to be amazingly good at keeping these assets secure."

He referenced the so-called Golden State Killer case, in which police this year identified a suspect in a decades-old string of rapes and murders in California, reportedly thanks to data uploaded to the free GEDmatch database.

"It seems to have also stirred up a lot of interest around" making data accessible, but with the right security, Godman said. "I think that that case is sort of a wake-up call."