Cloudera Bets Its Future on Scalability for Spark, GATK Support


CHICAGO (GenomeWeb) – When Shawn Dolley, global industry leader of health and life sciences at Cloudera, joined the Palo Alto, California-based software vendor in 2014, the top bioinformatics use case for Cloudera's technology was downstream variant stores. It still is, but perhaps not for much longer.

"Our world starts with the downstream," Dolley said, ticking off names of some Cloudera users, including Baylor College of Medicine and Seattle Children's Research Institute.

Then he got to the Broad Institute, which teamed with Intel, Google, and Cloudera to build version 4.0 of the Broad's Genome Analysis Toolkit.

"Broad wanted their most recent version of GATK to be completely developed in an open-source language called Spark," Dolley noted. Spark "is becoming the lingua franca of research computing pipeline generation," Dolley said.

"One of the trends that we have seen recently is that while Cloudera initially was a support organization for some of the big data technologies," he continued, "now probably a third of our demand is from folks who are doing computational pipelines and they want that to be in Spark."

Now, Cloudera — of which Intel holds an 18 percent stake — is among the largest providers of support for Spark when clinical data is involved, and is also a key cog in Hail, a Harvard-developed open-source variant store based on Hadoop, Apache Parquet, and Spark technologies.

Dolley made the bold prediction that as GATK4 gets widely adopted, the Hail variant store will be the "death of ADAM," a genomics data processing platform developed at the University of California, Berkeley. He said ADAM is "flawed to some degree."

Cloudera's pivot to Spark got a boost in early 2016, when the Obama administration invited Cloudera to join the public-private partnership now known as All of Us. Cloudera offered to contribute by giving away $3 million worth of training, software, and services to US academic institutions to build out their big data capabilities.

All of Us Director Eric Dishman came from Intel, and Dolley knew him there because the two companies worked together on machine learning and patient-level prediction even before Intel invested in Cloudera.

"The other big thing that happened at the time was that we had had a number of customers building their own variant stores," Dolley said. One of them, Benjamin Neale of the Analytic and Translational Genetics Unit at Massachusetts General Hospital and of the Broad Institute, needed a platform for analyzing a large genome-wide association study, so he enlisted a couple of Harvard mathematicians in his lab to put together Hail.

"We love it. We actually think that this is the best and most scalable architecture for folks who want to do half a million samples," Dolley said of Hail.

"As we move away from purely germline genetic and genomic work out into proteomics and metabolomics and transcriptomics, I would say the value prop from Cloudera is that we are the largest and most robust organization for Spark, which was the tie-in for GATK4. I would say that the 22,000 active users … as they start migrating to GATK4, which I think will be next year, are going to need to build out Spark capabilities," Dolley said.

"We're increasingly standing up health systems on the Hail-Cloudera variant store because it's low-cost," Dolley said. "You don't have to buy a DNAnexus. You don't have to buy a WuXi NextCode. You just have to have servers if you want to have this on premises, and most health systems are not really ready to put their clinical data on the cloud."

Indeed, Cloudera supports cloud setups on Amazon Web Services, Google Cloud, and Microsoft Azure, as well as in-house installations of its technology.

Organizations often adopt the free version of Cloudera's platform, but then end up becoming paying customers, for a number of reasons, Dolley claimed. "One main one is our tooling ... for machine learning and prediction. We have tooling for HIPAA. We have tooling for if you want to be in two cloud providers and [on premises] at the same time," he said.  

Some of the upstream alignment and assembly of pipelines might be suitable for such environments, but not more complicated processes like running GWAS and merging it with clinical data or performing natural-language processing. "You have to distribute it," according to Dolley.

"Could they write those things if they wanted to? Yes, they could. But guess what they'd rather do? They'd rather investigate the data and look at outcomes and rare variants and operate at the top of their licenses," he said.

So much time in genomic informatics is spent waiting for the computer to process data, which is why Spark, with its ability to run jobs in parallel, has become popular. "At this scale, you just can't do it on [graphics processing unit]. You can't do it in the [high-performance computing]," Dolley said.

"At the tip of the spear, I want to make sure we're very relevant when you get to a hospital pathology department dealing with a single patient," Dolley said.

"Scale we have." Dolley said that he has heard from many organizations that are not ready to handle such scale. Seattle Children's Research Institute, for example, wants to have a genome-based diagnosis on Friday for complicated patients who come in on Monday. "The middle three days is doing queries, tying contextual data together, and watching the computer churn," limiting the institute to one diagnosis a week, Dolley said.

That was not enough for the volume of complicated patients SCRI was seeing, so the institute brought in Cloudera and other distributed computing tools and can now make five such diagnoses per week with the same number of staff.

"All the time delay in the middle of the week, none of that is computer-oriented or script writing-oriented," Dolley said. "It's not glamorous. It's letting a researcher be a researcher, not a computer scientist."