Inova Bullish on Cloudera Technology for Genomics, but Still in Early Days


CHICAGO (GenomeWeb) – Cloudera may be headquartered in Palo Alto, California, but Shawn Dolley, the company's global industry leader of health and life sciences, lives near Washington, D.C. That gives him a front-row seat to the transformation of healthcare to precision medicine, courtesy of Inova Health System, one of Cloudera's largest healthcare customers.

Inova, based in Falls Church, Virginia, is among several Washington-area health systems that advertise that they offer genomics testing and precision medicine services. "It is routine to have genomics as a component of your medical care," Dolley said.

"It's surreal," he continued. When Dolley joined Cloudera three years ago after leading the healthcare division of Netezza, a data warehousing and analytics company that IBM bought in 2010, few large health systems were working with genomics.

Cloudera's cofounder, chief data scientist, and angel investor Jeff Hammerbacher — who struck it rich as an early Facebook employee — was and still is involved in genomics in his laboratory at Mount Sinai Health System in New York, but precision medicine was nearly unheard of at community-based providers like Inova.

The Inova Translational Medicine Institute existed when Dolley joined Cloudera in 2014, but it was nascent. "What happened since then was that folks went from microarrays to whole exomes and now, whole genomes," Dolley said of the entire world of clinical genomics.

Being near the nation's capital, Inova and its local competitors have a bit of an advantage over community health systems in other parts of America.

"In D.C., where Inova lives, we have a unique genomic footprint in that every country in the world sends people to D.C. to represent those countries," Dolley said. This gives Inova a more diverse clinical data set and variant store than just about any other place in the US.

Before turning to Cloudera, Inova researchers spent 80 percent of their time collecting and managing data rather than running analytics, according to Aaron Black, chief data officer at Inova Translational Medicine Institute.

Starting in 2015, Inova ran a yearlong pilot with Cloudera and the Institute for Systems Biology, which provided data scientists. Inova built a computing cluster with Cloudera technology on the Amazon Web Services cloud platform that ingested nearly 8,000 whole genomes. The source data was about 2.5 petabytes, but after filtering down to just the variants, the store took up 20 terabytes, which Inova had to normalize and load into the cluster in Apache Impala, Black said.
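To put those figures in perspective, the reduction from raw data to variants can be worked out in a few lines. This is only a back-of-envelope sketch: it assumes decimal units (1 PB = 1,000 TB) and rounds "nearly 8,000" up to 8,000 genomes, and the variable names are illustrative rather than anything Inova uses.

```python
# Rough arithmetic on the figures quoted in the article.
# Assumptions: decimal units (1 PB = 1000 TB, 1 TB = 1000 GB)
# and exactly 8,000 genomes (the article says "nearly 8,000").
SOURCE_TB = 2.5 * 1000   # about 2.5 PB of source data
VARIANT_TB = 20.0        # variant store after filtering
GENOMES = 8000

reduction_factor = SOURCE_TB / VARIANT_TB
source_gb_per_genome = SOURCE_TB * 1000 / GENOMES
variant_gb_per_genome = VARIANT_TB * 1000 / GENOMES

print(f"Reduction factor: about {reduction_factor:.0f}x")
print(f"Source data per genome: about {source_gb_per_genome:.1f} GB")
print(f"Variant data per genome: about {variant_gb_per_genome:.1f} GB")
```

Under those assumptions, filtering to variants shrinks the data by roughly two orders of magnitude, from a few hundred gigabytes per genome to a few gigabytes.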

The resulting variant store contained millions of rows, and it had to be scalable, Black said. "The cloud was much easier than buying a bunch of hardware and tinkering with it."

But the setup proved expensive because of monthly cloud-hosting fees, Black said: the cluster was up 24/7 rather than on demand. Because Cloudera is compatible with local hosting as well as the big three cloud services (AWS, Google Cloud, and Microsoft Azure), Inova decided to bring some of the processing in house for better performance and lower cost, he said.

Working with Cloudera and Intel, Inova set up an on-premises cluster that went live in January. "We ingested the data in Amazon and we didn't have to redo it," Black said. "The data was portable."

The cluster writes data in parallel, so it only took about 10 hours to transfer the 20 TB of data to the local host, Black reported. It's now all backed up in AWS.
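The sustained throughput implied by that transfer can be estimated as follows. This is a rough sketch that assumes decimal units (1 TB = 10^12 bytes) and a constant rate over the full 10 hours; the article does not report the actual network configuration, so the result is an average, not a measured link speed.

```python
# Average throughput implied by moving 20 TB in about 10 hours.
# Assumptions: 1 TB = 1e12 bytes and a constant transfer rate.
TRANSFER_TB = 20.0
TRANSFER_HOURS = 10.0

total_bytes = TRANSFER_TB * 1e12
total_seconds = TRANSFER_HOURS * 3600

mb_per_second = total_bytes / total_seconds / 1e6
gbit_per_second = total_bytes * 8 / total_seconds / 1e9

print(f"Average rate: about {mb_per_second:.0f} MB/s")
print(f"Equivalent to about {gbit_per_second:.1f} Gbit/s")
```

On those assumptions the transfer averaged a little over half a gigabyte per second, a rate that is more plausible with many parallel streams than with a single one.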

At last month's American Society of Human Genetics meeting in Orlando, Florida, Jerry Liu, senior bioinformatics scientist at Inova Translational Medicine Institute, gave a presentation on how Inova data scientists applied next-generation sequencing analysis and machine-learning tools for phenome-wide association studies while running in a parallel computing environment to make researchers more efficient.

"We had a lot of success, but there were a lot of things that we didn't realize until we actually did it," Black said. "This world is so new, it's not like you can go to a reference architecture. We don't have anything like that." Data engineers and scientists have to work together in ways they had not before.

Still, clinical genomics and the Cloudera installation that supports it at Inova are pretty new. "It's got a lot of potential, but we're just like in the first inning. We're just at the cusp of it," Black said. In fact, ITMI continues to evaluate whether to invest further in this technology.

"There's a lot of people out there that do this, some well, some not. Some will be more marketing than reality, and we're just starting to dig in. We feel like we found the right partner, but we're still very early," Black said.

Inova's goal is to open data access to as many people as want it in a secure environment. "We've been supporting ITMI and their scientists quite a bit, but Inova's vision is larger than just one institute or one particular type of disease," Black said. Inova Health System wants to bring the ITMI tools to other institutes that rely on genomic data.

"A lot of these systems are now bursting so you can now connect directly to these cloud providers," Black said. "In the future, you're not even going to have to move [the data]. As long as you have the secure connection to these clouds, a lot of the analytics can be done there and the results returned."