NEW YORK (GenomeWeb) – Cloudera, a developer and provider of a secure data management and analytics platform, plans to train about 1,000 precision medicine researchers over the next three years on big data technologies and data science techniques.
Shawn Dolley, Cloudera industry leader, health & life science, told GenomeWeb that the company will provide over $3 million in software, training, and services to researchers in academic and government research institutions. The company will also provide no-cost subscriptions to the Cloudera platform for storing, processing, and analyzing big data to researchers at some 50 institutions. Lastly, Cloudera will work with researchers at these institutions to help them securely share data with collaborators at other institutions.
All of these initiatives are part of the company's commitment to the White House's Precision Medicine Initiative.
Cloudera plans to begin accepting applications from the research community soon. The firm will specifically seek applications from individuals involved in non-commercial research who plan to make their results and software open. Researchers from these institutions should be involved in precision medicine, genomics, epigenetics, or other omic disciplines, or be in a position to work in the space in the future. They also should be interested in merging clinical data with omic and/or environmental data and in a position to gather, host, and analyze public and private data. The company will provide additional details about its selection criteria and application process at a later date.
Dolley told GenomeWeb that Cloudera is still working out the exact details of both the training and software it will provide. The firm markets the Cloudera Enterprise platform, which is a data management and analytics platform built on Apache Hadoop and a number of other open-source technologies. The company's portfolio also includes professional services, support resources, and training.
"Cloudera is not a precision medicine organization. We are not a bioinformatics organization. But we are a big data information company," Dolley said. "The training that we'll probably be providing is to expose researchers who know a lot about clinical phenotype and genotype data ... [to] big data technologies that maybe they didn't need in the past."
This is especially important as genomics-based technologies, which have enabled researchers to generate and analyze large quantities of biological information, continue to gain ground and move into the clinical space.
"I'm personally so excited about [the] PMI, because I think it is one of the most obvious ways in which big data can make life better for real people," Mike Olson, Cloudera's co-founder and chief strategy officer, said in a statement. "Catching the tidal wave of data coming from genomics and proteomics, and studying whole organisms instead of just anatomy, creates whole new approaches to prevention and treatment."
According to Dolley, researchers in the genomic space are increasingly using some of the same technologies that Cloudera already uses and supports, including Apache Spark, an open-source data processing engine, and Apache Impala, an open-source analytic database for Apache Hadoop. As a result, more users in the space are turning to the company's platform.
"For folks who are doing downstream multi-omic analysis, the Spark language [and] the Spark access protocol to Hadoop is really becoming the default standard, and Cloudera is the largest support organization in the world for SPARK," Dolley said.
Also, "We are seeing Impala, which is now an open-source project, being a key piece of the reference architecture for what happens once I have a VCF file ... and want to merge that with EHR or phenotype data," he added. In terms of providing resources for the PMI, "I think that the technologies that will be delivered as part of the software component would include some of those."
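As a rough illustration of the kind of merge Dolley describes, the following Python sketch joins genotype calls from a tiny VCF-style file with annotation and phenotype records. It uses only the standard library and does not involve Impala, Spark, or any Cloudera software. The sample IDs, variant IDs, annotations, and diagnoses are invented placeholders, and the function names are hypothetical.

```python
# Toy inputs, invented for illustration only. Real VCF files and
# electronic health records carry far more fields than shown here.
VCF_TEXT = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2
1\t1000\trsA\tA\tG\t.\tPASS\t.\tGT\t0/1\t0/0
2\t2000\trsB\tC\tT\t.\tPASS\t.\tGT\t1/1\t0/1
"""

# Hypothetical annotation source keyed by variant ID.
ANNOTATIONS = {"rsA": "placeholder annotation A", "rsB": "placeholder annotation B"}

# Hypothetical phenotype records keyed by sample ID.
CLINICAL = {"S1": {"diagnosis": "condition-1"}, "S2": {"diagnosis": "condition-2"}}


def parse_vcf(text):
    """Yield (variant_id, sample_id, genotype) for every call in the text."""
    samples = []
    for line in text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the FORMAT column
            continue
        variant_id = fields[2]
        for sample, call in zip(samples, fields[9:]):
            # The genotype is the first colon-separated entry of each call.
            yield variant_id, sample, call.split(":")[0]


def link_genotypes(text, annotations, clinical):
    """Join non-reference genotype calls with annotation and phenotype data."""
    rows = []
    for variant_id, sample, genotype in parse_vcf(text):
        alleles = genotype.replace("|", "/").split("/")
        if all(allele in ("0", ".") for allele in alleles):
            continue  # drop homozygous-reference and missing calls
        rows.append({
            "sample": sample,
            "variant": variant_id,
            "genotype": genotype,
            "annotation": annotations.get(variant_id),
            "diagnosis": clinical.get(sample, {}).get("diagnosis"),
        })
    return rows


if __name__ == "__main__":
    for row in link_genotypes(VCF_TEXT, ANNOTATIONS, CLINICAL):
        print(row)
```

At research scale, the same join would typically be expressed as a query over tables in a distributed engine rather than as an in-memory loop.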
Dolley also noted that the company has made significant investments in software security to protect sensitive patient information in its platform, and it will work with customers to ensure that they keep their datasets safe. "A lot of times when you work with researchers, they've spent time with open-source genomic portals like [The Cancer Genome Atlas] and ... their own samples, but they have less expertise with full-fidelity electronic health records, and so in some cases they are coming to security for the first time," he said.
Although Cloudera isn't strictly a bioinformatics company, it is seeing increasing interest in its offerings from customers in the genomics market, including the Broad Institute, which recently began using Cloudera's Enterprise system. Cloudera is also a member of a Lockheed Martin-led alliance, which includes Illumina, that seeks to develop technology for diagnosing and treating diseases and caring for patients while protecting sensitive patient information.
"We probably have a dozen or more organizations doing either genomic or precision medicine use cases in research or clinical interventions ... on our platform, and we see that growing very quickly," Dolley said. "We believe the use case is pooling all that [genotype] data and curating it together with ... public data sources or licensable datasets that are usually for annotating the genotypes of particular patient[s] or sample[s] and then linking it with that clinical data."