NEW YORK (GenomeWeb) – Researchers at Dana-Farber Cancer Institute and the Ontario Institute for Cancer Research will participate in pilot projects to test the efficacy of a cloud infrastructure that uses Intel technology and is intended to make it easier for researchers and clinicians to explore disparate datasets.
Dana-Farber and OICR both said last week that they have joined the Collaborative Cancer Cloud (CCC), an Oregon Health & Science University-led initiative that aims to provide secure infrastructure for sharing and analyzing large quantities of oncology data. Planned projects will focus on datasets collected from patients with prostate cancer and other tumor subtypes.
The CCC is an implementation of platform-as-a-service infrastructure being developed by Intel to enable hospitals and research institutions to share private genomic, imaging, and clinical datasets without compromising the privacy of the contributing patients. It uses Intel-developed technologies to remotely query oncology clinical and research datasets held by institutions that have agreed to share their information.
When OHSU and Intel unveiled the platform last year, they announced that they would run a test in the first quarter of 2016 with two other cancer institutions, which they did not name at the time. The test is meant to validate the infrastructure and evaluate how well it works in actual clinical use before it is released more broadly.
"OHSU was a fantastic partner to get this started but if [we] are going to create a network, we need more than one partner," Ketan Paranjape Intel's general manager life sciences, told GenomeWeb last week. OHSU and Intel worked together to craft potential use cases and identify potential partners with sufficient datasets and the willingness to put the system through its paces. OICR and Dana-Farber had many of the same issues that the CCC platform seeks to address including finding effective ways of processing large datasets, sharing information securely, and scaling up analysis pipelines, he said, "so there was a natural synergy there."
Given the extent of its involvement in different genomics-based projects, "it is very much in OICR's interests and expertise to work with other institutions to develop standard protocols for jointly analyzing and distributing these large and complex datasets," Lincoln Stein, director of OICR's informatics and bio-computing program and a molecular genetics professor at the University of Toronto, told GenomeWeb. The center is involved in multiple genomics-based projects, including the Cancer Genome Collaboratory, a cloud-based infrastructure intended to host genomic data collected by the International Cancer Genome Consortium and offer bioinformatics tools for analysis. OICR also serves as the data-coordinating center for the ICGC and is involved in the development of the National Cancer Institute's Genomic Data Commons. In addition, it serves as the organizing institution for the Global Alliance for Genomics and Health.
OHSU approached OICR about a year ago about participating in the CCC and convinced the institute to sign on as a co-developer and beta tester, Stein said. "We had quite a few meetings where we talked about the high-level vision for the CCC as well as details about how it works, [and] what protocols are involved in using the system, and we liked very much what we saw," he told GenomeWeb.
The CCC's technical capabilities are what drew the two research institutions to the collaboration. "No one knows what the magic bullet or secret sauce is going to be so I think it's important to pursue several different parallel approaches," Stein said. "This one is particularly interesting because on the one hand it uses specialized hardware from Intel [including] bespoke systems for accelerating some of the common algorithms for sequence analysis [but] at the same time, it's built on top of a very sensibly designed system for moving software around to where the large datasets are in order to perform analyses in a way that's very efficient."
A distributed system like the CCC lets institutions keep their data in-house, giving access only to authorized users. Researchers at participating CCC sites can submit and run analysis jobs on data regardless of where the data are located. The Intel system also has a built-in trusted platform computing chip, which makes it harder to hack and reduces its vulnerability to malware. "Given the sensitivity of human genomic data, we would like to use as secure hardware as we can get," Stein said.
Some of the software tools the platform offers also made it attractive to OICR. One of these is a software product called TileDB, originally developed by researchers at the Massachusetts Institute of Technology and now being optimized and modified for use with the CCC. TileDB is a column-oriented database for storing data that have positional information associated with them, such as genomic variants. "It enables you to store information on a large number of variants in a relatively small space and retrieve it quite fast," Stein said.
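To illustrate the general idea of column-oriented storage for positional data, here is a minimal Python sketch. It is not TileDB's API or file format; the `VariantColumns` class, its fields, and the sample variants are hypothetical and made up for illustration. Each field is kept in its own list, sorted by genomic position, so a range query needs only a binary search on the position column.

```python
from bisect import bisect_left, bisect_right


class VariantColumns:
    """Toy column-oriented store (hypothetical; not TileDB's actual design)."""

    def __init__(self, records):
        # records: iterable of (position, ref, alt) tuples for one chromosome.
        rows = sorted(records)
        # One list per field, all ordered by position.
        self.positions = [r[0] for r in rows]
        self.refs = [r[1] for r in rows]
        self.alts = [r[2] for r in rows]

    def query(self, start, end):
        """Return variants with start <= position <= end."""
        lo = bisect_left(self.positions, start)
        hi = bisect_right(self.positions, end)
        return list(zip(self.positions[lo:hi], self.refs[lo:hi], self.alts[lo:hi]))


# Made-up example variants, inserted out of order on purpose.
store = VariantColumns([(1500, "G", "A"), (100, "A", "T"), (900, "C", "CT")])
hits = store.query(500, 1500)
```

Because the columns are sorted once at load time, each range lookup touches only the slice of each column that falls inside the requested interval.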
Like most cancer centers, "we are most interested in combining datasets to generate the best possible science but also the best possible patient information for our patients," Ethan Cerami, director of Dana-Farber's knowledge systems group and a lead scientist in the institute's biostatistics and computational biology department, told GenomeWeb. "We are looking at a number of initiatives including the CCC to enable us to break down those siloes and integrate data across multiple centers."
Like OICR, Dana-Farber was impressed by the mechanisms Intel put in place to ensure patient privacy and keep datasets safe. "It's an exciting initiative for us to be able to share genomic data across centers ... and do joint computation on datasets across multiple cancer centers ... in a way that preserves the confidentiality and security of the datasets that we've already generated," he said. "[It's] a very compelling platform."
Dana-Farber also liked that, unlike some other cloud computing infrastructures, the Intel infrastructure is located at the center itself. "We completely control the clusters, we can put whatever data we want on it and we can selectively share that data through the mechanisms that Intel has put in place," he said. "Other centers can push their code to our data, process their code on our data, and just the results of that computation are sent back to them so we can very carefully control the type of data access that goes on."
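The code-to-data pattern Cerami describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CCC's actual interface: the `DataNode` class, the `count_mutated` function, the user name, and the gene labels are all hypothetical. The point is that the analysis function travels to the node that holds the records, and only its return value travels back.

```python
class DataNode:
    """Hypothetical site that keeps records local and runs submitted code."""

    def __init__(self, name, records, authorized_users):
        self.name = name
        self._records = records  # held locally; never returned directly
        self._authorized = set(authorized_users)

    def run(self, user, analysis):
        # Refuse callers the site has not agreed to share with.
        if user not in self._authorized:
            raise PermissionError(f"{user} may not run jobs on {self.name}")
        # The analysis executes at the node; only its result is sent back.
        return analysis(self._records)


def count_mutated(records, gene="GENE_A"):
    """Example analysis returning an aggregate count, not patient-level rows."""
    return sum(1 for r in records if gene in r["mutated_genes"])


# Made-up records standing in for a site's private dataset.
toronto = DataNode(
    "toronto",
    [
        {"mutated_genes": {"GENE_A", "GENE_B"}},
        {"mutated_genes": {"GENE_C"}},
        {"mutated_genes": {"GENE_A"}},
    ],
    authorized_users={"remote_analyst"},
)
result = toronto.run("remote_analyst", count_mutated)
```

Note that this sketch only checks who may submit a job. A real deployment would also have to vet what the submitted code is allowed to return, since nothing here stops an analysis from handing back the raw records.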
For its role in the pilot, OICR will provide several hundred prostate cancer genomes for analysis, Stein said. The datasets come from the Canadian Prostate Cancer Genome Network, which aims to better predict treatment failure for intermediate-risk prostate cancers. The goal of the project, Stein said, will be to use the CCC infrastructure to find prognostic and predictive biomarkers for the cancer subtype.
"We'll be running the biomarker discovery software in Oregon on data that's physically stored on the node in Toronto," he told GenomeWeb. "We've done the biomarker discovery already using our local cluster so we know what we are expecting to get out. The test is really going to be do we get the right answer out? How long is it going to take?" They also want to figure out what modifications would be needed to get software to run on the CCC as well as potential problems end users might run into when they use the system.
In addition to providing datasets, OICR will adapt and optimize some commonly used genomic alignment and variant calling algorithms to run in the CCC environment, and put them into a library of tools and pipelines for genome analysis in the cloud, Stein said. These pipelines will let users upload raw output from sequencing instruments and obtain lists of variants found in tumor samples including nucleotide substitutions, insertions and deletions, structural variants, and more.
Meanwhile, Dana-Farber is still mulling which datasets it will use for the CCC pilot project. Cerami said the center plans to use datasets from some of its existing research-oriented projects initially but it is working out which datasets those will be and which cancer subtypes it will focus on. "We have a number of ideas but it's too premature to say what exactly they are," he said, adding that "we have to make sure that each of the member institutions have enough [samples with] those kinds of cancer types to do it for the pilot project."
Once they have validated the system on research samples, the developers hope to use it to share data from Dana-Farber's 'Profile' project, which was launched in 2011 by scientists at the Dana-Farber Cancer Institute and Brigham and Women's Hospital. The project aims to analyze data from tumor tissue collected from patients receiving treatment for all types of cancer. Investigators on the project, which now includes Boston Children's Hospital, have built a database of cancer-driving genetic abnormalities drawn from more than 15,000 genetic profiles of patients' tumors. They add about 400 new samples to the database each month. "Eventually part or all of the Profile project could be something that we securely share [but] the initial pilot projects will be slightly different datasets," Cerami said.
Like OICR, Dana-Farber has a number of internally developed genomic pipelines that it may contribute to the cloud platform so that others could use them, "but we haven't really defined that yet," Cerami said.
In future, Intel plans to make the CCC infrastructure more broadly available. "We are thinking of talking with the large cancer consortiums [for example] rather than talking to one partner at a time," Paranjape said. Also, if researchers want to use a single component of the CCC platform, they'll be able to do so. Intel is also exploring potential use cases for the infrastructure in the context of rare diseases as well as applications in the pharmaceutical industry, among others, he said.