UPDATE: this article has been updated to correct Jason Paragas' job title.
NEW YORK (GenomeWeb) – The National Cancer Institute and the US Department of Energy are partnering to bring the DOE's supercomputing infrastructure, data analytics, and expertise in algorithm development and other areas to bear on a number of cancer research efforts.
In a blog post published a little over a week ago, Warren Kibbe, director of the NCI Center for Biomedical Informatics and Information Technology, wrote that the partners spent the past 14 months planning a three-year pilot. Specifically, the NCI will work with computational scientists at Lawrence Livermore, Oak Ridge, Argonne, and Los Alamos National Laboratories on three pilot projects focused on pre-clinical cancer model development and therapeutic evaluation, improving outcomes for RAS-related cancers, and integrating information for cancer precision medicine.
The NCI-DOE partnership is rooted in discussions between the White House's Office of Science and Technology Policy, DOE, and a number of other agencies focused on computing challenges. Those talks contributed to the launch of the Exascale Computing Challenge, which aims to position the DOE as a provider of high-performance computing solutions for large-scale projects. The talks also resulted in a Presidential order issued last year creating the National Strategic Computing Initiative (NSCI) to encourage continued development of high-performance computing systems.
The talks also included an assessment of potential opportunities in the biomedical research domain that would be relevant to the NSCI, Kibbe told GenomeWeb. That led to more fine-grained talks between the NCI and DOE about ways to use the DOE's expertise in areas such as sensor systems and natural language processing, as well as its sizable supercomputing infrastructure, in cancer research.
"The DOE national labs have some of the foremost experts in computational science and mathematics," Kibbe said. "We realized quickly there was [this] tremendous opportunity ... to focus on some of the opportunities we have in cancer research where we've been generating enormous amounts of really detailed data."
For example, some NCI-funded projects generate large quantities of atomic-level protein structure and function data, which are used to explore how interactions at the atomic level contribute to pathway interactions, cell behaviors, and responses to therapy. There are also efforts to collect data from electronic health records across the country to feed cancer registries such as the NCI's SEER registries. "The DOE has a tremendous amount of expertise in sensor systems and natural language processing and it seemed like an enormous opportunity to build a partnership between the folks that have been applying these techniques in other disciplines and domains to cancer research," Kibbe said.
From the national labs' perspective, collaborating with the NCI was a natural fit. At Lawrence Livermore, "we were thinking about where the future of biosecurity was going ... for the next ten years in terms of the science and technology to work in that space," Jason Paragas, director of innovation at the lab, told GenomeWeb. Industries such as engineering and aerospace have been fundamentally transformed by advanced computing technologies, "but if you think about how we design drugs, it's still very much a master craftsman business," he said. "There is a little bit of compute there but not at the same level as these other industries are at."
High-throughput instruments, along with the decreased cost of owning them, make it possible for a lab not only to produce data in bulk but also to generate many different kinds of data. "In the last five years [the way] we measure biology has fundamentally changed," Paragas said. "Omics really can describe so much of the biology at scale so that when we just do regular bench work, we generate big data."
Simultaneously, advancements in supercomputing technology have resulted in powerful machines that are capable of modeling complex systems and handling massive quantities of data. "There was this convergence of what's going on in life sciences and in computing and big data," Paragas said. "We kind of convinced ourselves that the time is now ... for biology to do that transformation like all these other industries."
The four national labs have "put together a powerful computing ecosystem both in terms of machines, people and expertise," Paragas said. Together, the labs approached the NCI about a year ago to begin discussing ways that they could use these resources to help fill some of the gaps in the current body of knowledge on cancer biology. "They listened ... challenged us quite a bit and put us through a tremendous amount of due diligence," he said. "They very quickly became intellectual partners in this with us."
As a first step, the partners have planned a three-year pilot focused on three research efforts. One pilot involves developing computational approaches for research projects performed under the auspices of the RAS initiative, an NCI-led effort launched five years ago to explore new ways of targeting proteins encoded by mutant forms of RAS genes. These mutations are involved in more than 30 percent of human cancers. The RAS initiative will generate a broad pool of molecular and functional data that could help researchers better target RAS mutants.
Since its launch, researchers involved in the RAS initiative have generated large quantities of molecular and functional data that could be used to build valuable predictive models and simulations. Such models would be useful for studying the behavior of RAS mutants at the atomic level, including how mutations affect the biochemical functions and properties of the encoded proteins. These studies could also provide insight into potential drugs and assays. "The goal of the pilot … is really to explore how well we can do those kinds of predictive models and they are all based on atomistic and functional data coming out of the RAS initiative," Kibbe said.
The second pilot focuses on pre-clinical model development and evaluating therapies. "The pre-clinical models piece of that is we want to be able to take tissues from tumors in a patient and create different kinds of biological models from that tumor [including] patient-derived xenografts as well as more classic tissue culture cell lines and [provide] well-characterized biological models for the cancer community," Kibbe said. This includes generating detailed genomic, transcriptomic, and proteomic data for each specimen.
These datasets will be used to build computational models and simulations of tumors. Researchers will assess the efficacy of these models for predicting patients' response to therapies as well as which therapies would be most effective for their tumors. "That will probably take far longer than three years to know if we can really do that," Kibbe noted. "The pilot is really to explore what would that look like and how would we go about creating those kinds of mathematical and computational models."
The third pilot focuses on integrating information to enable cancer precision medicine. Specifically, the NCI hopes to use some of the labs' expertise in sensor networking capabilities and natural language processing to "scale our ability to monitor cancer patients across the whole country and to do that without having as much manual curation as we currently do," Kibbe said.
Efforts here will focus on the NCI-run SEER registry, which hosts information on some 30 percent of all cancer cases in the US, including data on incidence rates, patient outcomes, and long-term survival rates. "The SEER registry is a wonderful resource but it's very, very manual," Kibbe told GenomeWeb. Registrars around the country manually extract large quantities of cancer patient data from local systems and add it to the SEER registry at a rate of about 450,000 new cancer cases each year. The NCI hopes to automate portions of the data extraction and entry pipelines, freeing curators to focus on other tasks such as improving the quality of data that goes into the registry.
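Kibbe did not describe the automated pipeline in technical detail. As a purely illustrative sketch, the following Python snippet (standard library only; the report text, field names, and regex rules are all invented) shows the kind of structured extraction from free-text reports that such a system would perform before a curator reviews the result:

```python
import re

# Hypothetical free-text pathology report. Real SEER abstraction draws on many
# more sources (clinical notes, lab feeds, claims) and far richer vocabularies.
REPORT = """
Specimen: left breast, core biopsy.
Diagnosis: invasive ductal carcinoma, grade 2.
ER positive, PR positive, HER2 negative.
"""

# Simple keyword rules standing in for the trained language models a real
# pipeline would use; every field name and pattern here is invented.
RULES = {
    "site": re.compile(r"\b(breast|lung|colon|prostate)\b", re.I),
    "histology": re.compile(r"\b((?:ductal|lobular|squamous)\s+carcinoma)\b", re.I),
    "grade": re.compile(r"\bgrade\s*([1-4])\b", re.I),
    "er_status": re.compile(r"\bER\s+(positive|negative)\b", re.I),
}

def extract_fields(report_text):
    """Return registry-style fields found in one report, for curator review."""
    fields = {}
    for name, pattern in RULES.items():
        match = pattern.search(report_text)
        if match:
            fields[name] = match.group(1).lower()
    return fields

if __name__ == "__main__":
    print(extract_fields(REPORT))
    # {'site': 'breast', 'histology': 'ductal carcinoma', 'grade': '2', 'er_status': 'positive'}
```

In a production setting the hard-coded patterns would presumably give way to the natural language processing methods the DOE labs bring to the partnership, and the extracted fields would feed the registry's quality-control workflow rather than being entered directly.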
The NCI-DOE partnership has a different purpose than the NCI's Cancer Cloud Pilots and the Genomic Data Commons. Both of those initiatives, Kibbe said, focus on well-characterized and curated datasets. They are intended to enable easier analysis of TCGA data by co-locating it with storage, compute resources, and analysis tools.
"The scaling of those problems and the analysis of all those genomes is a big problem but it's dwarfed by the kinds of problems that we are talking about when we want to look at the molecular dynamics and we want to look at predictive modeling from genomic data," he said. The scale and complexity of the data make it very difficult to run these kinds of simulations in the cloud in a scalable fashion. In fact, the National Institutes of Health supports a program run by the Pittsburgh Supercomputing Center that provides free access to Anton, a special purpose supercomputer for molecular dynamics simulations of biomolecular systems.
"Cloud infrastructure has gotten good enough now that we can put a lot of genomic data in it and do a lot of computation around the genome and clinical data associated with those genomes," he said. However "to do a molecular dynamics simulation in the Amazon cloud, it just doesn't scale the same way. You really do need to have these high-performance computing infrastructures to make that happen."
In terms of resources, the labs will provide hardware and software as well as their expertise in applied mathematics, machine learning, algorithm development, and advanced simulation. This will include the Titan supercomputer at Oak Ridge National Laboratory, the second fastest supercomputing system in the world according to the most recent release of the twice-yearly Top 500 supercomputers list. "The scale of the data [in life sciences] is so big," Paragas said. "The complexity of the problem will really stretch these computers in new and unique ways. We fully expect it to drive computing in a really interesting way."
In addition to enabling cancer research, there is also an opportunity for the NCI projects to contribute to the next generation of supercomputing infrastructure at the DOE labs. Right now, the national lab supercomputers provide petascale processing speeds, Paragas said. For example, the Sequoia supercomputer at LLNL, ranked third fastest in last year's Top 500 list, delivers about 20 petaflops. The next generation of machines coming to the national labs, named CORAL for Collaboration of Oak Ridge, Argonne, and Livermore, will offer about 150 petaflops.
After that, the labs will look to move to exascale computing systems. "The opportunity for NCI is that when we get these new machines, we go through a co-design process [where] we build the hardware with a vendor [and then] build the operating system and the applications all together," he said. "We have the opportunity with this collaboration to really drive computing so that it really answers some of the key [questions] in life sciences."
Both partners will carve out portions of their budget to support their contributions to the collaboration. When the pilot wraps up, the partners will evaluate the initiative and plot next steps. "Part of the experiment is seeing if we can mix the culture of NCI and cancer research and the cancer research community with the community at DOE and its collaborators," Kibbe said. "The pilots are going to go really well and we will know if this experiment is successful in a couple of years."