The National Cancer Institute's Board of Scientific Advisors and the National Cancer Advisory Board last month approved a proposal to launch three Cancer Genomics Cloud pilots.
The pilots, presented during a joint meeting of the boards, seek to develop computing environments that provide access to co-located computing resources and storage and are pre-loaded with data from The Cancer Genome Atlas (TCGA).
During his presentation at the meeting, George Komatsoulis, NCI's chief information officer and interim director of its Center for Biomedical Informatics and Information Technology, said that the infrastructure would address limitations of current compute infrastructure used to manage and analyze data from NCI-funded projects like The Cancer Genome Atlas. He noted that while the current model of downloading and running analysis locally works well for smaller quantities of data, it quickly becomes costly and increasingly unsustainable for larger and more varied datasets like those generated by TCGA and similar projects.
By the time it wraps up in September 2014, TCGA is expected to have generated about 2.5 petabytes of data — it's currently produced about half a petabyte — not including "orthogonal data types" such as gene expression, copy number, epigenetic data, clinical annotations, and so on, he said. Conservative estimates of the costs of storing and protecting the raw TCGA BAM files using current infrastructure are approximately $2 million per year — which breaks down to 6.6 cents per gigabyte per month. Furthermore, assuming a dedicated 10-gigabit/sec network for data transfer, it would take users 23 days to download the raw data.
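The storage and transfer figures above can be reproduced with a short back-of-the-envelope calculation. The sketch below is illustrative only: it assumes decimal units (1 petabyte = 1,000,000 gigabytes), a fully saturated link with no protocol overhead, and function names that are our own rather than anything from the NCI presentation.

```python
# Back-of-the-envelope check of the storage-cost and download-time figures.
# Assumptions (not from the NCI presentation): decimal units and a link that
# runs at its full rated speed with no protocol overhead.

GB_PER_PB = 10**6          # gigabytes per petabyte (decimal)
BYTES_PER_PB = 10**15      # bytes per petabyte (decimal)
SECONDS_PER_DAY = 86_400


def cents_per_gb_month(petabytes, annual_cost_usd):
    """Storage cost in US cents per gigabyte per month."""
    gigabytes = petabytes * GB_PER_PB
    return annual_cost_usd * 100 / (gigabytes * 12)


def download_days(petabytes, link_gbit_per_sec):
    """Days needed to move the data over a dedicated link."""
    bits = petabytes * BYTES_PER_PB * 8
    seconds = bits / (link_gbit_per_sec * 10**9)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # 2.5 PB of raw data, $2 million per year, 10-gigabit/sec network
    print(f"{cents_per_gb_month(2.5, 2_000_000):.2f} cents per GB per month")
    print(f"{download_days(2.5, 10):.1f} days to download")
```

Under these assumptions the results come out to roughly 6.7 cents per gigabyte per month and just over 23 days, in line with the numbers Komatsoulis cited.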
A biomedical cloud would address these limitations by providing both high-performance computational capacity and storage space so that users, including those with financial constraints, can access and use the data, Komatsoulis said. Researchers also wouldn't be constrained by the time required to download the entire dataset since they would only need to move their own internal data and tools to the cloud. Both of these were concerns raised by the cancer research community, he said, in response to a letter sent out in April that asked for input on developing cloud infrastructure for cancer genomic data (BI 4/19/2013).
The NCI intends to fund the development of three pilot clouds that can each handle up to 2.5 petabytes of core TCGA data with at least one additional data type. The clouds are expected to support large numbers of users simultaneously, use defined data standards, and be easily interoperable with other databases and systems. Since three separate groups will be developing clouds, the NCI expects to see a broad array of infrastructure that covers the scope of the community's varied computing needs. Developers will also be expected to provide sustainability plans that include cost assessments for operating at current scale and at 10-fold increases in storage, compute, and usage.
At the center of the cloud infrastructure would be a cancer genomic data commons, which would be responsible for collecting the TCGA data, running quality controls on it, and aggregating it, and for creating a core dataset, composed of DNA and RNA sequence data and clinical annotations, that would reside in all the clouds. Users who want access to both data and compute power could then use the cloud for analysis, while those who have sufficient local capacity could download the data directly from the commons, Komatsoulis said. He told BioInform that the Cancer Genomics Hub, a TCGA repository developed and maintained by the University of California, Santa Cruz (BI 5/4/2012), would be part of the genomic data commons.
Once winning proposals have been selected, participants will have three months to design their systems and 12 months to construct the cloud, including an application programming interface and initial users' programs, and to create operational cost estimates. The pilots will then be evaluated over a six-month period by both the NCI and the community.
The NCI, he said, will evaluate the system in terms of whether or not the pilots meet the requirements set forth in each submitted proposal and also whether the systems handle the core data effectively. The community will also have a chance to evaluate the usefulness of these clouds to their research by participating in contests similar to those offered by firms like TopCoder. Komatsoulis told BioInform that these challenges will offer incentives to community members to develop new applications and capabilities and test them on the clouds.
Costs for the design and implementation phase are estimated at between $3 million and $5 million per cloud pilot, while the evaluation period is expected to cost about $500,000 per pilot. Meanwhile, the ballpark estimate for the operational phase, if the pilots are successful, is between $3 million and $5 million per cloud. Komatsoulis noted during his talk that costs for the operational phase will likely vary depending on a number of factors, for example the decision to use commodity clouds, which have low capital but higher operating expenses, or dedicated hardware, which has higher capital but lower operating costs.
The NCI's proposal also includes several possible long-term support models. Under one model, the NCI would take on the responsibility of funding and expanding the most successful clouds from the pilot phase; in another scenario, institutions could form consortia that would be responsible for supporting cloud instances; and a third option would be a commercial fee-for-service offering run, for example, by existing cloud providers. Komatsoulis said he expects that the final funding model will likely be a "hybrid" of the three options.
As part of efforts to make the clouds proposal a reality, the NCI issued a research and development sources notice about two weeks ago asking qualified groups such as small businesses and academic institutions to submit statements about their availability and capabilities by July 17.
Komatsoulis explained to BioInform that this isn't a request for proposals. The point is to "see what types of organizations out there have the potential to respond to … solicitations if they come in the future." It's not clear at this point when the NCI will solicit proposals from the community.
According to NCI's sources notice, statements should demonstrate respondents' "ability and experience in performing the requirements reflected in [the] notice." Specifically, responses should include, among other information, details about personnel with experience in both information technology and cancer genomics, examples of similar projects, and access to both office space and compute facilities.