NEW YORK (GenomeWeb) – The National Cancer Institute announced today that it has launched the Genomic Data Commons (GDC), a data platform containing genomic and clinical data from cancer patients around the world.
The GDC data is accessible to researchers globally, and will serve as a core component of Vice President Biden's Cancer Moonshot program as well as President Obama's Precision Medicine Initiative. The cloud-based platform is housed at the University of Chicago Center for Data Intensive Science, in collaboration with the Ontario Institute for Cancer Research, all under an NCI contract with Leidos Biomedical Research.
At the American Association for Cancer Research's annual conference in April, NCI's Zhining Wang talked about the progress being made on the data-sharing module. Once the data is uploaded, Wang said at the time, it is harmonized by the system and made more user friendly before being made available for download. The point, Wang said, was to turn a data warehouse into a knowledge base.
The GDC will centralize, standardize, and make available data from large-scale NCI programs such as The Cancer Genome Atlas (TCGA) and its pediatric equivalent, Therapeutically Applicable Research to Generate Effective Treatments (TARGET), the NCI said. More importantly, the GDC is also encouraging any cancer researchers with genomic and molecular cancer data to upload their findings to the database so that other researchers may make use of the information.
On a conference call with reporters following the launch of the GDC, Robert Grossman, director of Chicago's Center for Data Intensive Science, said the GDC is currently built on cloud architecture and operates on a private cloud at the University of Chicago that can interoperate with the Amazon cloud. It will also soon be able to interoperate with Google and with Microsoft Azure, he added.
The data will be harmonized using standardized software algorithms, and the raw genomic data may be reanalyzed as computational methods and genome annotations improve, NCI said, further noting that the GDC also has safeguards in place to ensure the security of the data.
On the conference call, Center for Cancer Genomics Director Louis Staudt called the GDC "a real engine for precision medicine," adding that it is a way for any researcher to take advantage of the work the NCI and others have done to really dive deep into cancer and understand the relationships between seemingly important gene mutations and therapies.
One of the barriers to doing this before, according to Staudt, has been the sheer size of the data. If a researcher wanted to download all of the TCGA data in order to combine it with his or her own data, it would require three weeks of continuous downloading and about $1 million worth of hardware. The GDC solves that problem.
Warren Kibbe, director of the Center for Bioinformatics and Information Technology and deputy director of the NCI, added that the GDC operates under the NIH's genomic data sharing policy, which requires that data from any study conducted under the auspices of the NIH be shared within six months of the study's completion. With the GDC in place, NIH researchers will have time to get their data into the right format before submitting it to the GDC, and six months in which to publish papers based on those findings before the data becomes accessible to any qualified researcher. If there is a public-good reason to share data from an ongoing study in real time, Kibbe added, the GDC would be a perfect way to do that.
The GDC is the "foundation of moving forward," he noted, and a new way of thinking about how genomic data is made accessible.
Speaking to GenomeWeb after the call, Grossman also detailed some of the capabilities he hopes to help roll out in a future version of the GDC. For example, he said, researchers can now submit their data and have it harmonized with the data in the system. "Later," he added, "we're going to add a capability to the GDC so that you can take your data and have it analyzed in the context of the GDC" — in other words, researchers would be able to work with their own data and the data already in the system directly on the cloud instead of having to download it.
Allison Heath, GDC lead architect and Grossman lab researcher, further noted that it might be possible at some point to devise secure workspaces on the GDC cloud, making it easier for researchers to collaborate on the same data and on the same projects.
The architects of the GDC have also built in the capability for researchers eventually to upload informatics pipelines of their own design for analyzing genomic data, though that capability has not yet been activated.
"The informatics capability is a critical part of the design," said Heath. "The technical backend of the GDC is built to accept that. But we have to go through the pilot phase to see what the best way to integrate it would be. The initial phase is to get the data harmonized and get that released, and then over the next year or two, we'll have multiple phases where we'll have other things that it has been built to do. But we need to make sure we test and work to make sure we support the community."
Grossman added that right now, the system will be going through "a period of exploration where the GDC is part of the larger ecosystem that includes the Bionimbus [Protected Data Cloud] and other clouds that interoperate to the GDC API [application program interface]." During that period, he said, "if you're using Bionimbus or any of the NCI cloud pilots, or other systems that are securely and compliantly interoperating with the GDC, you could do computations on those systems. Over time, the best practices of the NCI cloud pilots and other best practices from people who interoperate with the GDC will be integrated into the next-generation version of the GDC."
Grossman also elaborated on the security measures being taken to ensure the safety of the data uploaded to the GDC. The GDC's API will determine whether data is public access or controlled access and, for controlled-access data, will verify a researcher's institutional authorization — checked through dbGaP — before granting access to the information. "'In the cloud' does not mean anyone can do what they want," Grossman said.
Eventually, Grossman added, the goal is to get away from researchers having to download data in order to use it. "We want the GDC to be the first of a new generation of systems that are able to work [with] data at scale through APIs."
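Working with data "at scale through APIs" means querying metadata programmatically rather than downloading whole datasets. The sketch below builds such a query; the endpoint, filter syntax, and field names are assumptions modeled on the public GDC API, and the project ID is illustrative.

```python
import json

# Metadata search endpoint (assumed from the public GDC API).
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

def build_files_query(project_id, data_type, size=10):
    """Build query parameters selecting files for one project and data type."""
    filters = {
        "op": "and",
        "content": [
            {"op": "=", "content": {"field": "cases.project.project_id",
                                    "value": [project_id]}},
            {"op": "=", "content": {"field": "data_type",
                                    "value": [data_type]}},
        ],
    }
    return {
        "filters": json.dumps(filters),  # the API expects JSON-encoded filters
        "fields": "file_id,file_name,file_size",
        "format": "JSON",
        "size": str(size),
    }

# Example: list gene-expression files from the (real) TCGA-BRCA project.
params = build_files_query("TCGA-BRCA", "Gene Expression Quantification")
```

Sending `params` to the files endpoint with an HTTP client would return a page of matching file records, letting a researcher decide what to analyze in place rather than pulling terabytes locally.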