The operators of the Cancer Genomics Hub — a data repository that is hosted and maintained by researchers at the University of California, Santa Cruz — have added data from the National Cancer Institute's Therapeutically Applicable Research to Generate Effective Treatments (TARGET) project to the petabyte-scale resource.
According to the team, CGHub now contains about 400 terabytes of data from TARGET, a project that aims to understand the molecular anomalies that drive the development and progression of five types of childhood cancers, including neuroblastoma and acute myeloid leukemia. This is in addition to about 500 terabytes of data from The Cancer Genome Atlas (TCGA) program, which characterizes 25 adult cancer types and subtypes, as well as data from the Cancer Cell Line Encyclopedia (CCLE), a collaborative project between the Broad Institute and Novartis Institutes that has provided genomic information on about 1,000 cancer cell lines.
In total, the system currently holds more than one petabyte of data, and it is still collecting information from both the TCGA and TARGET projects, Linda Rosewood, program director for CGHub, told BioInform. Later in the year, she said, CGHub's organizers will begin uploading datasets from the Cancer Genome Characterization Initiative (CGCI), an NCI-funded program that supports genomics research on both pediatric and adult cancers. It is not yet clear how much data from this project CGHub will host.
CGHub was established in 2012 to manage sequence data from several projects led by the NCI's genomics research programs. The resource, which was designed to hold up to five petabytes of data, or more if necessary, is maintained and operated by a UCSC team led by David Haussler, a professor of biomolecular engineering at UCSC, and funded by the NCI through a $10.3 million subcontract with SAIC-Frederick, the prime contractor for the Frederick National Laboratory for Cancer Research. That subcontract is set to expire this year, and Rosewood said that by the time the project wraps up, the CGHub team will have "accomplished all of its immediate goals."
Since CGHub launched, its datasets have been downloaded and used in projects at large and small organizations for both research and application development, Rosewood said. The team reports that in recent months, downloads from CGHub have exceeded 1,000 terabytes per month. These downloads are facilitated by software infrastructure developed by Annai Systems, which enables rapid data transfer even for very large datasets, and by a specialized browser that makes it easy to find and download needed sequence files.
Much of CGHub's data is protected, and use is restricted to researchers who request and receive approval from the National Institutes of Health, a measure put in place to protect the privacy and intentions of those who contributed data to the projects. When the CGCI data become available later this year, they will also fall under the restricted-access segment of CGHub, and researchers will need NIH approval to use them.
However, data from the CCLE are publicly available, providing a useful research resource and a potential teaching tool to introduce students to genomics-based research and bioinformatics analysis, Rosewood said. Unlike the other datasets, the source materials for the CCLE project "are no longer associated with a particular person" and, as such, the privacy concerns that apply in the other cases aren't an issue.
CGHub is part of a larger plan that the NCI has for data generated from its projects. The agency intends to build a so-called cancer genomic data commons that would be responsible for collecting, quality-checking, and aggregating data from TCGA and similar projects. This would include the raw sequences currently in CGHub as well as higher-level data such as mutation calls. Rosewood said the UCSC team, along with collaborators, plans to submit a proposal for the contract to build the data commons.
Part of the plan is to develop cloud environments that would provide high-performance compute resources for data analysis and storage space so that researchers, including those with financial constraints and limited local resources, can access and use the data. To that end, last year, the NCI's Board of Scientific Advisors and the National Cancer Advisory Board approved a proposal to launch three pilots that would provide an opportunity for members of the community to propose and vote on appropriate cloud infrastructure. The NCI will release a formal broad agency announcement solicitation for the pilots on January 13, 2014.