NEW YORK (GenomeWeb News) – The University of California, Santa Cruz has used $10.3 million in funding from the National Institutes of Health to establish a new repository and user portal for storing and providing access to the massive volumes of data being pumped out by cancer genomics researchers each day, UCSC said today.
The new Cancer Genomics Hub, or CGHub, was supported by a grant from the National Cancer Institute through a subcontract with SAIC-Frederick, the main contractor for the Frederick National Laboratory for Cancer Research.
Currently in a beta release phase, the CGHub was designed to support data from NCI's large genome sequencing programs, including The Cancer Genome Atlas, the Therapeutically Applicable Research to Generate Effective Treatments program, and the Cancer Genome Characterization Initiative.
"By providing researchers with comprehensive catalogs of the key genomic changes in many major types and subtypes of cancer, these efforts will support the development of more effective ways to diagnose and treat cancer," David Haussler, a professor of biomolecular engineering in the Baskin School of Engineering at UCSC and a Howard Hughes Medical Institute investigator, said in a statement.
Located at the San Diego Supercomputer Center and managed by the UCSC team, the CGHub is linked to national research networks and centers around the country. It offers an automated query and download interface for large-scale, high-speed use, and it eventually will provide an interactive interface that will allow researchers to browse and query the system and download custom datasets via the web.
UCSC said there is "an urgent need for an efficient and user-friendly portal" to enable researchers to access the "staggering amounts of data" being generated by NCI's cancer genome projects. The university said that TCGA puts out around 10 terabytes of data each month – compared to the 45 terabytes created by the Hubble Space Telescope over two decades – and that output is expected to increase tenfold over the next two years.
"The scale of this [data deluge] is far beyond anything faced in medical research before," Haussler said. "Each genome file, the DNA record from a tumor or normal tissue, is 300 billion bytes. And for every case there are two of these files, the cancer genome and the normal genome. Add to this RNA sequence data, and the prospect of deeper sequencing in the future, and we must plan for up to a terabyte for each case."
Over the next four years, UCSC estimated, TCGA alone could produce 10 petabytes of data.
The CGHub currently is designed to hold 5 petabytes of data, with the expectation that it will grow later and that new data compression techniques will reduce the total storage space required.
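As a rough check on how these figures relate, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the variable names are made up for the example, decimal units (1 terabyte = 10^12 bytes) are assumed, and it ignores the compression gains and capacity growth that UCSC expects.

```python
# Back-of-the-envelope storage arithmetic using figures quoted in the article.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 PB = 10**15 bytes).
TB = 10**12
PB = 10**15

genome_file_bytes = 300 * 10**9   # "300 billion bytes" per genome file
files_per_case = 2                # tumor genome plus normal genome
planned_bytes_per_case = 1 * TB   # "up to a terabyte for each case"
hub_capacity_bytes = 5 * PB       # CGHub's current design capacity
tcga_monthly_bytes = 10 * TB      # TCGA's reported monthly output
hubble_total_bytes = 45 * TB      # Hubble's reported output over two decades

# DNA data alone for one case (two genome files).
dna_bytes_per_case = genome_file_bytes * files_per_case

# Cases that fit in the hub if each case takes the full planned terabyte.
cases_at_capacity = hub_capacity_bytes // planned_bytes_per_case

# Months of TCGA output needed to equal Hubble's two-decade total.
months_to_match_hubble = hubble_total_bytes / tcga_monthly_bytes

print(f"DNA data per case: {dna_bytes_per_case / 10**9:.0f} GB")
print(f"Cases that fit in 5 PB at 1 TB each: {cases_at_capacity}")
print(f"Months for TCGA to match Hubble's total: {months_to_match_hubble}")
```

Under these assumptions, the two genome files for a case come to 600 gigabytes, a 5-petabyte hub holds about 5,000 cases at a terabyte each, and TCGA at its current rate matches Hubble's two-decade total in 4.5 months.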
"Right now, cancer research needs something on a very large scale, like the Large Hadron Collider in physics," Haussler added. "Instead of bringing subatomic particles together in high-energy collisions and computing their behavior, we're bringing cancer genomes together in a common database and computing the disease drivers."
The center's core code and software for downloading data, the Annai-GNOS system, was licensed under a new multi-year agreement from Annai Systems, the Los Gatos, Calif.-based company said today. That system enables the transmission of hundreds of big data files at high speed, and it provides three levels of security to protect sequence data, Annai said.