NCI-Led Team Builds Petabyte-Scale Cancer Genome Data Repository


In an effort to manage data generated from the National Cancer Institute's cancer genome initiatives, the University of California, Santa Cruz, has launched a petabyte-sized data storage facility. Called the Cancer Genomics Hub, or CGHub, this new resource will contain data from NCI's The Cancer Genome Atlas, the Therapeutically Applicable Research to Generate Effective Treatments project, and the Cancer Genome Characterization Initiative.

"We're looking at a potential explosion in the ability to do the sequencing as the price comes down to $1,000 per specimen at the end of this year, and the costs of analysis are going to be up there with doing the costs of sequencing," says Robert Zimmerman, director of CGHub. "The creation of a repository for all the NCI's DNA and RNA data is motivated by the desire to make it easier for people to acquire the data in one place rather than search around at the various genome sequencing centers across the country."

At present, CGHub stores roughly 10,000 sequence information files from The Cancer Genome Atlas and also provides the open-source bioinformatics software package GeneTorrent. While CGHub is built to scale up to 30 petabytes of capacity, Zimmerman and his colleagues are also experimenting with data compression schemes, such as the European Bioinformatics Institute's CRAM, to help manage space. But choosing the best compression option is not without its challenges.

"There's an overhead of doing the compression and decompression — various groups are coming up with different compression schemes and we're evaluating them in terms of compression and decompression times," he says. "We're anticipating that there may be some challenges getting a consensus within the community as to which compression scheme is acceptable, and that will depend on how the data is used."
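The trade-off Zimmerman describes, space saved versus time spent compressing and decompressing, can be illustrated with a minimal sketch. It assumes Python's standard-library codecs (zlib, bz2, lzma) as stand-ins, since CRAM is not part of the standard library, and it uses a made-up repetitive nucleotide string rather than real sequence data. The `benchmark` helper and `SAMPLE` constant are hypothetical names, not anything CGHub uses.

```python
import bz2
import lzma
import time
import zlib


def benchmark(name, compress, decompress, data):
    """Time one compress/decompress round trip and report the size ratio."""
    start = time.perf_counter()
    packed = compress(data)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_s = time.perf_counter() - start

    # A lossless scheme must return the original bytes exactly.
    if restored != data:
        raise ValueError(f"{name} round trip was not lossless")

    return {
        "name": name,
        "ratio": len(data) / len(packed),
        "compress_s": compress_s,
        "decompress_s": decompress_s,
    }


# Hypothetical stand-in for sequence data: a short motif repeated many times.
SAMPLE = b"ACGTTGCAAGCTTAGC" * 50_000

# General-purpose codecs used only for illustration, not the schemes CGHub tested.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

results = [benchmark(name, c, d, SAMPLE) for name, (c, d) in CODECS.items()]

for r in results:
    print(
        f"{r['name']:5s} ratio={r['ratio']:.1f} "
        f"compress={r['compress_s']:.4f}s decompress={r['decompress_s']:.4f}s"
    )
```

Timings vary by machine and by how the data is accessed, which is part of why a single scheme may not suit every use.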

CGHub is funded by a $10.3 million contract through SAIC-Frederick, a contractor with the Frederick National Laboratory for Cancer Research, and is physically located at the San Diego Supercomputer Center. Researchers with the proper NCI security authorization can access CGHub over a 10-gigabit connection to the Internet2 research network.
