To manage the data generated by the National Cancer Institute's cancer genome initiatives, the University of California, Santa Cruz, has launched a petabyte-scale data storage facility. Called the Cancer Genomics Hub, or CGHub, the new resource will hold data from three NCI programs: The Cancer Genome Atlas, the Therapeutically Applicable Research to Generate Effective Treatments project, and the Cancer Genome Characterization Initiative.
"We're looking at a potential explosion in the ability to do the sequencing as the price comes down to $1,000 per specimen at the end of this year, and the costs of analysis are going to be up there with doing the costs of sequencing," says Robert Zimmerman, director of CGHub. "The creation of a repository for all the NCI's DNA and RNA data is motivated by the desire to make it easier for people to acquire the data in one place rather than search around at the various genome sequencing centers across the country."
At present, CGHub stores roughly 10,000 sequence information files from The Cancer Genome Atlas and also hosts the open-source bioinformatics software package Gene Torrent. While CGHub is built to scale up to 30 petabytes of capacity, Zimmerman and his colleagues are also experimenting with data compression algorithms, such as the European Bioinformatics Institute's CRAM algorithm, to help manage space. But choosing the best compression scheme is not without its challenges.
"There's an overhead of doing the compression and decompression — various groups are coming up with different compression schemes and we're evaluating them in terms of compression and decompression times," he says. "We're anticipating that there may be some challenges getting a consensus within the community as to which compression scheme is acceptable, and that will depend on how the data is used."
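The kind of evaluation Zimmerman describes, weighing space savings against compression and decompression time, can be illustrated with a minimal sketch. It is not CGHub's benchmark and does not use CRAM: it assumes general-purpose codecs from the Python standard library (zlib, bz2, lzma) as stand-ins, and the helper names and the synthetic random "reads" are hypothetical. Real sequence files would compress differently.

```python
import bz2
import lzma
import random
import time
import zlib


def make_synthetic_reads(n_bases=200_000, seed=0):
    """Return random A/C/G/T bytes (a hypothetical stand-in for real sequence data)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n_bases)).encode("ascii")


def benchmark(name, compress, decompress, data):
    """Time one compress/decompress round trip and report the compression ratio."""
    t0 = time.perf_counter()
    packed = compress(data)
    t1 = time.perf_counter()
    restored = decompress(packed)
    t2 = time.perf_counter()
    assert restored == data  # the round trip must be lossless
    return {
        "name": name,
        "ratio": len(data) / len(packed),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }


# General-purpose codecs used here as stand-ins; CRAM itself is not included.
CODECS = [
    ("zlib", zlib.compress, zlib.decompress),
    ("bz2", bz2.compress, bz2.decompress),
    ("lzma", lzma.compress, lzma.decompress),
]


def run_all(data):
    """Benchmark every codec in CODECS on the same input."""
    return [benchmark(name, c, d, data) for name, c, d in CODECS]


if __name__ == "__main__":
    for row in run_all(make_synthetic_reads()):
        print(
            f"{row['name']:>5}: ratio {row['ratio']:.2f}, "
            f"compress {row['compress_s']:.4f}s, "
            f"decompress {row['decompress_s']:.4f}s"
        )
```

A scheme with the best ratio is not automatically the best choice: if the data is read far more often than it is written, decompression time may matter more than compression time, which is the usage dependence the quote points to.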
CGHub is funded by a $10.3 million contract through SAIC-Frederick, a contractor with the Frederick National Laboratory for Cancer Research, and is physically located at the San Diego Supercomputer Center. Researchers with the proper NCI security authorization can access CGHub over a 10-gigabit connection to the Internet2 network.