The University of California, Santa Cruz, is building a petabyte-scale data repository that will provide access to genomic and clinical data generated by several projects led by the National Cancer Institute's cancer genome research programs.
The resource, called the Cancer Genomics Hub, or CGHub, is currently available as a beta release. It will be maintained by a team led by David Haussler, a professor of biomolecular engineering at UCSC. Haussler also oversees the UCSC Genome Browser and Cancer Genome Browser, the data coordination center for the Encyclopedia of DNA Elements project, and a number of other large-scale bioinformatics resources.
The project is funded by the NCI through a $10.3 million subcontract with SAIC-Frederick, the prime contractor for the Frederick National Laboratory for Cancer Research.
These funds will support the development and operation of the database through 2014, Haussler told BioInform.
CGHub is initially designed to hold five petabytes of data and to allow additional growth as needed. The UCSC researchers are also exploring new data-compression approaches to reduce the amount of storage that will be necessary.
The resource is located at the San Diego Supercomputer Center and is connected by high-performance national research networks to major centers nationwide that are participating in these projects, including UCSC.
Haussler's team designed the storage and computing infrastructure for the repository, which has an automated query and download interface for large-scale, high-speed use. It will eventually also include an interactive web-based interface to allow researchers to browse and query the system and download custom datasets.
In a statement, Haussler noted that providing researchers with "comprehensive catalogs of the key genomic changes in many major types and subtypes of cancer" will support efforts to develop more effective ways to diagnose and treat the disease.
CGHub will hold data from three major NCI cancer genome sequencing programs: the Cancer Genome Atlas; the Therapeutically Applicable Research to Generate Effective Treatments, or TARGET, project; and the Cancer Genome Characterization Initiative.
TCGA is a collaborative effort led by NCI and the National Human Genome Research Institute to map the genomic changes that occur in at least 20 major types and subtypes of adult cancer. TARGET is a related effort focusing on the five most common childhood cancers, and CGCI makes available genomic data from HIV-associated cancers and certain lymphoid and childhood cancers.
These projects are generating sequencing data at a scale "far beyond anything faced in medical research before," Haussler said. Currently, TCGA generates about 10 terabytes of data each month and its output is expected to increase tenfold or more over the next two years.
Furthermore, over the next four years, if the project produces a terabyte of DNA and RNA data from each of more than 10,000 patients, it will have produced 10 petabytes of data, Haussler said, noting that 10,000 cases is a small fraction of the 1.5 million new cancer cases diagnosed every year in the United States alone.
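Haussler's projection is straightforward back-of-envelope arithmetic, which can be checked directly; the figures below are the ones quoted in this article, not measurements:

```python
# Back-of-envelope check of the storage projection quoted above (decimal
# units: 1 PB = 1,000 TB). All inputs come from the article's own figures.
TB_PER_PATIENT = 1            # ~1 TB of DNA and RNA data per patient
PATIENTS = 10_000             # projected cases over the next four years
US_NEW_CASES_PER_YEAR = 1_500_000

total_pb = TB_PER_PATIENT * PATIENTS / 1_000
print(f"Projected volume: {total_pb:.0f} PB")           # Projected volume: 10 PB

# 10,000 cases is well under 1% of a single year's US diagnoses.
fraction = PATIENTS / US_NEW_CASES_PER_YEAR
print(f"Share of annual US diagnoses: {fraction:.2%}")  # Share of annual US diagnoses: 0.67%
```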
The database currently holds about 10,000 files containing sequence information from TCGA, Haussler said.
CGHub includes an open-source software package called GeneTorrent, which will allow researchers to move data back and forth between the hub and their home institutions, Haussler said.
It also includes scripts that enable users to search for files, such as those on specific cancer samples from particular institutions, and to automatically transfer to the user new files that match specific search criteria, he said.
CGHub also offers an application programming interface that will allow analysis pipelines to interact with the database: retrieve the CGHub index of files; select files for download based on metadata attributes such as cancer type, sequence type, source sequencing center, or date range; initiate the download; and finally confirm success.
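The four-step interaction described above can be sketched in a few lines. The index format, field names, and stub functions below are hypothetical stand-ins for illustration only, not CGHub's actual API or schema; a real pipeline would retrieve the index and files over the network:

```python
# Sketch of the pipeline workflow: fetch index, filter on metadata,
# download, confirm. All names and the in-memory index are hypothetical
# stand-ins; CGHub's real API and schema are not reproduced here.
from datetime import date

# Stand-in for the CGHub file index a pipeline would retrieve in step 1.
INDEX = [
    {"id": "f001", "cancer_type": "GBM", "seq_type": "WGS",
     "center": "BCM", "date": date(2012, 3, 1)},
    {"id": "f002", "cancer_type": "OV", "seq_type": "RNA-Seq",
     "center": "WUGSC", "date": date(2012, 4, 15)},
]

def select_files(index, cancer_type=None, seq_type=None,
                 center=None, after=None):
    """Step 2: select files by metadata attributes (all filters optional)."""
    hits = []
    for rec in index:
        if cancer_type and rec["cancer_type"] != cancer_type:
            continue
        if seq_type and rec["seq_type"] != seq_type:
            continue
        if center and rec["center"] != center:
            continue
        if after and rec["date"] <= after:
            continue
        hits.append(rec)
    return hits

def download(rec):
    """Steps 3-4: initiate the transfer and confirm success (stubbed here)."""
    return {"id": rec["id"], "ok": True}

selected = select_files(INDEX, cancer_type="GBM")
results = [download(r) for r in selected]
assert all(r["ok"] for r in results)
print([r["id"] for r in results])   # ['f001']
```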
CGHub will rely in part on a genomic network operating system developed by Annai Systems, dubbed Annai-GNOS, that will enable the transfer and management of the database's genomic data.
Annai's system includes three levels of protection for managing sequence data and metadata: user authentication, data access authorization, and secure data transfer over Transmission Control Protocol and Internet Protocol sessions using symmetric key encryption, the company said.
According to the company, the Annai-GNOS system enables the simultaneous transmission of hundreds of large data files, several hundred gigabytes in size, at speeds of multiple gigabits per second, limited only by the input/output rates of client and networking environments.
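The core idea behind symmetric-key protection of data in transit can be illustrated with a toy keystream cipher. This is exposition only: it is not Annai's protocol, which is not described beyond the summary above, and it is not production cryptography (a real system would use TLS or an authenticated cipher such as AES-GCM):

```python
# Toy illustration of symmetric-key encryption of a byte stream, the kind
# of in-transit protection described above. NOT Annai's actual protocol
# and NOT suitable for real use; shown only to make the concept concrete.
import hashlib

def keystream(key: bytes, length: int) -> bytes:
    """Derive a pseudorandom keystream from the shared key (SHA-256 in counter mode)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Encrypt or decrypt: XORing with the keystream is its own inverse."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

shared_key = b"pre-shared symmetric key"
chunk = b"ACGTACGT" * 4                            # stand-in for a data chunk
ciphertext = xor_bytes(chunk, shared_key)
assert ciphertext != chunk                         # scrambled on the wire
assert xor_bytes(ciphertext, shared_key) == chunk  # receiver recovers it
```

Because the same key drives both directions, only parties holding the pre-shared key can read the stream; the TCP session itself carries only ciphertext.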
'A More Flexible System'
Before CGHub launched, data from NCI's genomic projects was housed at the National Center for Biotechnology Information, Haussler told BioInform.
NCI's decision to launch a separate resource for cancer genomics data stemmed in part from a desire to "have a more flexible system where we could experiment with multiple analysis pipelines that exist close to the data and different ways of distributing the data," he explained.
One approach the CGHub development team is adopting is encouraging researchers to "co-locate" their own compute infrastructure near the database, which should make it easier to analyze the data in place rather than physically shipping large datasets between institutions, he said.
So far, CGHub has one rack of machines from the University of California, Berkeley, co-located near the database, and Haussler expects more groups to follow suit. This would ultimately create "a community of analysis platforms that are all built around a central database," he said.
For groups that prefer not to co-locate equipment, CGHub's developers suggest working with cloud computing providers to obtain the computation and storage capacity their datasets require.
The CGHub team is also exploring new data compression schemes that are expected to reduce the total storage space needed to hold the data, which Haussler said is "extremely important" given the size of the data sets involved.
"We would like to compress the data down to one tenth of its current size and that will not be possible without losing some information," he told BioInform. At present, the cancer genomics community is "working very hard to decide what information we can sacrifice in these very valuable data."
He said that the group is exploring two compression approaches: one, called CRAM, developed by researchers at the European Bioinformatics Institute, and a second developed by NCBI.
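One concrete example of the lossy trade-off Haussler describes is coarsening base-quality scores, a technique CRAM supports: collapsing scores into a few bins discards some information but makes the data far more compressible. The sketch below demonstrates the effect on synthetic data; it is an illustration of the binning idea only, not the CRAM format:

```python
# Demonstrates the lossy trade-off: binning base-quality scores (one of
# the techniques CRAM supports) shrinks the compressed size, at the cost
# of discarding fine-grained quality information. Synthetic data only;
# this is not the CRAM format itself.
import random
import zlib

random.seed(0)
# Synthetic Phred-style quality scores for 100,000 bases, values 0-39.
quals = bytes(random.randint(0, 39) for _ in range(100_000))

def bin_quals(q: bytes, bin_size: int = 10) -> bytes:
    """Lossy step: collapse each score to the midpoint of its bin."""
    return bytes((v // bin_size) * bin_size + bin_size // 2 for v in q)

raw = len(zlib.compress(quals, 9))
binned = len(zlib.compress(bin_quals(quals), 9))
print(f"lossless: {raw} bytes; after binning: {binned} bytes "
      f"({binned / raw:.0%} of the lossless size)")
```

Real quality scores are more correlated than this uniform synthetic stream, so practical gains differ, but the direction of the effect is the same: fewer distinct symbols means fewer bits per base after entropy coding.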