ALBUQUERQUE, NM (GenomeWeb News) — The rate at which data are being produced shows no signs of slowing and, according to Philip Bourne, the new associate director for data science at the National Institutes of Health, is still increasing.
During a keynote lecture at the Association of Biomedical Resource Facilities annual conference, held here this week, Bourne said that by 2020 the global digital universe is forecast to comprise some 40 trillion gigabytes, though he noted that social media makes up a good chunk of that data.
Still, there's a "huge amount of potential information that can be accessed," he said.
Likewise, in GenBank as elsewhere, there's been an "incredibly rapid growth of data," Jennie Larkin, a program officer at the National Heart, Lung, and Blood Institute, said during a subsequent session at ABRF focused on the issue of big data.
"The cores are generating a lot of the problem — I mean, the big data," Larkin said to audience laughter.
It's not only, she added, an increase in the volume and size of the datasets, but also in the types of data. For instance, she noted that about a decade ago much of the genomics data generated was gene expression data, but it has since expanded to include sequencing, imaging, and other types of data. And these data, Larkin said, need to be handled together, rather than separately.
NIH, the ABRF speakers said, is working to get a handle on how to deal with big data, but needs input from the community.
NIH not only hired Bourne to the newly created position of associate director for data science, but has also launched the Big Data to Knowledge (BD2K) initiative.
BD2K, Larkin said, intends to jumpstart the field. It is focused on four objectives, namely to make data usable, to enhance data analysis techniques, to improve training, and to develop centers of excellence for data science. NIH has issued a number of RFAs, some that have recently closed and others that are still open, in these areas.
For instance, for the first aim, Larkin said NIH is working on developing a Data Discovery Index that would make data both findable and citable. While the RFA for the coordinating center has closed and is under review, Larkin noted that community-based data and metadata standards are needed. It wouldn't be helpful to be able to find and cite data if the data weren't usable, she said.
Here, she noted, the agency is working to identify existing standards and gap areas as well as support community-based efforts to develop standards.
Similarly, to facilitate data analysis, NIH is working on a Software Index that would work with, though be separate from, the DDI. Additionally, it is focusing on cloud-based analysis as well as data compression, visualization, and provenance methods.
On the institute level, Warren Kibbe, the director of the biomedical informatics and information technology center at the National Cancer Institute, said that NCI, which is grappling with the large volume of data generated by The Cancer Genome Atlas and the Therapeutically Applicable Research to Generate Effective Treatments project, is embarking on a pilot project to use the cloud to analyze the reams of big data.
Currently, datasets are still often downloaded to local computer workstations for analysis, but as datasets grow larger and begin to include more orthogonal data, download time and storage space become problematic.
"The problem with that is that it just doesn't scale," Kibbe said. He noted that it would take 23 days to download the 2.5 petabytes of TCGA data at 10 gigabytes per second.
For the NCI Cancer Genomics Cloud Pilots, both the data and analysis tools would be co-located in the cloud, removing the need for downloading and locally storing such vast datasets.
NCI also has related efforts such as the Genomic Data Commons, a repository of data from NCI-funded genomic research, Kibbe said.
Still, much of the 'how' of data sharing, Bourne noted, is still being hashed out, and NIH is seeking input from the community on what it can do to facilitate sharing.