ABRF Panelists Discuss NIH Efforts for Sharing Big Data

ALBUQUERQUE, NM (GenomeWeb News) — The rate at which data are being produced shows no signs of slowing and, according to Philip Bourne, the new associate director for data science at the National Institutes of Health, is still increasing.

During a keynote lecture at the Association of Biomedical Resource Facilities annual conference, held here this week, Bourne said that by 2020 the global digital universe is forecast to comprise some 40 trillion gigabytes, though he noted that social media makes up a good chunk of that data.

Still, there's a "huge amount of potential information that can be accessed," he said.

Likewise, in GenBank as elsewhere, there's been an "incredibly rapid growth of data," Jennie Larkin, a program officer at the National Heart, Lung, and Blood Institute, said during a subsequent session at ABRF focused on the issue of big data.

"The cores are generating a lot of the problem — I mean, the big data," Larkin said to audience laughter.

It's not only the volume and size of the datasets that are increasing, she added, but also the types of data. For instance, she noted that about a decade ago much of the genomics data generated was gene expression data, but it has now expanded to include sequencing, imaging, and other types of data. And these data, Larkin said, need to be handled together, rather than separately.

NIH, the ABRF speakers said, is working to get a handle on how to deal with big data, but needs input from the community.

NIH has not only hired Bourne to the newly created position of associate director for data science, but has also launched the Big Data to Knowledge (BD2K) initiative.

BD2K, Larkin said, intends to jumpstart the field. It is focused on four objectives, namely to make data usable, to enhance data analysis techniques, to improve training, and to develop centers of excellence for data science. NIH has issued a number of RFAs, some that have recently closed and others that are still open, in these areas.

For instance, for the first aim, Larkin said NIH is working on developing a Data Discovery Index (DDI) that would make data both findable and citable. While the RFA for the coordinating center has closed and is under review, Larkin noted that community-based data and metadata standards are needed. It wouldn't be helpful to be able to find and cite data if the data weren't usable, she said.

Here, she noted, the agency is working to identify existing standards and gap areas as well as support community-based efforts to develop standards.

Similarly, to facilitate data analysis, NIH is working on a Software Index that would work with, though be separate from, the DDI. Additionally, it is focusing on cloud-based analysis as well as data compression, visualization, and provenance methods.

On the institute level, Warren Kibbe, the director of the biomedical informatics and information technology center at the National Cancer Institute, said that NCI, which is grappling with the large volume of data generated by The Cancer Genome Atlas and the Therapeutically Applicable Research to Generate Effective Treatments project, is embarking on a pilot project to use the cloud to analyze the reams of big data.

Currently, datasets are still often downloaded to local computer workstations for analysis, but as datasets get larger and begin to include more orthogonal data, download time and storage space for the data become problematic.

"The problem with that is that it just doesn't scale," Kibbe said. He noted that it would take 23 days to download the 2.5 petabytes of TCGA data at 10 gigabits per second.
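
The 23-day figure can be checked with simple arithmetic. The following is a minimal sketch, not anything presented at the session: it assumes decimal units (1 petabyte = 10^15 bytes), a link that is sustained at full rate with no protocol overhead, and a rate of 10 gigabits per second, which is the reading that reproduces roughly 23 days. The function and variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 10**15   # bytes, decimal units (assumption)
GIGABIT = 10**9     # bits, decimal units (assumption)


def download_days(size_bytes: float, rate_bits_per_second: float) -> float:
    """Days needed to move size_bytes over a link sustained at the given rate."""
    seconds = size_bytes * 8 / rate_bits_per_second
    return seconds / SECONDS_PER_DAY


tcga_bytes = 2.5 * PETABYTE

# 10 gigabits per second: about 23 days, matching the figure quoted.
days_at_10_gbit = download_days(tcga_bytes, 10 * GIGABIT)

# 10 gigabytes per second (eight times faster) would take under 3 days.
days_at_10_gbyte = download_days(tcga_bytes, 10 * 8 * GIGABIT)

print(f"10 Gbit/s: {days_at_10_gbit:.1f} days")
print(f"10 GB/s:   {days_at_10_gbyte:.1f} days")
```

Under these assumptions the transfer takes about 23.1 days at 10 gigabits per second, and real-world throughput would typically be lower than the nominal link rate.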

For the NCI Cancer Genomics Cloud Pilots, both the data and analysis tools would be co-located in the cloud, removing the need for downloading and locally storing such vast datasets.

NCI also has similar efforts such as the Genomics Data Commons, a repository of NCI-funded genomic research, Kibbe said.

Still, much of the 'how' of data sharing, Bourne noted, is being hashed out, and NIH is seeking input from the community on what it can do to facilitate sharing.
