To build the infrastructure needed to support global sharing of clinical and genomic information, the technical arm of the newly minted international data sharing alliance intends to focus on establishing and enhancing open standards and formats for storing and representing the data as well as application programming interfaces that will connect analysis tools to the data.
David Haussler, a professor of biomolecular engineering at the University of California, Santa Cruz, told BioInform this week that these standards will also determine which cloud vendors are selected as hardware providers for the alliance, as well as the kinds of third-party applications that will be developed to analyze and interpret the data.
According to a white paper published this week by the alliance, the standards developed will focus on "a small number of critical, low-level genome analysis tasks" and "not attempt to build specifications and mechanisms for all types of genome interpretation and use cases."
The alliance will also develop "appropriate standards for sequence analysis and clinical data interpretation" that will help "identify algorithms and processes that provide the most accurate and informative results."
The paper also explains that rather than build a centralized data warehouse, the alliance has opted to invest in cloud facilities because they offer major benefits in terms of costs and scalability and are "well suited to the large-scale and dynamic computational needs of platforms."
"We anticipate that with investment in compression and computational efficiencies, the cost of active data storage for one million whole genome datasets could be reduced to [approximately] $50/genome/year by 2014," they wrote. "Using archival storage … the storage cost could drop by 10x, and is likely to drop further." This is compared to $100/genome/year, the current storage costs for the Cancer Genomics Hub — a cancer data repository maintained by UCSC (BI 5/4/2012) — which uses custom built local infrastructure, the paper states.
A decentralized approach also lets the alliance honor participating countries' data handling regulations, Haussler said, and addresses some of the limitations of centralized infrastructure, namely a lack of the redundancy needed for security purposes and inflexibility in how data can be distributed.
The alliance plans to engage multiple vendors, Haussler said, to ensure "fair access and optimum pricing" and avoid a monopoly in which a single vendor controls access to all the data. In addition, many vendors already offer cloud infrastructure in multiple countries and have adapted their systems and services to comply with local data handling regulations, the white paper notes.
Right now, the alliance does not have a fixed timeline for when it plans to have defined standards and infrastructure in place. "There are a lot of people who have just joined the alliance and we need to give everybody a voice … so we'll be sorting that in a couple of months as the alliance takes shape," Haussler said.
So far, nearly 70 healthcare, research, and disease advocacy organizations have joined the group. Each organization has signed a letter of intent pledging to create a not-for-profit, public-private, non-governmental organization — modeled on the World Wide Web Consortium's approach — that will develop a common framework.
The list of signatories includes BGI-Shenzhen, the Broad Institute, the European Bioinformatics Institute, Memorial Sloan-Kettering Cancer Center, the National Cancer Institute, the National Institutes of Health, the New York Genome Center, and the Wellcome Trust Sanger Institute.
The alliance has its roots in a meeting that was held in January this year. A group of 50 colleagues met to discuss current challenges and opportunities in genomic research and medicine and how best to meet the needs of the patient, research, and clinical communities.
The participants in that meeting concluded that it would be necessary to form a global alliance responsible for creating and maintaining technical standards "for managing and sharing sequence data in clinical samples, developing guidelines and harmonizing procedures for privacy and ethics, and engaging stakeholders … to encourage responsible and voluntary sharing of data and methods."
In addition to a technical working group, the alliance intends to set up groups that will address regulation, law, and ethics; clinical and phenotypic data; and public engagement. It will seek funding from a variety of sources, including philanthropists, research grants, and member dues.