Researchers from the University of California, Berkeley; UC San Francisco; and UC Santa Cruz recently published a white paper in which they discuss the technical feasibility of building a data warehouse that could host 1 million patient genomes and related clinical and pathological data.
Using cancer as a model, the researchers detailed the warehouse's architecture and design, hardware costs, memory capacity, storage, data formats, and data compression techniques. They also explored issues of patient consent and data privacy, called for a revision of current procedures for getting access to data for research, and suggested some ways that such a system could be implemented in the US.
The researchers argued that a resource like this would benefit biomedicine because it would make data available to researchers in a single, usable format so that they don't have to search for it in multiple databases and repositories. It would also provide a large enough sample pool for identifying relevant disease mutations with sufficient statistical power, they said.
While they noted that there are social, economic, and technical challenges to implementing such a data warehouse, the paper is intended to "stimulate discussion" about the need for this type of resource and "what its nature should be," the researchers wrote.
This week, BioInform spoke with David Haussler, a professor of biomolecular engineering at UCSC and a co-author on the white paper, about some of the benefits of a comprehensive warehouse for biomedicine and the challenges of setting one up.
What follows is an edited version of that conversation.
What got you started on this idea of building a virtual warehouse that could host a million genomes?
We built the CGHub database for the National Cancer Institute. That database is designed to hold up to 50,000 genomes and covers their research projects that are currently in the queue or planned, including the Cancer Genome Atlas and the [Therapeutically Applicable Research to Generate Effective Treatments] projects.
But … we need to get beyond tens of thousands of samples. There are 1.6 million new cancer cases in the US every year. We need to start thinking about getting those data for research, which means giving people the opportunity to donate their cancer genomics information to science. It's like being a blood donor. People need the opportunity to donate this important genetic information about their cancers so that we can learn to cure this disease. Barbara Wold [a molecular biology professor at the California Institute of Technology] proposed this Cancer Information Donor idea when she was doing her recent service at the National Institutes of Health.
It sounds like this would be something for research use only and not something that could necessarily be used for diagnostic purposes.
The way we do research based on individual trials and centers has had the effect of siloing data into separately consented groups in such a way that you can't aggregate it for big statistical analysis. That has inhibited science. If we continue going that way we lose a tremendous opportunity. However, even if we were able to somehow aggregate all of the research data, the cohorts are still small in both clinical trials and other organized research projects, whereas there are an enormous number of actual tumors that are treated in the course of clinical practice outside of clinical trials. If you really want to get the numbers up into the millions, you have to think about building an infrastructure that will be useful for both clinical practice and research.
That means that there are two different interfaces to the database. You secure the data for either set of compliance requirements (clinical or research) but in terms of who has access to it and how it's used, that would be different in clinical practice and in research. But there is no reason why there can't be an underlying data storage mechanism that is used in both cases.
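As a minimal sketch of that separation (hypothetical classes, not code from CGHub or the white paper), a single underlying store could sit behind two separately governed access paths, assuming a simple research-consent flag and researcher vetting:

```python
# Hypothetical illustration of one storage layer behind two governed interfaces.
# None of these class or field names come from CGHub or the white paper.

class GenomeStore:
    """Single underlying storage mechanism shared by both access paths."""
    def __init__(self):
        self._records = {}  # genome_id -> {"data": ..., "research_consented": bool}

    def put(self, genome_id, data, research_consented):
        self._records[genome_id] = {"data": data,
                                    "research_consented": research_consented}

    def get(self, genome_id):
        return self._records[genome_id]


class ClinicalInterface:
    """Access path governed by clinical compliance rules."""
    def __init__(self, store):
        self._store = store

    def fetch(self, genome_id, patients_under_care):
        # Assumed rule: only clinicians treating this patient may read the record.
        if genome_id not in patients_under_care:
            raise PermissionError("not authorized for this patient")
        return self._store.get(genome_id)["data"]


class ResearchInterface:
    """Access path governed by consent and researcher vetting."""
    def __init__(self, store):
        self._store = store

    def fetch(self, genome_id, researcher_is_vetted):
        # Assumed rule: the record must be research-consented and the researcher vetted.
        record = self._store.get(genome_id)
        if not (researcher_is_vetted and record["research_consented"]):
            raise PermissionError("no research access to this record")
        return record["data"]
```

The point of the sketch is only that access policy lives in the interfaces, so clinical and research compliance rules can differ while the storage mechanism underneath stays the same.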
Why not simply extend CGHub or some other existing database? Why build an entirely new warehouse?
One thing about CGHub is that it was built in the context of a specific government contract, and its specifications are to store the data in the most economical way, not to provide any computation on the data. You can get access to the data from the repository if you are authorized to do so, but then you copy the data to your own home computers to analyze it. But once you get to a million genomes, you will want to be able to go to where the data is and do your computation there. This means that there would have to be a cloud-based service where you could provide resources so people could do their computing onsite and then release those resources for others to use, without moving enormous amounts of data.
We estimate in the white paper that 1 million genomes [will be] 100 petabytes after compression. It really isn't practical to move 100 PB of data around. So it’s a different concept. You can't just add to or build on the existing databases.
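As a rough back-of-the-envelope illustration of that point (a sketch, not a calculation from the white paper): 100 PB works out to roughly 100 GB of compressed sequence data per case, and shipping the full collection over even a fast, fully saturated network link would take months to decades. The sustained link speeds below are illustrative assumptions.

```python
# Back-of-the-envelope arithmetic for moving 100 PB of compressed genome data.
# The 100 PB total comes from the estimate quoted above; the sustained link
# speeds are illustrative assumptions.
TOTAL_BYTES = 100 * 10**15                 # 100 petabytes
GENOMES = 1_000_000

print(f"~{TOTAL_BYTES / GENOMES / 10**9:.0f} GB of compressed data per case")

SECONDS_PER_DAY = 86_400
for gbps in (1, 10, 100):                  # hypothetical sustained link speeds
    bytes_per_second = gbps * 10**9 / 8
    days = TOTAL_BYTES / bytes_per_second / SECONDS_PER_DAY
    print(f"{gbps:>3} Gb/s sustained: ~{days:,.0f} days (~{days / 365:.1f} years)")
```

Even at a sustained 100 Gb/s, the transfer takes on the order of three months, which is the practical argument for bringing the computation to the data rather than the reverse.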
So are you suggesting that this warehouse should replace these existing databases or supplement them in some way?
I would hope that this database would supersede some of the smaller databases. My dream would be that people would get together and decide to build a larger database and the smaller databases would join in.
My main goal here is to prevent the siloing of data such that we can't aggregate it for joint research. One way to do that is to create a system that may be federated … but unified in terms of the application programming interface to it. That’s the key. It's how you access the data. We learned this from the World Wide Web. As soon as there is one consistent protocol for exchange of information then you have a blossoming.
You go into this in great detail in the paper, but could you give me a big-picture view of the system you envision?
We would hope that the database would be able to store genomes in a secure way. We would hope that patients who are involved in studies or even in just routine clinical treatment eventually would have the opportunity to be consented … to donate their data to research … fully understanding what they are getting into. Their data would be put into a repository for use in research. There would perhaps be some different ways that the patient or medical center or clinical trial could restrict the uses of the data, but within limits. Researchers then would be vetted and have appropriate research access to the data.
It is all about statistical power. We learned a lot from genome-wide association studies. [For example] if you have 100 patients and 100 controls and you look at their genomes, there may not be any one difference that is statistically significant. And so you go back and you get 1,000 cases and 1,000 controls. This happened with schizophrenia, for example. Based on 1,000 cases and 1,000 controls, the researchers still didn't see any smoking gun, [so one would think] maybe schizophrenia doesn't have much of a genetic contribution, and yet we know that it [does] from the way it runs in families … so what's going on? So you go back and you get 10,000 cases and 10,000 controls and then suddenly you [discover that] there is a difference between the genomes of the patients and the controls and it's statistically significant and then you start to make discoveries. That is what happens when you have the statistical power of a large number of samples.
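As a concrete illustration of that power argument (a sketch with assumed numbers, not data from any of the studies Haussler mentions), consider testing a variant carried by 25 percent of cases versus 20 percent of controls at the conventional genome-wide significance threshold of 5e-8:

```python
# Approximate power of a two-sample test for a difference in carrier frequency,
# using the usual normal approximation. The frequencies (25% in cases vs. 20%
# in controls) and the 5e-8 significance threshold are illustrative assumptions.
from scipy.stats import norm

def power_two_proportions(p_cases, p_controls, n_per_group, alpha=5e-8):
    p_bar = (p_cases + p_controls) / 2.0
    se_null = (2 * p_bar * (1 - p_bar) / n_per_group) ** 0.5
    se_alt = (p_cases * (1 - p_cases) / n_per_group
              + p_controls * (1 - p_controls) / n_per_group) ** 0.5
    z_crit = norm.ppf(1 - alpha / 2)              # two-sided critical value
    diff = abs(p_cases - p_controls)
    return norm.sf((z_crit * se_null - diff) / se_alt)

for n in (100, 1_000, 10_000):
    print(f"n = {n:>6,} per group: power ~ {power_two_proportions(0.25, 0.20, n):.3f}")
```

With these assumed numbers the power comes out to roughly 0.000, 0.003, and 0.999 for 100, 1,000, and 10,000 samples per group, which is the pattern Haussler describes: nothing, nothing, and then suddenly a signal.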
Cancer is not one [disease], but thousands of different diseases; there are subtypes of subtypes. It is a very complicated pattern of mutations that determines the different subtypes and the response that you are going to get to different therapies. If we don't have the numbers to assess these, it's hopeless. We have to have the raw numbers. So my passion right now is to try to create a way so that people can get together to get enough data so that we have the statistical power that we need in research.
I think there will be lots of benefits on the side for clinical practice because it is possible that the database and its associated software mechanisms can also be used in practice.
Where would you propose something like this should be hosted and who would be responsible for managing it?
That's a complicated question. This requires a good deal more discussion. A lot more work has to be done to work that out. The purpose of the white paper was just to establish that it's technically feasible and estimate what it would cost, maybe $50 per genome per year, and of course to make the case that it would be a good thing to do. But obviously that’s just the tip of the iceberg. Enormous amounts of additional work have to be done to actually make this happen. I'm hopeful that there will be more discussion.
By the way, why did you decide to publish this as a white paper rather than in a peer-reviewed journal?
It's too long to be in a peer-reviewed journal. We may publish a shortened version of it eventually.
You've already said that there's still a lot of work to be done to really get a warehouse going, but what would it take to at least get something started?
A lot of money, people, technology and good will.
The big problems remain to be worked out. There's no question [that] this report sidesteps the real difficult issues, but we decided to start by proving that it's technologically feasible.
You focused only on cancer for this paper but do you think this model could be applied to other disease conditions?
Yes, I do. Cancer is the high-water mark in the sense that it's the most challenging in terms of the genomic data and analysis because you have not only the patient's germline genome but also the first and second biopsy, the recurrence, et cetera. Further, each tumor is not one genome but a mixture of different clones that are growing within the same tumor. So the complexity of the DNA sequencing data in cancer is higher than in any other disease. If you can build it for cancer then you should certainly be able to build it for other diseases as well.
One of the most frustrating things that's happening now is that one group is building a genome database for diabetes research while another is building a genome database for Alzheimer's research, and if you want to study some mutation or property of the genomes that occurs in both, you can't mix those two datasets, or at least you have to go through two very different procedures to get access to them. In the dbGaP database, for example, each dataset is controlled by a different data access committee with a different set of access criteria, so to get a dozen datasets and analyze them all together, you may have to submit separate written applications to a dozen different data access committees. This may lead to months of paperwork and delay. When you set up that big a bureaucratic structural obstacle, the result is [that] people just don't do that kind of research.