Q&A: Stanford's Atul Butte on the Importance of Open Data for Biomedical Research


Atul Butte, an associate professor in Stanford University's School of Medicine, was recently named as one of 13 Open Science Champions of Change by the White House's Office of Science and Technology Policy for his efforts to use and support open data in biomedical research.

Butte's laboratory at Stanford builds and applies tools to molecular, clinical, and epidemiological data including genetic, genomic, and phenotype information to develop diagnostics and therapeutics, and obtain new insights into disease. He's helped launch a number of commercial startups including Personalis, which provides clinical interpretation of whole genome sequences, and NuMedii, which uses public data for drug repositioning (BI 12/2/2011).

Butte, who also heads Stanford's systems medicine division and directs the Center for Pediatric Bioinformatics at Lucile Packard Children's Hospital, recently talked to BioInform about the importance of open data to his research efforts.

What follows is an edited version of the conversation.

Congratulations on being selected as an Open Science Champion. When did you find out you'd been nominated?

I think the nominees heard a week or two weeks before the actual award was given so it was all pretty last minute. I'd heard I was being nominated but my understanding is that thousands of people were nominated so I really had no clue.

Were you being honored for a specific project you worked on or for your work in general?

My understanding was that it was my research work in general. A lot of my work revolves around driving scientific and medical discoveries from open data, especially open molecular, genetic, and epidemiological data.

You've been involved in a lot of interesting research, started at least three companies, and done much more. I think it's safe to say that you're quite well known in the community. Still, I think it would be good if you could recap a few of the projects you've worked on.

From the research perspective, most people probably know me for a couple of things. The first has been our idea of using open data to discover new uses for drugs. What we showed back in 2011 was that you can use public molecular data on drugs and … diseases and marry these … to come up with new uses for drugs. We've shown this process works in a couple of published examples, and you'll be seeing more of these coming out in papers. At a certain point, we started to realize that there also has to be a business case for having these drugs reach patients, otherwise there would not be enough funding to justify clinical trials to get the new labeled indications. Because of that, Gini Deshpande and I started NuMedii, which you reported just got funded last week as well. So this is one example of using public big data. Data is a frozen resource by itself, but then you have to add value and energy to it, interpret it, and reanalyze it, to figure out how to get knowledge and action from that data.

I'm also known for research in the medical interpretation of genetics data. Here, the main source of our interpretive power has been curating what is in publications. This is getting a little easier as more genetics findings appear in open access publications. We use this knowledgebase to medically analyze human genomes. As you know, John West, Euan Ashley, Mike Snyder, Russ Altman, and I started a company called Personalis based on our experience with analyzing genomes. In my lab, we have continued using our genetics knowledgebase to study when in human migration or [the] evolution [of] disease variants might have appeared. Erik Corona, a former graduate student in my lab, recently showed how he could reinterpret the open access Human Genome Diversity Project genotype data from the disease perspective, and “mashed” all of that with Google Maps.

Another source of data which is really not public yet — but needs to be — is de-identified clinical data. We've been studying in my lab how we can conduct basic science on human physiology using all of these clinical measurements health care providers are now making in hospitals. We have shown how we can study pain and disease progression using these clinical measurements.

But the biggest thing I've taken on, since January, is this resource called [the Immunology Database and Analysis Portal, or ImmPort] (BI 2/8/2013) from the National Institute of Allergy and Infectious Diseases, which we are working on with Northrop Grumman. I am now the principal investigator of one of these portals releasing data. And what I've already learned is that the next big open data that I think is going to change the medical world is clinical trials data. A lot of journals, as well as some funding agencies, have been calling for the release of all clinical trials data. So many clinical trials fail, and when they fail, there is often no publication or record of the outcome of the trial. And when trials succeed, it's often not clear how well they succeeded. At ImmPort, we have released more than 50 studies, many of which have raw clinical, cellular, or molecular measurements [from clinical trials]. Now, independent researchers can come in and reproduce findings, meta-analyze similar trials, discover new biological findings, innovate new biomarkers to subset patients responding to drugs (and maybe file patents and start companies on those), and maybe even develop new decision-support apps for physicians from clinical trials data, all from open data.

Do you have any other interesting projects planned or perhaps projects you’d be interested in working on in future?

I think we're going to have our hands full with ImmPort. But I am also encouraged that the National Institutes of Health is going to be releasing more opportunities for all of us from [its] Big Data to Knowledge initiative, BD2K.

You've said you leverage open data a lot in your research and clearly there's a lot of support from the biomedical community to make data more open. Is there more that can be done to bolster those efforts?

A lot more could be done. The clinical trials data is one example of what we are trying to do to release more data. That's a kind of data that really has been opaque to the public, and that needs to change. It's still not the first habit of people to release their molecular data. A large amount of data is still not publicly accessible. So I think existing types of data need to be better covered and released. But I'm an optimist. I do see a lot of data is successfully released; a lot of it is making it out there. Probably new types of molecular data need to be released, like imaging, functional brain data, proteomics, metabolomics, cellular data, clinical data, and others. But at the same time … we always have to be assured that we don't inadvertently release protected health information. I think the challenges there will still remain, but I think they are surmountable.

Final thoughts?

I want to thank all the scientists that release their data. Even if a few of us are on stage receiving this recognition, I know that I am only enabled because so many scientists are doing what they are supposed to do. If they are funded by public dollars or submitting to top journals, many of them are indeed releasing their data. I also have to thank the many scientists and groups that have developed the tools and standards that enable data sharing, like MGED/FGED. My own science has been driven forward only because of these scientists that enable the repositories of open data.
