BOSTON (GenomeWeb) – By the fourth quarter of 2019, participants in the National Institutes of Health's All of Us Research Program will be able to see their own sequencing and blood assay data online.
Since All of Us is collecting samples and health data from 1 million people at healthcare facilities across the country, this dissemination will only work because NIH and its partners are standardizing results according to the Observational Medical Outcomes Partnership (OMOP) Common Data Model. All of Us is also normalizing phenotypic information with the Substitutable Medical Apps, Reusable Technology (SMART) on FHIR framework, which is built on the Fast Healthcare Interoperability Resources (FHIR) standard.
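To make that data exchange concrete, here is a minimal sketch of a blood-assay result encoded as a FHIR R4 Observation resource, the kind of record a SMART on FHIR app consumes. The field layout follows the FHIR Observation structure, but the patient reference and values are illustrative placeholders, not actual All of Us records.

```python
# Hypothetical FHIR R4 Observation for a blood assay result.
# The LOINC code 718-7 denotes "Hemoglobin [Mass/volume] in Blood";
# the participant reference and measurement are made up for illustration.
import json

observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {
        "coding": [{
            "system": "http://loinc.org",
            "code": "718-7",
            "display": "Hemoglobin [Mass/volume] in Blood",
        }]
    },
    "subject": {"reference": "Patient/example"},  # hypothetical participant
    "valueQuantity": {
        "value": 13.6,
        "unit": "g/dL",
        "system": "http://unitsofmeasure.org",
        "code": "g/dL",
    },
}

print(json.dumps(observation, indent=2))
```

Because every site emits the same resource shape, a downstream portal can render any participant's results without site-specific parsing, which is the interoperability payoff Wilbanks is pointing at.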
In a keynote address to open the annual Bio-IT World Conference & Expo here yesterday, John Wilbanks, chief commons officer at Sage Bionetworks, was clear about his preference for those standards to promote interoperability.
"Choose OMOP or SMART on FHIR and don't choose anything else," he said. The openness of standards and of data itself is key, according to Wilbanks, a longtime advocate of open data.
Wilbanks called open science a "suite of methods" that can raise researchers' confidence in scientific claims, and spoke about the "justification" of science.
"If there are two [journal articles], we don't know which one of them has claims that are more scientifically justified than the other one because one of them may have run the same experiment 150 times and it worked once and one of them may have run the same experiment 150 times and it worked 150 times in a row," Wilbanks explained in a separate interview with GenomeWeb.
"One of those in my opinion is much more justified than the other, but the process of publishing it collapses all that," he said. "Making the data available, making the claims that we're going to be investigating available … we evaluate how justified that paper is."
At Bio-IT, Wilbanks showed a collage of photos of Labradoodles and fried chicken that resulted from an online photo search for Labradoodles, humorously illustrating how working with a limited data set can create inadequate algorithms. The search engine apparently had been trained to spot the designer breed of dog based more on color than other factors such as fur patterns, shape, and facial characteristics.
"It's a data problem," Wilbanks said. The same thing often happens when machines try to find genomic variants of nonwhite people since research pools in the US and Western Europe tend to be heavily skewed toward white subjects, according to Wilbanks.
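The failure mode Wilbanks describes can be sketched with a toy classifier. All data here is hypothetical, not his actual example: if every Labradoodle in the training set happens to be light-colored, a simple nearest-centroid model leans on color and misclassifies a black Labradoodle.

```python
# Toy illustration of a skewed training set. Each "photo" is a pair of
# hypothetical features (color_lightness, fur_curliness), both on a 0-1 scale.

def centroid(points):
    """Mean of a list of 2D feature vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

def classify(item, centroids):
    """Assign the label whose centroid is nearest (squared Euclidean)."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(item, centroids[label])),
    )

# Skewed training set: every Labradoodle photo happens to be light-colored.
labradoodles = [(0.80, 0.90), (0.85, 0.90), (0.90, 0.90)]
other_dogs = [(0.20, 0.20), (0.15, 0.25), (0.25, 0.15)]  # dark, smooth-coated

centroids = {
    "labradoodle": centroid(labradoodles),
    "other_dog": centroid(other_dogs),
}

# A black but unmistakably curly Labradoodle: color dominates the distance,
# so the model gets it wrong.
print(classify((0.10, 0.90), centroids))  # -> other_dog
```

The model is not broken; the training data is. The same arithmetic explains why variant callers trained on predominantly white cohorts underperform on everyone else.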
In this sense, he said, open data is a tool to fight for justice. "We don't just need more science. We need more 'just' science," Wilbanks said, noting that All of Us is striving for a more diverse research cohort.
Open data could also potentially change the paradigm of doing research for the purpose of getting published in peer-reviewed journals.
"If you've got an algorithm to analyze genetic cancer data, it would be really nice if you could test that against a benchmark that's an open benchmark for doing cancer data analysis. And if you can't beat the benchmark, you probably shouldn't publish," Wilbanks said.
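The gating logic Wilbanks proposes is simple enough to sketch. The baseline score and metric here are hypothetical placeholders; a real open benchmark would pin a public dataset and an agreed evaluation metric.

```python
# Sketch of a "beat the open benchmark or don't publish" check.
# OPEN_BENCHMARK_SCORE is a made-up published baseline (e.g., an AUROC
# on a shared cancer-genomics test set), used only for illustration.
OPEN_BENCHMARK_SCORE = 0.82

def ready_to_publish(model_score, baseline=OPEN_BENCHMARK_SCORE):
    """Return True only if the new method strictly beats the open baseline."""
    return model_score > baseline

print(ready_to_publish(0.87))  # -> True: beats the baseline
print(ready_to_publish(0.79))  # -> False: falls short
```

The point is less the comparison itself than that the baseline is public: anyone can rerun the same check against the same data.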
"There are all of these things that open science can create that are not the artifacts of most modern academic science, which are papers," he said. This might include computer code, algorithms, or raw datasets.
Open science also can allow for informal peer review before research is even submitted for publication.
"It's good at benchmarking things like algorithms. It's good at dealing with things like biased training sets. It has all sorts of really important social goods, but it's not better, faster, cheaper," Wilbanks said. "It's just better."
With 2019 marking 10 years since Sage Bionetworks spun out of Merck's Rosetta Inpharmatics unit, Wilbanks said that the nonprofit is making a push this year to convince researchers to create "reusable data" in the same way that open-source software has led to "reusable" code.
"The old joke in my field has always been that other people's data is like other people's toothbrushes. You don't want to use it," he told GenomeWeb.
Sometimes that is because scientists tend to strip out many of the insights before they report results, but often it is due to the fact that researchers do not have or will not make the time to annotate their data in a way that would make their findings more useful to others.
"Until, in my opinion, we figure out how to get machine learning and [artificial intelligence] to do that annotation for us, it's going to be really hard to have data get as reusable as open-source software is," Wilbanks said. "But we will eventually get there."