CHICAGO (GenomeWeb) — A broad, multi-continental coalition of biomedical research institutions, universities, and software developers hopes to build a technology infrastructure to create a diverse virtual cohort of population-level genomic data for global research and analysis.
The recently launched Common Infrastructure for National Cohorts in Europe, Canada, and Africa (CINECA) project is a four-year effort to harmonize ontologies and create a secure, federated network for international data sharing. Led by the European Molecular Biology Laboratory's European Bioinformatics Institute, the 22 participating organizations in CINECA expect to be able to share data on 1.4 million individuals with accredited researchers worldwide.
The European Union's Horizon 2020 Research and Innovation Programme and the Canadian Institutes of Health Research are funding the effort, which will look to extend work underway for other genomic data projects, including the Global Alliance for Genomics and Health (GA4GH). According to EMBL-EBI, the project largely will follow the privacy practices of GA4GH, under which researchers will have to apply individually for access credentials.
"To make sure the work we're doing is as widely useful as possible, we'll be working through the GA4GH workstreams proposing these as standards and taking feedback," said Jonathan Dursi, senior research associate in the Centre for Computational Medicine at The Hospital for Sick Children in Toronto. Dursi also is architect and technical lead for the Canadian Distributed Infrastructure for Genomics (CanDIG), which is underpinning CINECA by building technology for federated queries and genomic analysis across locally controlled stores of health data.
"The things we're building won't just be one-offs for this project, but will actually be things that other groups can use and then instantly be interoperable" with projects like CanDIG or other CINECA participant sites, including the European Life-Sciences Infrastructure for Biological Information (ELIXIR) and the European Genome-phenome Archive, Dursi said. He said that the CINECA team will integrate the new cohort infrastructures into GA4GH as well as the ELIXIR Authentication and Authorisation Infrastructure (AAI).
GA4GH and ELIXIR are among the many CINECA participants that have previously established relationships, and leaders are continuously looking for new partners.
"What we've realized, though, is we've got a lot of other links with other groups. Collectively, hopefully, we can have an impact and certainly want to have the message get out that we really do encourage people to get in touch with us," said CINECA's Canadian lead, Fiona Brinkman, a professor of molecular biology and biochemistry at Simon Fraser University in Burnaby, British Columbia.
The initial scope of work for CINECA covers 13 cohorts, among them the Canadian Healthy Infant Longitudinal Development (CHILD) study, which is collecting data on 3,500 healthy children from birth to age 8. "We haven't even integrated most of it, but it's going to be way over 20 million data points, everything from microbiome to epigenome to genome to information about how often they keep their windows open at night for the children, the environmental data," Brinkman said.
Researchers then will look at health outcomes from that data, particularly outcomes related to allergy and asthma, Brinkman said. "Obviously, you can use this cohort to investigate other issues that are of concern [like] factors influencing development of obesity, [or] looking at the impact of the microbiome in these children on their health," she said.
"The idea is to look at these cohorts doing a meta-analysis of what are the common terms and data types being collected, what are different, where can we harmonize," Brinkman added. "For example, with the CHILD study, you can have rare instances of something. If this is occurring in another cohort, then all of a sudden you've got enough ... participants with that condition that you can actually start to make more statistically sound inferences about either associations with that or maybe even the causative associations based on further downstream analysis."
CINECA has a series of "work packages" around federated data discovery. The first few are looking at whether the participants can make user authentication and authorization interoperable between Europe and Canada. Brinkman said that the contract requires these issues to be worked out by mid-2020.
A future technical work package calls for software to enable "fairly lengthy computationally intensive analysis" on virtual cohorts of patients with similar disease phenotypes and genotypes, Dursi said. Additional challenges include following the Ethical, Legal, and Social Implications (ELSI) framework.
CINECA is a research-focused effort, but because participants will federate data through application programming interfaces, the infrastructure can be set up to handle clinical care data in the future, Dursi said. "Every access of the data, every lookup can be audited and logged and authenticated and authorized, and [you are] building from that a federated fabric by which you can ask more and more complex questions and perform more and more complex analyses," he said.
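The federated, audited access pattern Dursi describes might look roughly like the minimal Python sketch below. It is an illustration only, not CINECA or CanDIG code: the `CohortNode` class, the `federated_count` function, the site names, the user names, and the records are all hypothetical. Each site keeps its records locally, logs every lookup whether or not it is allowed, and returns only an aggregate count to authorized users.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CohortNode:
    """A hypothetical site that keeps its records under local control."""
    name: str
    records: list
    authorized_users: set
    audit_log: list = field(default_factory=list)

    def count_matching(self, user, predicate):
        """Log the lookup, then return a count only if the user is authorized."""
        allowed = user in self.authorized_users
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "query": predicate.__name__,
            "allowed": allowed,
        })
        if not allowed:
            return None
        return sum(1 for record in self.records if predicate(record))


def federated_count(nodes, user, predicate):
    """Ask every site the same question; only aggregate counts leave a site."""
    results = {}
    for node in nodes:
        count = node.count_matching(user, predicate)
        if count is not None:
            results[node.name] = count
    return results


def has_asthma(record):
    return record.get("asthma", False)


# Made-up sites, users, and records, for illustration only.
nodes = [
    CohortNode("site_a", [{"asthma": True}, {"asthma": False}], {"alice"}),
    CohortNode("site_b", [{"asthma": True}, {"asthma": True}], {"alice", "bob"}),
]
counts = federated_count(nodes, "bob", has_asthma)
print(counts)
```

In this sketch the user "bob" is authorized only at the second site, so the first site contributes no count, but its audit log still records the denied request.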
But none of that will be possible without standardizing the ontologies. In the case of analyzing outbreaks of infectious diseases, someone might report getting sick from eating leafy greens, while another reports an illness from eating lettuce. "The computer doesn't know a leafy green and lettuce is the same thing," Brinkman explained.
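As a rough illustration of what a shared ontology gives the computer, the Python sketch below maps free-text reports to common terms and then walks a small "is a kind of" hierarchy. Everything in it is an assumption made for the example: the term identifiers such as `FOOD:lettuce`, the synonym table, and the function names are invented, and do not come from CINECA or from any real ontology.

```python
# Hypothetical synonym table: free-text labels mapped to made-up term IDs.
SYNONYMS = {
    "lettuce": "FOOD:lettuce",
    "romaine": "FOOD:lettuce",
    "leafy green": "FOOD:leafy_green",
    "leafy greens": "FOOD:leafy_green",
    "spinach": "FOOD:spinach",
}

# Hypothetical "is a kind of" links between terms (child -> parent).
PARENT = {
    "FOOD:lettuce": "FOOD:leafy_green",
    "FOOD:spinach": "FOOD:leafy_green",
    "FOOD:leafy_green": "FOOD:vegetable",
}


def normalize(text):
    """Map a free-text report to a term ID, or None if it is unknown."""
    return SYNONYMS.get(text.strip().lower())


def is_a(term, ancestor):
    """Return True if term equals ancestor or sits below it in the hierarchy."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


def mentions(text, concept):
    """Return True if a free-text report refers to the given concept."""
    term = normalize(text)
    return term is not None and is_a(term, concept)


# Made-up outbreak reports, written the way different people might phrase them.
reports = ["Lettuce", "leafy greens", "Spinach ", "chicken"]
leafy = [r for r in reports if mentions(r, "FOOD:leafy_green")]
print(leafy)
```

With the hierarchy in place, a query for leafy greens also picks up the reports that mention lettuce or spinach, while the unrelated report is left out.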
"Of course we can build all the software for that in the world that we want, but if there aren't people like Fiona's team building the common ontologies ... it wouldn't matter at all if the data was written in different ways and it was just a Tower of Babel," Dursi said.
To this end, Simon Fraser University adjunct professor Will Hsiao is developing methods to sort through those discrepancies and make relevant links.
"We can think of it as essentially creating a universal translator," Hsiao said.
"My teams are building not just a vocabulary but also the tools that will allow the translation to occur automatically and with flexibility, in other words, creating the format not just for machines to understand, but also for humans to understand," Hsiao said. "Together, we can curate the knowledge both using machine-learning approaches but also with human curators."