NEW YORK – A new European Union-backed project will seek to integrate and make available COVID-19 pandemic data sourced from a variety of disciplines, from genomics to public health and social sciences. The €12 million ($13.7 million) effort, called BeYond-COVID, or BY-COVID, commenced last month and will run through October 2024.
ELIXIR, the European life-sciences infrastructure for biological information, is coordinating the effort, which will build on the work of the COVID-19 Data Platform, a separate EU-supported undertaking that started last year and is led by the European Molecular Biology Laboratory's European Bioinformatics Institute, or EMBL-EBI.
Niklas Blomberg, director of ELIXIR and project coordinator for BY-COVID, said that the effort has several aims: It will support genomic surveillance of COVID-19 and other infectious diseases as well as connect genomics data with public health and real-world healthcare data.
Social science archives in Europe will also be involved, Blomberg noted. "We are looking at how we can connect at least some genomic aspects of BY-COVID with data from social science surveys and more societal impact data," he said. Once these datasets are integrated, BY-COVID would like to pair them with effective analysis and deployment tools, he added.
According to Blomberg, the objective of BY-COVID is to ensure that infectious disease data can be easily accessed and used, not only for the current pandemic but also for potential future pandemics involving other diseases. For that reason, bringing in real-world healthcare data, public health data, and other social science information is necessary, as it allows researchers to gauge the pandemic response and aid the surveillance of infectious diseases. Such data can also be used to develop resources, to set new data standards and guidelines, and to better coordinate datasets from different sources in the future, he said.
Blomberg pointed out that BY-COVID is very much a complementary effort to the COVID-19 Data Platform. Like EMBL-EBI, Blomberg and the ELIXIR Hub are based at the Wellcome Genome Campus in Hinxton, UK, about 10 miles south of Cambridge. Blomberg described the COVID-19 Data Platform as a success and said that multiple countries have since established their own data hubs, meaning that data maintained at the national level can be accessed globally via the platform's COVID-19 Data Portal.
Guy Cochrane, head of the European Nucleotide Archive at EMBL-EBI, also noted the complementarity between the aims of the European COVID-19 Data Platform and BY-COVID. One of the core components of the COVID-19 Data Platform has been the SARS-CoV-2 Data Hubs, which he described as a "toolbox for those generating and working with viral sequences." According to Cochrane, the platform has so far mobilized some 2 million isolates' worth of raw and assembled sequence data from around the world, and has also provided variant calling and phylogenetic analysis.
As part of BY-COVID, EMBL-EBI will continue to operate components of the COVID-19 Data Platform, Cochrane said, including the SARS-CoV-2 Data Hubs, and will support data mobilization, processing, and interpretation of viral sequences and variation. The platform plans to extend the hubs in two ways, he added: by refining and extending them to better support integration into national data management systems, and by expanding the range of viral and host data types to include, for example, data from proteomics and metabolomics platforms.
The platform will also build a new component called the Preparedness Data Hubs, which will allow public health and research user communities to spin out hubs with high-throughput data processing, analysis, and interpretation tools for emerging pathogens. Within its COVID-19 Data Portal, plans are also underway to expand the indexing system to capture or link data resources relevant to COVID-19 that are not yet in the system.
Another ongoing project that BY-COVID seeks to align with is the Versatile Emerging Infectious Disease Observatory, or VEO. The EU awarded the VEO Consortium €15 million in 2020, and the project will run through 2024. VEO's aim is to provide data to inform early warning, risk assessment, and monitoring of infectious diseases. Several of the participants in VEO, such as the EMBL, Erasmus Medical Center, and Technical University of Denmark, are also participants in BY-COVID.
"The VEO project is being driven by the public health epidemiology pathogen community and has developed mechanisms where sequencing labs can spin out data hubs and environments where pathogen data can be easily analyzed in the context of all other publicly available pathogen data," said Blomberg. "We want to bring together developments in the COVID-19 Data Portal with developments in the VEO." Blomberg said that initial milestones for the project will be to show that SARS-CoV-2 Data Hubs can be deployed in different cloud or national data environments. "That is an important early deliverable," said Blomberg.
In Blomberg's view, one example where the availability and interoperability of these datasets are important is the increasing metagenomic surveillance of pathogens. "Of course if you don't have any reference data, it's impossible to do metagenomics-based surveillance," Blomberg remarked.
BY-COVID will also work with the Public Health Information Research Infrastructure, or PHIRI, to incorporate data from other disciplines, including public health, he said. PHIRI was funded by the EU last year, and the aim of the €5 million project has been to create a platform to make population health information related to COVID-19 available to researchers.
Enrique Bernal-Delgado, a senior health services and policy researcher at the Institute for Health Sciences in Zaragoza, Spain, who is involved in PHIRI, said the project is in the process of deploying a federated data infrastructure where sensitive health data remain under the control of the data holders, while algorithms are shared to respond to research questions on the effects of the COVID-19 pandemic.
Within BY-COVID, PHIRI will apply its experience in mobilizing real-world data, ranging from health and clinical information to genome data and data from randomized clinical trials studying the efficacy of SARS-CoV-2 vaccines and COVID-19 treatments.
"The use of such a variety of data origins implies an unprecedented exercise for PHIRI that will challenge the methodologies applied to the reuse of real-world data," underscored Bernal-Delgado. He said the project is determining how to best integrate these diverse datasets, and that the solution is "not obvious at all."
"We need to understand what data is available, what [data are] accessible, and with what level of granularity," he said. "Then, whether data linkage is possible, what is the representativeness of the final sample, [and] what is the quality of the resulting data," he said. "All these preliminary questions are now on the table and will be part of the upcoming work."
A pathfinder project
The BY-COVID project is built upon four pillars, according to the project description. The first is to mobilize data by ensuring that raw sequencing data can be submitted to important data hubs, such as the European Nucleotide Archive and the Federated European Genome Archive. The second is to build tools to link sequencing data and metadata, including public health and economics data. This will include digital tools for data analytics, tracking new genomic variations of SARS-CoV-2, and flagging new variants of concern. The third is to standardize data by encouraging the use of findable, accessible, interoperable, and reusable (FAIR) data standards and interoperability among resources. The fourth is to expose and analyze infectious disease data, such as via the COVID-19 Galaxy platform.
A key component is the Federated European Genome Archive, which hosts locally stored, centrally indexed human datasets relating to COVID-19. Cochrane said the platform has five national nodes in test operation and will soon move to production status. About 500,000 patients or research subjects are already in the EGA, he said. There is also the COVID-19 Data Portal, which provides access to 5 million records across sequencing, structural biology, proteomics, chemistry imaging, and other literature. The portal also supports a network of 10 national-level COVID-19 data portals, so that researchers can use them to work with data kept in country, Cochrane noted.
According to Blomberg, BY-COVID will also develop a tiered indexing engine that will make it easier to navigate the different data collections that are being linked. This will support access not just to data that can be integrated, such as biomolecular data, but also, for example, social science survey data. "We are not trying to fully integrate all of the data, but it might be possible to link samples from individuals to survey responses, geographical areas, or the temporal dimension," said Blomberg. The envisioned engine will be multifunctional, he added, and building it will involve computational experts from diverse research fields. Demonstrating the ability to link data across countries will be an important outcome, he said, though BY-COVID will seek to show that such linkage is feasible rather than to fully link those datasets.
Finally, as part of BY-COVID, the partners will drive stakeholder engagement to ensure that its platform continues to "develop appropriately for its user communities and to reach those in need of scientific data in support of informed COVID-19 and emerging pathogen responses," Cochrane said.
Ultimately, BY-COVID aims to serve as what Blomberg called a "pathfinder project," creating a basis for future data sharing and allowing researchers, including those in industry, to access datasets to better understand infectious diseases. "We will show that it is possible and that it has benefits, but we are not building an operational infrastructure for half a billion people in Europe in 27 countries with €12 million," he remarked.