CHICAGO – The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) hopes to facilitate research into SARS-CoV-2 and COVID-19 with a portal intended to serve as an information clearinghouse for sequencing data and literature on the novel coronavirus and the related respiratory disease.
The COVID-19 Data Portal launched last month on existing EMBL-EBI bioinformatics infrastructure, including the European Nucleotide Archive. It connects to the outside world via the European Open Science Cloud.
EMBL-EBI activated the portal with datasets from six of its own resources: the European Nucleotide Archive, UniProt, the Protein Data Bank in Europe, the Electron Microscopy Data Bank, the Expression Atlas, and Europe PubMed Central. The portal has since added resources including the Human Protein Atlas and a special early release of UniProtKB/Swiss-Prot data on SARS-CoV-2 protein sequences.
The Cambridge, UK-based institute is in the process of adding data from new SARS-CoV-2 genomic sequences that users can upload themselves. Also in the works is a dedicated cohort browser for searching clinical and epidemiological data, according to EMBL-EBI.
"Science, public health, and healthcare have to work together if we want to minimize the impact of the COVID-19 pandemic," Marion Koopmans, head of viroscience at Erasmus Medical Centre in Rotterdam, Netherlands, said in a statement. "We are hoping that this initiative will enable researchers, clinicians, and public-health workers to safely and efficiently share their data in order to come up with answers to the most pressing questions about COVID-19," added Koopmans, a COVID-19 Data Portal collaborator.
The initial six datasets contain hundreds of SARS-CoV-2 sequences, along with gene expression and structural data. The collection also includes indexes of a literature corpus that is growing daily.
"This is only the beginning of this whole avalanche of data which will come out of many, many worldwide efforts," said EMBL-EBI Director Rolf Apweiler.
The COVID-19 Data Portal includes analysis and visualization tools for interpretation of the raw data. Registered users also can apply their own bioinformatics software in the EMBL-EBI environment.
The data portal is part of EMBL-EBI's European COVID-19 Data Platform, an effort launched in March to build a series of data hubs that will organize the flow of SARS-CoV-2 sequencing data and facilitate information sharing among researchers worldwide.
This European COVID-19 Data Platform is one of 10 "priority actions" specified in the European Commission's initial ERAvsCorona Action Plan, a strategy developed in late March and April when several European countries were epicenters of the outbreak.
The platform is a joint effort between EMBL-EBI, the European Life-Sciences Infrastructure for Biological Information (ELIXIR), and the European Commission. Initial academic collaborators include Erasmus Medical Centre, the Technical University of Denmark, Eötvös Loránd University in Hungary, the Dutch National Institute for Public Health and the Environment, and Universitaetsklinikum Heidelberg in Germany.
The data portal grew out of platform-related discussions.
"The commission heard about what we are doing and said, look, we are planning something Europe-wide. What can you do and how can we help?" Apweiler said.
Apweiler said that EMBL-EBI had no specific funding from the European Commission for the portal, but was piggybacking on some administrative and technical work that the EC has backed.
EMBL-EBI runs the major European databases for sequences, gene expression, structure, literature, proteomics, and related biomedical information. "In this way it was easy to look for the COVID-related datasets," Apweiler said.
The institute already had funding from the European Union for projects related to pathogens in foodborne diseases and animal diseases.
"We had already a lot of infrastructure which was in the past funded by the commission already in place and we had only to repurpose it," Apweiler said. "We could re-use the mechanism we had, these so-called data hubs, and same for the pathogen portal, which led to the COVID-19 portal."
The data hubs allow researchers to bring their own sequencing data to process and analyze on the EMBL-EBI platform. These hubs make it easy for researchers to share their analyses with 800 key databases worldwide, including the National Center for Biotechnology Information in the US and the DNA Data Bank of Japan, groups with which the European institute has had two-way electronic connections for decades, according to Apweiler.
"All this COVID human genetics work is really global, not just European," he noted.
EMBL-EBI sent a questionnaire to the EU and its member states to get a sense of national-level datasets and data generation related to COVID-19, and results came back late last month.
"Based on this and on our many discussions we had with the commission, we are building this priority list and mapping out how we want to roll [the portal] out," Apweiler said.
Virus sequences will be high priority, as will genetic information about human hosts. Much of the latter is coming from the European Genome-phenome Archive (EGA), which EMBL-EBI runs in collaboration with the Centre for Genomic Regulation in Barcelona, Spain.
Longer term, Apweiler said that EMBL-EBI is looking at incorporating results of studies of how chemical molecules bind to proteins.
Apweiler said that EMBL-EBI already has centralized data on the novel coronavirus and expects to collect significant amounts of information on human hosts. "This will be genotyping of infected people or exome sequencing or whole-genome sequencing," he said. The genotyping information will reside in the EGA.
EMBL-EBI is already working on improving its federated data system across the continent to help the institute manage information from national databases in the EU's 27 member states, as well as cross-border collaborations.
"And then there is a lot of data which will never make it into our database, clinical trial data and such," Apweiler said. EMBL-EBI will point users to those datasets via the portal.
Apweiler noted that the institute aims to work with European infrastructures, as well as national infrastructures, either directly or through the European network. "I think we are [an] important part of the solution and we can build a very nice central point for people to start their search for good information, but it's not the one and only one," he said.
Eventually, Apweiler would like to see the COVID-19 Data Portal be an information clearinghouse for researchers across Europe and beyond.
"At the moment, we are not even showing all the datasets we have at EBI because we thought it was more important to start showing things than waiting until it's all there," he said, adding that EMBL-EBI will "constantly" add new datasets as they become available.
The institute has not yet set up any formal benchmarks, performance indicators, or targets for the COVID-19 Data Portal.
"What we are presenting helps people to make the right decision in setting up experiments and hypotheses," Apweiler said. "We want to make sure that we prioritize that which has the highest impact for our end users." He did not elaborate.
Soon, Apweiler hopes to have "tens of thousands" of coronavirus sequences submitted to the data portal and to draw users from across Europe and worldwide.
"We can't do it all on our own. We will not have all the tools to serve people, but we want to be a data delivery vehicle for people to offer vendors and scientists more than we can," Apweiler said. "We see us as facilitators, not necessarily as the endpoints."