CHICAGO – Bioinformaticians at the Center for Genomic Regulation (CRG) in Barcelona, Spain, this month introduced a platform that allows scientists worldwide to analyze raw and consensus COVID-19 sequencing data to compare genomic, proteomic, structural, and motif variability of the SARS-CoV-2 virus.
Called the COVID-19 Viral Beacon, the resource gives researchers a single portal to search for genetic variants and associated metadata among a collection of nearly 70,000 viral sequences from sources including the European Nucleotide Archive (ENA), Oxford Nanopore Technologies, Illumina, the US National Center for Biotechnology Information's Sequence Read Archive (NCBI/SRA), and the Global Initiative on Sharing All Influenza Data (GISAID).
COVID-19 Viral Beacon is a web interface that allows scientists to query components of viral genomes, both the raw reads and consensus sequences.
CRG, part of the Spanish branch of the European Life-Sciences Infrastructure for Biological Information (ELIXIR-Spain), calls Viral Beacon a "one-stop shop" for researchers to look for specific genetic variants and examine associated metadata, for example to study viral strains in various regions of the world. The interface was designed to work on mobile devices as well as computers.
It supports queries on SNPs, indels, annotations, short motifs, and amino acids.
COVID-19 Viral Beacon is built with the Beacon application programming interface (API), an open-source variant search protocol developed by the Global Alliance for Genomics and Health (GA4GH). ELIXIR has long participated in the development of the Beacon API.
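As a rough illustration of what a Beacon-style variant query looks like, the sketch below builds a request URL using query parameter names from version 1 of the GA4GH Beacon specification (referenceName, start, referenceBases, alternateBases, assemblyId). The endpoint is a placeholder, the assemblyId value is a guess, and whether COVID-19 Viral Beacon accepts exactly these parameters is an assumption. The example variant is the A-to-G change at position 23,403 of the SARS-CoV-2 reference genome (NC_045512.2), which underlies the spike D614G substitution.

```python
from urllib.parse import urlencode

# Placeholder endpoint for illustration only, NOT the real Viral Beacon address.
BASE_URL = "https://beacon.example.org/query"


def build_beacon_query(reference_name, position_1based, ref_base, alt_base,
                       assembly_id, base_url=BASE_URL):
    """Return a Beacon v1-style GET URL for a single-nucleotide variant."""
    params = {
        "referenceName": reference_name,
        # Beacon v1 coordinates are 0-based, so shift the 1-based position.
        "start": position_1based - 1,
        "referenceBases": ref_base,
        "alternateBases": alt_base,
        # Assumed value: a viral Beacon may expect a different assembly label.
        "assemblyId": assembly_id,
    }
    return f"{base_url}?{urlencode(params)}"


url = build_beacon_query("NC_045512.2", 23403, "A", "G", "NC_045512.2")
print(url)
```

A Beacon typically answers such a request with a yes-or-no statement on whether the allele has been observed, optionally with counts or metadata, which is what makes the protocol suitable for discovery across datasets that cannot be shared in full.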
COVID-19 Viral Beacon is not the first coronavirus resource to use the Beacon API, though. Early in the pandemic, DNAstack introduced an app called Beacon for SARS-CoV-2 to enable the scientific and medical communities to share and discover knowledge about the genetics of the virus in real time.
However, the Barcelona platform is much larger, now containing nearly 70,000 sequenced viral samples, about 76 percent of which come from GISAID. The DNAstack app, by comparison, is working from a database of fewer than 25,000 sequences.
For COVID-19 Viral Beacon, CRG took consensus data from GISAID and ENA, though ENA also supplied about 11,000 raw reads from Illumina and Oxford Nanopore sequencers. CRG is analyzing and calling variants in the raw reads to add some context. "This is new data that is not available elsewhere," said Jordi Rambla de Argila, team leader of the European Genome-phenome Archive (EGA) at CRG.
And because GISAID supplies only sequences, with no indication of which mutations have been mapped, CRG is applying its pipeline to that data as well.
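In its simplest form, deriving variants from a consensus sequence means comparing it position by position with a reference genome. The toy sketch below does this for two already-aligned, equal-length strings and reports single-base differences. It is not CRG's pipeline, which is not described here; the function name and the short sequences are invented for illustration, and a real pipeline would also handle alignment, indels, and read-level quality.

```python
def call_snvs(reference, consensus):
    """List single-base differences between two aligned, equal-length sequences.

    Returns (1-based position, reference base, consensus base) tuples.
    Positions where the consensus has an ambiguous 'N' are skipped.
    """
    if len(reference) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    variants = []
    for pos, (ref_base, alt_base) in enumerate(zip(reference, consensus), start=1):
        if alt_base != ref_base and alt_base != "N":
            variants.append((pos, ref_base, alt_base))
    return variants


# Made-up sequences, far shorter than a real viral genome.
reference = "ATGGACTTACG"
consensus = "ATGGGCTNACG"
print(call_snvs(reference, consensus))
```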
While COVID-19 Viral Beacon can help researchers uncover variants that signify new mutations, Rambla described the platform as more than a search engine.
Visitors to the website can download their search results and the data behind those results, then continue their own analysis from that point. Often, this helps confirm hypotheses, according to Rambla. "If you have a suspicion of something, you can go to the website and double-check what is there," he said.
The project began in March, when Spain emerged as an early hotspot for COVID-19. "We felt compelled to go and contribute," Rambla said. For CRG, this meant offering bioinformatics expertise.
The Center for Genomic Regulation is located in the Barcelona Biomedical Research Park, which is adjacent to the Hospital del Mar de Barcelona. The hospital's Medical Research Institute, known by its Catalan acronym, IMIM, is in the same building, and the hospital itself added beds in the Biomedical Research Park when COVID-19 surged early in the year.
"The context was pushing everyone to do as much as they can with COVID," Rambla said. For CRG, that meant teaming with the EGA to adapt Beacon for viral genomes.
That alone represented a challenge because CRG has experts in human genomics but not in viral genomics. However, Rambla has a background in phylogenetic and evolutionary analysis.
"When we start to look at the genomic information that is around in the SARS-CoV-2 domain, we saw that everything is coming from the epidemiology and very little is coming from the genomics world," such as variant calling, Rambla said. "We thought, can't we do something to help better discover the contents of the genomes of the virus in this moment?"
The CRG team decided to apply the principles of Beacon version 2, which Rambla co-leads for GA4GH, to viral sequencing. "We got the public information that we could and we started building a Beacon on top of this information that was available," Rambla said.
"To be honest, we are not sure that this could be a useful tool for the current epidemiologists because they tend to [work] with another kind of information," Rambla admitted, noting that epidemiologists generally do not deal with variants and diversity in viral genomes.
Until this month, CRG had only advertised the availability of COVID-19 Viral Beacon within communities like ELIXIR, more among bioinformaticians than biologists.
"We are starting to get feedback from biologists [now]," Rambla said.
Rambla said that his team is refining COVID-19 Viral Beacon nearly every week as new research comes out. He pointed to a paper in Nature this month that examined the more than 12,000 known mutations in SARS-CoV-2 genomes.
While only the D614G mutation has spread widely, there is no scientific consensus on whether this variant is more virulent or faster-spreading than other forms of the virus, the paper said. Studying mutation patterns further can help inform researchers working on treatments or vaccines.
COVID-19 Viral Beacon, Rambla said, can help researchers quickly locate information on mutations in specific positions so they can understand the meaning of the changes. The site has changed as the knowledgebase has grown, notably adding a search for changes in amino acids in the coronavirus's spike protein.
"We are adding these features to make the discovery much easier and we are also adding more context to the query you are doing," Rambla said.
As with many other genomic data repositories, the GISAID data in COVID-19 Viral Beacon lacks context because the biobanks that collect samples hide some personal information of donors, including their age and sex. Also, GISAID's data licensing agreement prohibits the redistribution of certain elements of the dataset.
"This means that we need to think in how to do some features in order to not to break [the] agreement," Rambla said.
He was particularly struck by the lack of metadata available to COVID-19 Viral Beacon prior to the CRG processing. "The pressure for getting access to the data and doing analysis on the data is so big that it's shocking to me that we haven't gotten [more] on that," Rambla said. Conspicuously absent are any clues about race or ethnicity.
"You cannot infer anything about [whether a mutation] is more common in men than women or in Asian people or in African people. You're completely blind to this information," Rambla said.
If a sample is known to come from China, for example, it is reasonable to assume that the donor is Asian, but even that is not a certainty. The problem is more acute in Europe and North America, where populations are far more diverse.
Rick Stevens, associate laboratory director at Argonne National Laboratory in Lemont, Illinois, called COVID-19 Viral Beacon "certainly useful" to the research community, but noted the same shortfalls of the GISAID dataset.
Stevens was especially struck by the use of Oxford Nanopore and Illumina pipelines to help overcome some of the limitations of the GISAID data, which lacks much metadata and does not allow the redistribution of certain demographic characteristics of sample donors. "What they've done is integration of the raw data with the assembled data, which is really useful," Stevens said. He also praised the graphical display on the COVID-19 Viral Beacon website.
For its part, CRG is now beefing up statistical information on the website, including infection breakdown by patient sex, age range, and country when available, complete with search capabilities. "Instead of giving you a set of numbers saying this is the number of mutations with a raw list of where you find it, we will offer you some more statistical context," Rambla said.
He said that they are also looking at presenting information on indels in viral genomes rather than just single-nucleotide variants. "Sometimes it's very relevant because it is what is making a protein functional or not functional or extra-functional," Rambla said.
In a statement, EGA team member Babita Singh said that the CRG is now looking to collaborate with others in the field to make Viral Beacon "a quick, go-to genomic variants search tool for COVID and other infectious diseases." Singh expressed a desire to add human genomic variants in the future to support research into the interactions between human and viral genomic factors.
"This is a useful resource," said Argonne's Stevens. "Is it going to give us something that's impossible to get any other way? No, but it's good that they're doing it and they're making the data that they can distribute available."