Australia's CSIRO Looks Beyond Spike Protein to ID New COVID-19 Variants to be Monitored


CHICAGO – Bioinformaticians in Australia are calling for a change in how epidemiologists monitor and predict new COVID-19 variants, using analytics software to sort through entire viral variant genomes rather than simply looking for changes in the gene encoding the SARS-CoV-2 spike protein. The researchers hope that this technology could underpin a future global infectious disease surveillance network.

In a recent article in the Computational and Structural Biotechnology Journal, a team from the e-Health Research Centre of Australia's Commonwealth Scientific and Industrial Research Organisation (CSIRO) noted that the definitions of all variants being monitored (VBMs) since the start of the pandemic in 2020 have concentrated on the spike protein. "However, there is evidence that other regions of the SARS-CoV-2 virus may also have an impact on clinically relevant properties," the researchers wrote.

Denis Bauer, bioinformatics group leader at CSIRO, said that strains like BA.5, which is currently dominant in the US, are changing the outlook for public health officials worldwide, who have predominantly been looking for changes in the spike protein when defining variants of concern or variants to be monitored.

"This additional step of first grouping things into clades and then looking for pathogenicity may be not the best approach," Bauer said. "We should be looking for pathogenicity of the individual locations in the genome because the virus doesn't care, either, what its family tree is. It just mutates randomly."

To uncover single-nucleotide variants associated with disease severity and other factors, the CSIRO group ran a genome-wide association study, applying its VariantSpark computing method, which is capable of analyzing complex, polygenic phenotypes that may involve epistatic interactions using massive sets of whole-genome data.

VariantSpark is a cloud-based, distributed, machine-learning computational framework that achieves scale through multilayer parallelization. The software can automatically detect SNVs across entire genomes and identify epistasis even when mutations have limited or no functional effects.
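A toy illustration may help show why epistasis can be invisible to a locus-by-locus scan. The sketch below is not VariantSpark's or BitEpi's algorithm; the two SNVs, the sample counts, and the helper names are all invented for illustration. It assumes an extreme case in which the outcome is severe only when exactly one of two mutations is present, so neither mutation shows any effect on its own.

```python
from itertools import product

# Toy data (invented): every combination of two hypothetical SNVs
# (0 = reference allele, 1 = mutated), each repeated 25 times.
# The outcome flag is 1 ("severe") only when exactly one SNV is present.
samples = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2) for _ in range(25)]

def severe_rate(rows):
    """Fraction of rows whose outcome flag is 1."""
    return sum(r[2] for r in rows) / len(rows)

def single_locus_effect(index):
    """Difference in severe rate between carriers and non-carriers of one SNV."""
    carriers = [r for r in samples if r[index] == 1]
    others = [r for r in samples if r[index] == 0]
    return severe_rate(carriers) - severe_rate(others)

def pair_rates():
    """Severe rate for each joint genotype of the two SNVs."""
    return {
        (a, b): severe_rate([r for r in samples if r[0] == a and r[1] == b])
        for a, b in product((0, 1), repeat=2)
    }

effect_a = single_locus_effect(0)
effect_b = single_locus_effect(1)
joint = pair_rates()

print("single-locus effects:", effect_a, effect_b)
print("severe rate by joint genotype:", joint)
```

In this constructed case each single-locus effect is zero, while the joint genotypes separate severe from non-severe samples completely. Real viral data are far noisier, but the example shows the kind of signal that only a method examining combinations of loci can pick up.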

An earlier, logistic-regression GWAS by the Global Initiative on Sharing All Influenza Data (GISAID) only found one locus of significance. "Since VBMs are characterized by multiple SNVs, it is hence more likely that multiple loci in the viral genome evolve together to modulate its pathology, and such an outcome would have been expected in a GWAS study," the CSIRO researchers wrote.

For this new experiment, CSIRO mined the GISAID database of nearly 3.4 million SARS-CoV-2 sequences to assemble a case-control dataset of 10,520 SARS-CoV-2 samples with annotations that indicated patient outcomes ranging from asymptomatic cases to death. While this set represents less than 1 percent of all the GISAID sequences, the CSIRO researchers said it was the world's largest case-control dataset for VBM detection. 

The investigators built clade-independent VBM definitions with the help of the epistatic interaction detection software BitEpi before applying VariantSpark analysis to identify 117 mutations with "significant association" to patient health outcomes, including 70 not found in earlier research.

CSIRO focused on 31 previously unmonitored pathogenic mutations. Among those, 29 had an allele change to "N," according to the paper. "This indicates that any move away from the original Wuhan strain has an influence on the disease outcome," the researchers wrote.

They homed in on the translation inhibition activity of the exonuclease domain of NSP14, a protein that helps the coronavirus evade antiviral responses in infected people. "Using genomic, health, structural, and molecular data, our study provides further evidence supporting the importance of this region as an attractive therapeutic target for SARS-CoV-2," the authors wrote.

For their COVID-19 research, CSIRO's team set VariantSpark to provide hourly updates of any new findings, a key element in offering timely warnings of novel variants to assist public health authorities and healthcare systems in adjusting epidemiologic response strategies.

To validate their findings, the CSIRO investigators followed the work of a Yale University team, published in the Proceedings of the National Academy of Sciences last year, that looked at immunological implications of both the SARS-CoV-2 spike protein and NSP14. Bauer said that her team got in touch with the Yale researchers to tell them that they had found a mutation adjacent to one that the Yale group had mutated in vitro.

"It's very likely that it has the same actual mechanism," Bauer said of the two findings.

For CSIRO, the next step is to periodically reanalyze the dataset to validate the functional outcomes they are able to identify. Bauer would also like to group the findings not by individual mutations, but by function and pathogenesis to create groups of characteristics, such as those that affect the lungs and those that affect the immune system.

"We [want to] say this virus that we observed has functional group X and therefore we need to treat it in a certain way, or we need to monitor it in a certain way," Bauer said.

Looking forward, CSIRO would like to add in analysis of crystallographic structures aided by models of protein complexes. "Our method of identifying single mutations and 2-, 3- and 4-SNV combinations that significantly affect patient outcome and are supported by protein modeling predictions may offer a streamlined approach to quickly flag dangerous mutation combinations and has the potential to supplement current variant surveillance efforts," the authors wrote. "Future work should include in vitro assays assessing functional consequences of the novel mutations identified in this study."

In positioning VariantSpark to support global disease surveillance, Bauer referred to a recent article in the Economist by John Bell, chair of Genomics England's scientific advisory committee, in which the Canadian immunologist and geneticist called for a global public health surveillance network that includes genomic data, patient outcomes data, and analytics tools.

The next "concrete" step for Bauer is to collaborate on a paper with wet labs to illustrate how VariantSpark performs as an in silico analysis method. That, she said, would add weight to the argument that VariantSpark could support the kind of global network that Bell envisions.

However, pandemic fatigue may be complicating the task of building such a network.

While the CSIRO researchers asserted in their article that theirs is the largest dataset of its kind, they also said that it is far from sufficient for global disease surveillance because there is simply not enough annotation data. In fact, Bauer said that the problem is getting worse because there is less coordination between pathology labs and healthcare organizations now than there was earlier in the pandemic.

During the study period, which included data up to the "lock" date of Sept. 14, 2021, VariantSpark found that 0.3 percent of samples in GISAID had annotation data available. Even though the GISAID database now contains 11.7 million samples and CSIRO has since augmented its dataset to more than 23,800 entries, the annotation rate is closer to 0.2 percent, according to Bauer. 
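The two annotation rates can be roughly checked from the figures quoted in this article. The sketch below assumes that the annotated counts are the dataset sizes given here (10,520 samples against "nearly 3.4 million" sequences, and "more than 23,800" entries against 11.7 million), so the results are approximations; the helper name is hypothetical.

```python
def annotation_rate(annotated: int, total: int) -> float:
    """Return the share of sequences with outcome annotations, in percent."""
    return 100.0 * annotated / total

# Figures as quoted in the article; both totals are rounded there.
rate_2021 = annotation_rate(10_520, 3_400_000)    # dataset at the Sept. 14, 2021 lock date
rate_later = annotation_rate(23_800, 11_700_000)  # augmented dataset vs. current GISAID size

print(f"Sept. 2021 lock date: {rate_2021:.2f}%")
print(f"Later figure:         {rate_later:.2f}%")
```

Under these assumptions the rate falls from about 0.31 percent to about 0.20 percent, in line with the 0.3 and 0.2 percent figures cited: the annotated set roughly doubled while the database more than tripled.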

"There's more samples submitted that do not have the patient outcome, where the patient outcome is labeled as unknown," she said. Healthcare systems have come to just treat COVID-19 patients without regard to the strain of the virus each patient has, so there is no outcomes data to link to viral sequences.

Indeed, Bell wrote in the Economist that cooperation in data sharing is "the greatest obstacle to the global surveillance system," hindered in part by concerns about data usage and privacy.

Bauer said it is imperative for the international research community to come together with national and regional healthcare provider networks to commit to building and maintaining a robust database of annotated viral sequences.