Pandemic Accelerates Development of Informatics Tools for Viral Genome Analysis


CHICAGO – One lasting legacy of the COVID-19 pandemic for future public health crises may turn out to be the advancement of bioinformatics technology for understanding the spread of viral variants.

Examples include developers of informatics tools at the TCS Research division of Tata Consultancy Services in India and at Rice University, both of whom are building on past work while looking ahead as they address the emergency in front of them.

A key focus is helping researchers parse the massive amount of genomic data being churned out every week. By one estimate, there have been more than half a million scientific papers on COVID-19 published or released in preprint form since the start of the pandemic, many of them focusing on the virus' genome.

With this glut of data, bioinformaticians and computational biologists at TCS Research set out to create a SARS-CoV-2 genome atlas. The centerpiece of this effort is a "computationally inexpensive and intuitive" tool that is scalable to provide visualizations of large collections of genome sequences, according to project leader Naina Tiwari.

Tiwari described preliminary work on this visualization tool and genome atlas last month during the virtual Intelligent Systems for Molecular Biology and European Conference on Computational Biology (ISMB/ECCB) conference.

"It avoids expensive multiple sequence alignments and phylogeny computations," Tiwari said. "It can incorporate new sequences without much computational burden, and it can support visualizations based on genomic regions of interest."

Tiwari called this yet-unnamed software tool a "fast and inexpensive approach that efficiently extracts variant-level features of strains and [uses] compute embedding that considers local similarities among strains."

The technology looks for "meaningful clusters" in datasets through an approach the TCS Research team called a "bag of variants." Tiwari told GenomeWeb via email that this was inspired by bag-of-visual-words embedding that is common in computer vision applications, which itself is based on a technique in natural language processing called bag-of-words.

A "bag of variants" is a collection of variant clusters that feeds the visualization engine through a low-dimensional embedding of sequencing data.
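
As a rough illustration of the bag-of-words analogy, the sketch below turns each aligned strain into a vector of variant counts. It is not the TCS Research implementation: the article does not describe how variants are clustered, so the fixed-size genomic windows used here as "variant clusters," the function names, and the toy sequences are all assumptions made for the example.

```python
from collections import Counter


def call_variants(reference, aligned):
    """Return (position, ref_base, alt_base) substitutions for one aligned sequence.

    Gaps ("-") and ambiguous bases ("N") are skipped rather than counted.
    """
    return [
        (i, r, a)
        for i, (r, a) in enumerate(zip(reference, aligned))
        if a != r and a not in "-N"
    ]


def bag_of_variants(reference, strains, window=4):
    """Map each strain to a vector of variant counts per genomic window.

    Assumption: each window stands in for one "variant cluster"; a real
    tool would likely learn the clusters from the data instead.
    """
    n_bins = (len(reference) + window - 1) // window
    vectors = {}
    for name, seq in strains.items():
        counts = Counter(pos // window for pos, _, _ in call_variants(reference, seq))
        vectors[name] = [counts.get(b, 0) for b in range(n_bins)]
    return vectors


# Toy data, invented for illustration only.
reference = "ACGTACGTACGT"
strains = {
    "strain_a": "ACGTACGTACGT",  # identical to the reference
    "strain_b": "ACCTACGTACGA",  # substitutions at positions 2 and 11
    "strain_c": "ACCTACGTAC-T",  # substitution at position 2, gap at position 10
}
vectors = bag_of_variants(reference, strains)
for name, vec in vectors.items():
    print(name, vec)
```

Vectors like these could then be passed to a dimensionality-reduction step to place each strain on a two-dimensional canvas. Each strain is compared only with the reference, so adding a new sequence does not require recomputing a multiple sequence alignment.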

The researchers took in FASTA files of SARS-CoV-2 sequences and related metadata from the GISAID repository. "While visualizing the data points, we use color coding … to incorporate the metadata information. This helps in bringing out the spatial, temporal, and clade level evolution of the data in the canvas," Tiwari said during the conference, which was hosted by the International Society for Computational Biology.

The GISAID data mostly contained sequences obtained through January 2021, though the researchers performed some visualizations of about two months' worth of sequences of the B.1.617 (Delta) variant, which was not identified until March.

The TCS researchers performed their computations and visualizations on a standard desktop computer. According to Tiwari, the most time-consuming part was the alignment of candidate strains with a reference sequence, a one-time preprocessing step that took about 12 hours for a dataset of nearly 260,000 sequences.

Preliminary data presented at ISMB/ECCB showed that the TCS visualizations were successful at capturing sequence divergences and clusters as well as mapping temporal and clade-level evolution of the virus. Tiwari said that a complete manuscript with a wider set of experiments will soon be submitted to a peer-reviewed journal.

This first iteration relied on simple clustering to identify the "bag of variants," according to Tiwari. She said that the technology is capable of managing more sophisticated techniques, such as processing sequence data and metadata together.

"We believe that our approach can potentially serve as a valuable visual aid in analyzing large collections of genomes, including metagenomes and other pan-genomes beyond the COVID-19 dataset," she said via email.

Tiwari explained that she and her colleagues were not trying to identify specific mutations. "The goal of our work is to provide an inexpensive and easy visual aid for analyzing large collections of genome sequences," she said. "These visualizations can complement other, deeper analysis approaches such as the standard phylogeny-based methods, which are computationally involved for large datasets."

The atlas is not yet available for download, but Tiwari said that she and her colleagues are creating a repository so others can access the TCS Research visualizations.

In another approach to making sense of burgeoning SARS-CoV-2 datasets through techniques including visualization, Rice University is working with Signature Science to adapt the existing Harvest suite of open-source alignment and visualization tools for COVID-19.

Rice has a 12-month, $630,000 contract with the US Centers for Disease Control and Prevention that started in late June to create a version of the software called Harvest Variants to track SARS-CoV-2 variants. The research laboratory of Todd Treangen, a specialist in computational microbial forensics at the Houston school, applied nearly $250,000 of that total to subcontract biocuration and other services to Austin, Texas-based SigSci, a subsidiary of the Southwest Research Institute.

The Harvest software grew out of a collaboration Treangen participated in nearly a decade ago when he was working on a contract with the US Department of Homeland Security's National Biodefense Analysis and Countermeasures Center. Collaborators on that project included Adam Phillippy, who now heads up bioinformatics at the National Human Genome Research Institute.

The Harvest software suite was born in 2012 when that group found other alignment software too slow for processing more than about 100 microbial genomes at a time. Harvest helped Treangen and colleagues scale from hundreds to thousands and tens of thousands of genomes.

The suite includes an aligner named Parsnp, as well as a graphical user interface called Gingr for viewing variants and phylogenetic trees. It also has a component called Harvest Tools, which Treangen called a "Swiss Army knife of [file] conversion."

A 2014 paper in Genome Biology described how Harvest improved on earlier alignment and visualization software. As it turns out, one tutorial use case for that work was a seasonal cold coronavirus, Treangen said.

Treangen said that when the COVID-19 pandemic hit, his mind went back to the earlier coronavirus example. This time, he had to scale Harvest from tens of thousands to 1 million or more genomes at a time. "That was motivational to me for why I wanted to go to the CDC and apply for this opportunity," he said.

In addition to enhancing earlier versions of Harvest, the developers are adding new capabilities, most notably the ability to detect SARS-CoV-2 variation within individual hosts, he said.

In a paper published earlier this year in Genome Research, Treangen and colleagues found that about 5 to 10 percent of SARS-CoV-2 genomes in an infected person have variants from the consensus sequence.

This low-frequency variation might not get passed from person to person, but it can indicate how an individual fights the infection, which can inform test, treatment, and vaccine development, according to Treangen. This information might also provide clues about the types of mutations that could evolve into variants of concern.

Harvest was originally designed to explore single genomes. An update for the COVID-19 era has added low-frequency variant information across as many as a billion copies of viral genomes.

"No one's really tracking this information," Treangen said. "The files are big [and] it's computationally expensive."

Analysis tools typically compress large numbers of copies down to one to save disk space and computational time, but that eliminates information on low-frequency variants. This is where SigSci comes in.
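
The tradeoff can be shown with a small sketch. This is not Harvest code: the function name, the 5 percent reporting threshold, and the toy read pileup are assumptions chosen only to show what is lost when many viral copies are collapsed into one consensus base per site.

```python
from collections import Counter


def summarize_site(bases, min_freq=0.05):
    """Return the consensus base and any minor alleles at or above min_freq.

    `bases` is a string of the bases observed at one genome position
    across all reads from a single sample (a simplified pileup column).
    """
    counts = Counter(bases)
    total = len(bases)
    consensus, _ = counts.most_common(1)[0]
    minor = {
        base: count / total
        for base, count in counts.items()
        if base != consensus and count / total >= min_freq
    }
    return consensus, minor


# Toy pileup, invented for illustration: position -> observed bases.
pileup = {
    100: "A" * 100,           # every read agrees
    200: "C" * 92 + "T" * 8,  # 8 percent of reads carry a T
}

# Consensus-only view: one base per site, so the T at position 200 disappears.
consensus_only = {pos: summarize_site(bases)[0] for pos, bases in pileup.items()}

# Low-frequency view: keeps the minor allele and its frequency.
low_freq = {pos: summarize_site(bases)[1] for pos, bases in pileup.items()}

print(consensus_only)
print(low_freq)
```

Storing the extra frequency table for every site in every sample is what makes the files large, which is why choosing which low-frequency variants are worth surfacing matters.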

"You do have to figure out a clever way of not showing duplicate information and really just highlighting low-frequency variation of note inside of a person without just creating a bunch of extra information that could cloud out important other things," Treangen said.

SigSci is bringing its biocuration experience and expertise to the project. The firm has built a database from the medical literature that Harvest can quickly cross-reference with mutations of interest.

Because papers related to SARS-CoV-2 are still proliferating at a breakneck pace, a student in Treangen's group at Rice is looking into incorporating natural-language processing techniques to mine the White House's COVID-19 Open Research Dataset and present results to accelerate the work of SigSci biocurators.

The developers are building these enhancements in a way that Harvest can be easily adapted to future biological threats.

Treangen said that in an early iteration of the update, Harvest can show mutations in the coronavirus spike protein and the prevalence of certain variants in specific regions of the world. "There's nothing that will limit it from being applied to other viruses," he said.