NEW YORK – From a purple dot in Wuhan, China, lines start to emerge, reaching the UK, Thailand, and Australia.
New colors appear, representing genetic changes that occur in the SARS-CoV-2 coronavirus as it moves, infects large groups of people, and mutates. Soon, color-coded lines stretch across the globe, illustrating clusters of genetically similar SARS-CoV-2 viruses from individuals in different parts of the world.
Lines connect China to the US and Canada; Australia to Japan, the US, and Canada; and Europe to the US. A multicolored web forms across the globe.
The maps, featured on the website for the open-source pathogen genome project Nextstrain, are growing by the day and include data for thousands of publicly available SARS-CoV-2 genomes.
Emma Hodcroft, a postdoctoral researcher in molecular life sciences at the University of Basel and Swiss Institute of Bioinformatics and Nextstrain co-developer, explained that the viral movements illustrated on the site are "generally accurate," though she cautioned that the inferences are limited by the locations where viral genome sequencing is taking place, as well as the quality of the genomes submitted to sites such as GISAID or GenBank.
"If we don't have sequences from a country, it won't be included on that map — and we definitely don't have sequences from every country," she said, noting that this will inevitably give "a bit of a biased view of what happened," even before taking sequencing errors into account.
At the University of Florida at Gainesville's Emerging Pathogens Institute, virologist Carla Mavian is part of another team tracking the emergence of new SARS-CoV-2 genomes. She, too, cautioned against over-interpreting data from small sets of SARS-CoV-2 genomes or incomplete genome sequences, in part because so few differences exist between the coronavirus clusters.
In low-coverage sequences, Mavian explained, "there are a lot of unknown regions that aren't contributing at all if you do a phylogenetic tree, and are actually adding noise."
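One simple way to act on that caution is to screen out genomes with too many unresolved bases before building a tree. The Python sketch below is illustrative only: the 5 percent cutoff, the function names, and the toy sequences are assumptions made for this example, not a description of what Mavian's team or Nextstrain actually does.

```python
from typing import Dict


def ambiguous_fraction(seq: str) -> float:
    """Return the share of positions that are not a plain A, C, G, T, or U."""
    if not seq:
        return 1.0  # treat an empty record as entirely unknown
    known = sum(1 for base in seq.upper() if base in "ACGTU")
    return 1.0 - known / len(seq)


def filter_low_coverage(genomes: Dict[str, str],
                        max_ambiguous: float = 0.05) -> Dict[str, str]:
    """Keep only genomes whose ambiguous fraction is at or below the cutoff.

    The 0.05 default is a hypothetical threshold chosen for illustration.
    """
    return {
        name: seq
        for name, seq in genomes.items()
        if ambiguous_fraction(seq) <= max_ambiguous
    }


# Toy records: "gappy" has four unresolved positions out of ten.
toy_genomes = {
    "good": "ACGTACGTAC",
    "gappy": "ACNNNNGTAC",
}
kept = filter_low_coverage(toy_genomes)
print(sorted(kept))
```

In this toy run only the fully resolved record passes the filter. Any real cutoff would have to be tuned to the dataset at hand.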
For a preprint added to bioRxiv earlier this month, she and her co-authors delved into the phylogenetic and phylogeographic information that could be gleaned from more than 2,600 SARS-CoV-2 genomes available from 55 countries as of the end of March.
They warned that while the "number of available full genomes is growing daily, and the full dataset contains sufficient phylogenetic information that would allow reliable inference of phylogenetic relationships, country-specific SARS-CoV-2 datasets still present severe limitations" and called for "continuing concerted efforts to increase [the] number and quality of the sequences required for robust tracing of the epidemic."
Mavian, who is originally from Italy, said she is eager to see more SARS-CoV-2 sequences from that country and from other places that have been hit hard by COVID-19.
"Imagine how the picture would change if we had homogenous sampling from Italy, which was one of the first countries in Europe that had a spike of cases? We would see a totally different tree, at least as far as Europe is concerned, because who knows how many transmissions have occurred from Italy to other countries," she said.
A malady on the move
Despite remaining gaps and uncertainties, researchers are learning from the growing collection of SARS-CoV-2 genomes. Indeed, the sequences are being used to explore everything from the potential origins of the COVID-19-causing coronavirus to its evolution and spread within and between countries and continents, along with the consequences of infection mitigation measures in different regions.
So far, new mutations have been arising in SARS-CoV-2 at a rate of roughly two to 2.5 per month, on average, and the most widely differentiated viruses on the SARS-CoV-2 tree differ at just a few dozen sites across an RNA genome of around 29,000 bases. That means that while clusters of genetically similar SARS-CoV-2 are sometimes referred to as "strains," Hodcroft said, they remain very similar and do not have the diversity of competing influenza strains, for example.
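To put those figures on a per-site scale, the back-of-the-envelope Python calculation below converts mutations per month into substitutions per site per year. It assumes the monthly counts apply to the whole genome and uses the rounded 29,000-base length quoted above. The function name is invented for this example.

```python
GENOME_LENGTH = 29_000  # approximate genome size quoted in the article


def per_site_yearly_rate(mutations_per_month: float,
                         genome_length: int = GENOME_LENGTH) -> float:
    """Convert genome-wide mutations per month to substitutions per site per year."""
    return mutations_per_month * 12 / genome_length


for monthly in (2.0, 2.5):
    rate = per_site_yearly_rate(monthly)
    print(f"{monthly} mutations/month -> {rate:.2e} substitutions per site per year")
```

Both ends of the range work out to roughly one substitution per thousand sites per year.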
Even modest changes in the SARS-CoV-2 genome are proving informative, though, particularly as more and more sequences are generated and analyzed together.
In general, for example, Nextstrain's phylogenetic analyses, along with epidemiological and surveillance data, support a big picture view of the COVID-19 pandemic that involved SARS-CoV-2 movement out of China to other parts of Asia, before the virus swept across Europe and North America.
But the finer details of viral spread are a bit more difficult to tease apart, especially without complementary data sources, Hodcroft cautioned. While it may be possible to find genetic connections between viruses infecting individuals in Germany and Italy, for example, "it's very difficult for us to say which way that line goes, because there will be lots of samples that we didn't take."
"There are many more people who weren't sampled than who have been sampled," she explained, noting that connections between different parts of Europe could also occur if, for example, individuals at both sites were somehow infected by a yet-to-be-sampled person from China or another location. "There's always a lot of uncertainty."
Still, there are situations where Hodcroft and her colleagues at Nextstrain have gotten strong hints about an individual's COVID-19 infection — and about the viruses circulating in certain locations — based on the SARS-CoV-2 genetics.
For example, they identified COVID-19 cases in Canada, New Zealand, and elsewhere that involve a SARS-CoV-2 strain found in some individuals in Australia. The infected individuals did not know each other, but at least some of them share a point of connection: travel to Iran.
"Because their samples link so closely with the others that were in Iran, we can say with high confidence that those people were either infected in Iran or by someone who has come from Iran," Hodcroft explained.
So while no viral genomes have been reported directly from Iran, she noted, that set of samples offers a peek at viruses that may well be circulating there.
In the US, "we're hopeful that, as sequencing is starting to scale up, we get sequences from different places," Hodcroft noted. "We know that the epidemic was a little bit behind Europe, for example, so we might be able to tell a little bit better the story of how this moved around the US, how many introductions there were, this kind of thing."
In the meantime, the Nextstrain team, and other groups working independently, have put out a steady stream of preprint articles that highlight the kinds of information being gleaned from coronavirus sequences collected nationally and internationally — from SARS-CoV-2 spread within Iceland to the multiple sources of introduction and spread in specific parts of the US.
By sequencing nearly 350 SARS-CoV-2 viruses collected in Washington State prior to mid-March, including samples obtained from the Seattle Flu Study, for example, Hodcroft and colleagues from the US and Switzerland estimated that the virus was "circulating cryptically, i.e., undetected by the surveillance apparatus, in Washington State since January 2020."
The team shared its work in a post to medRxiv last Monday, noting that "we underscore the following recommendations in settings where large-scale community transmission is not yet recognized: the importance of early identification of the virus, extensive testing of potential cases, and immediate self-isolation of infected persons."
Two independent research groups led by investigators at Mount Sinai's Icahn School of Medicine and New York University reported last week that viruses circulating in New York City could primarily be traced back to strains that reached the city via Europe, after the virus took hold on that continent.
A team from the UK, meanwhile, attempted to put together a SARS-CoV-2 phylogeny based on 160 SARS-CoV-2 genomes for a paper published in the Proceedings of the National Academy of Sciences last week. Based on their analysis, the authors proposed that the SARS-CoV-2 genomes "are closely related and under evolutionary selection in their human hosts, sometimes with parallel evolution events."
That work has been criticized online by some members of the bioinformatics and genomics community, though, in part because it relies on a phylogenetic tree that includes the bat coronavirus Bat CoV RaTG13, which differs from SARS-CoV-2 by at least 1,100 bases, a genetic chasm compared with the subtle genetic differences documented so far in the coronaviruses behind COVID-19.
Though she was not directly referring to the conclusions of the PNAS paper, Hodcroft noted that the vast majority of the mutations in the thousands of global SARS-CoV-2 genomes assessed so far "as far as we can tell, are not meaningful."
"They're just the typos the virus makes as it replicates, and we don't see any evidence that these have a functional impact," she said. "We will need longer-term studies to know for sure."
Stanford University's Julia Palacios, a statistician who builds models for biomedicine, evolutionary genomics, epidemiology, and infectious disease, also noted that changes she's seen so far appear to be peppered randomly across the SARS-CoV-2 genome.
More to learn
Palacios and her team are interested in searching for signs of selection as more SARS-CoV-2 genomes accumulate and the virus continues to evolve. But they are also tapping into the data to track viral mutation dynamics as a potential measure of mitigation success.
Just as the mutations that crop up in SARS-CoV-2 over time can be used phylogenetically to inform transmission events, Palacios explained, the sequence data can be a valuable source of information on the effective population size of individuals being infected within a given area.
"The idea is that if you [take] a random sample from the population and [the viruses] are all very closely related, that means the population size is small. Whereas if they diverge a lot, that's an indication of the population size being large," she said, noting that the virus "needs to infect a lot of people" to evolve.
This analytical strategy provides a genetic look at how well efforts to stop the spread of COVID-19 are working in different places, including in California. Palacios is part of a team that plans to analyze sequences from samples collected at Stanford Hospital in the not-too-distant future — work that will be done in conjunction with investigators who are delving into host genetic features related to SARS-CoV-2 infection susceptibility and COVID-19 severity.
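The intuition Palacios describes, that closely related samples point to a small infected population and divergent samples to a large one, can be illustrated with a crude summary statistic: the average number of differences between pairs of aligned sequences. The Python sketch below is only a toy proxy under stated assumptions (equal-length, pre-aligned sequences, invented data, hypothetical function names). It is not the statistical modeling her group performs, which the article does not detail.

```python
from itertools import combinations
from typing import List


def hamming(a: str, b: str) -> int:
    """Count positions where two aligned sequences carry different resolved bases."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Positions with an ambiguous base in either sequence are skipped.
    return sum(
        1 for x, y in zip(a, b)
        if x != y and x in "ACGT" and y in "ACGT"
    )


def mean_pairwise_differences(seqs: List[str]) -> float:
    """Average Hamming distance over all pairs; 0.0 if fewer than two sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


# Invented toy samples: one tightly related set, one more divergent set.
tight = ["ACGTACGT", "ACGTACGT", "ACGTACGA"]
spread = ["ACGTACGT", "TCGAACGA", "ACCTAGGT"]

print(mean_pairwise_differences(tight))
print(mean_pairwise_differences(spread))
```

The divergent set yields the larger average, which under this simplified reading would hint at a larger infected population. Turning such a number into an actual population-size estimate requires a mutation rate and a formal model.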
In China, preliminary results point to a period of exponential infections and relatively speedy viral change that was followed by a bottleneck in new mutations in SARS-CoV-2. Palacios is starting to see some of the same patterns in country-by-country analyses of other parts of the world, though many spots in Europe and North America still appear to be in growth or stabilization stages.
"The next step is to combine this with surveillance data to try to improve, and have more fine estimations, of what is going to happen in the fall and what's going to happen in the coming months," Palacios said.