NEW YORK (GenomeWeb) – Researchers from the Joint Genome Institute have scoured metagenomic data to uncover sequences belonging to more than 125,000 viruses.
As the researchers reported in Nature today, this represents a 16-fold increase in the number of known viral genes. With their metagenomic approach that drew on more than 5 terabases of data, JGI's Nikos Kyrpides and his colleagues also uncovered the largest known phage and found that most viruses, but not all, have preferred habitats.
"It is the first time that someone has looked systematically across all habitats and across such a large compendium of data," Kyrpides said in a statement. "A key to uncover all these novel viruses was the sensitive computational approach we have developed along this work."
To tease viral DNA out of the large amount of metagenomic sequence data they gathered from the Integrated Microbial Genomes with Microbiome Samples (IMG/M) system, which covered more than 3,000 geographically diverse locales, Kyrpides and his colleagues used viral protein families as bait. These families came both from a biased collection of isolate viral genomes (iVGs) and from 1,800 manually identified metagenomic viral contigs (mVCs). With this approach, they identified 125,842 putative DNA metagenomic viral contigs.
This, the researchers noted, increased the number of known viral genes 16.6-fold. Further, these contigs encode more than 2.79 million proteins, three-quarters of which show no sequence similarity to known viral isolate proteins.
When the researchers examined the mVCs, they noted that the contigs ranged in length from 5 kilobases to almost 600 kilobases and, based on end overlap, that nearly 1,000 of them were circular and represented complete viral genomes. The largest circular contig, which was 596 kilobases in size, hailed from a bioreactor sample and contained many of the hallmark genes of a tailed virus, with no evidence of bacterial housekeeping genes or plasmids. This suggested to the researchers that it is the largest phage yet identified.
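The end-overlap check mentioned above can be illustrated with a minimal sketch. It assumes that a contig is flagged as putatively circular when its first and last bases are identical over some minimum length. The function names, the exact-match criterion, and the length thresholds are hypothetical and chosen for illustration; the article does not describe the researchers' actual procedure or parameters.

```python
def terminal_overlap(seq: str, min_len: int = 20, max_len: int = 500) -> int:
    """Return the length of the longest prefix of seq that is also its suffix.

    Only overlaps between min_len and max_len bases are considered, and the
    overlap may not exceed half the contig. Returns 0 when none is found.
    Thresholds are illustrative assumptions, not values from the study.
    """
    seq = seq.upper()
    upper = min(max_len, len(seq) // 2)
    # Try the longest candidate overlap first so the first hit is the maximum.
    for k in range(upper, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def is_putatively_circular(seq: str, min_len: int = 20) -> bool:
    """Flag a contig whose two ends repeat the same sequence exactly."""
    return terminal_overlap(seq, min_len=min_len) > 0


if __name__ == "__main__":
    head = "ACGTTGCAAGGCTTAACCGA"            # 20-base repeat at both ends
    body = "TTTTTTTTTTGGGGGGGGGGCCCCCCCCCC"
    print(is_putatively_circular(head + body + head))      # ends repeat
    print(is_putatively_circular(head + body + "G" * 20))  # ends differ
```

A real pipeline would likely tolerate mismatches in the overlap and would need to rule out terminal repeats in linear genomes, which this exact-match sketch does not attempt.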
Using a number of computational methods, the researchers attempted to connect the viruses to their hosts. One method they turned to relies on the CRISPR-Cas prokaryotic immune system, which stores short DNA fragments, called spacers, copied from phages that have previously infected the host. The researchers developed a database of 3.5 million spacers from prokaryotic isolate genomes and metagenomes in IMG, which they then matched against their viral sequence set. Through this, Kyrpides and his colleagues identified the hosts of 9,992 of the viruses they'd uncovered, many of which were previously unknown.
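The idea behind spacer-based host assignment can be sketched as follows. This is a simplified illustration that assumes exact substring matching on either strand; the function names and data layout are hypothetical, and the article does not specify the matching tool or the mismatch tolerance the researchers used.

```python
from collections import defaultdict

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def assign_hosts(spacers: dict[str, list[str]],
                 contigs: dict[str, str]) -> dict[str, set[str]]:
    """Link viral contigs to hosts whose CRISPR spacers they contain.

    spacers maps a host name to its list of spacer sequences.
    contigs maps a contig ID to its nucleotide sequence.
    A host is assigned when any of its spacers occurs exactly in the contig
    on either strand (an illustrative simplification).
    """
    hits: dict[str, set[str]] = defaultdict(set)
    for contig_id, contig in contigs.items():
        contig = contig.upper()
        for host, host_spacers in spacers.items():
            for spacer in host_spacers:
                spacer = spacer.upper()
                if spacer in contig or reverse_complement(spacer) in contig:
                    hits[contig_id].add(host)
                    break  # one matching spacer is enough for this host
    return dict(hits)


if __name__ == "__main__":
    spacers = {"host_A": ["ACGTTGCAAGGC"], "host_B": ["TTTTTTTTGGGG"]}
    contigs = {
        "contig_1": "GGGACGTTGCAAGGCCCC",   # carries the host_A spacer
        "contig_2": "AAACCCCAAAAAAAAAAA",   # carries host_B's reverse complement
        "contig_3": "ACACACACACACACAC",     # no spacer match
    }
    print(assign_hosts(spacers, contigs))
```

At the scale described, with millions of spacers and more than 100,000 contigs, a nested loop like this would be far too slow, and an indexed alignment tool would be the practical choice.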
That data also suggested that most viruses have a rather narrow host range, the researchers reported. Most viral sequences matched to hosts that belonged to the same species or genus, though some viral sequences did match to a broader range of hosts.
When the researchers aligned their contigs against all assembled and unassembled metagenomic sequences, they found 86 percent of their viral sequences in more than one sample and 73 percent in more than five samples. Most of these samples were from marine or human-associated habitats that have been well studied. They then used these matches to examine the habitats from which these viruses hailed.
For marine habitats, most viruses separated into zone-specific groups, though some were present across zones — one viral group was found in 95 percent of all twilight zone samples and 44 percent of deep-ocean samples. Similarly, the researchers found that though 84 percent of their viral groups could be found in multiple samples, they lived in a single habitat type. And though 14 percent of the viral groups lived in two habitat types, those fell broadly within the same environmental category.
Based on this, Kyrpides and his colleagues also developed a map that linked the viral samples with their geographic coordinates.
"One of the most important aspects of this study is that we did not focus on a single habitat type. Instead, we explored the global virome and examined the flow of viruses across all ecosystems," Kyrpides added. "We have increased the number of viral sequences by 50x, and 99 percent of the virus families identified are not closely related to any previously sequenced virus. This provides an enormous amount of new data that would be studied in more detail in the years to come."