NEW YORK (GenomeWeb) – Researchers at Washington University have developed a bioinformatics pipeline that can be used on next-generation sequencing data to discover novel viruses or to analyze the viral makeup of a metagenomic sample.
The group described the pipeline recently in the journal Virology and has been using it for a variety of purposes, including discovering novel viruses in tumor samples and profiling the viral composition of stool from children at risk for type 1 diabetes, lead author Guoyan Zhao said in an interview. The bioinformatics tools are free for use by other researchers.
Zhao, an assistant professor in the department of pathology and immunology at WashU, said that the group's research has long focused on novel virus discovery. When NGS technology started becoming more widespread, the team needed suitable tools, first to work with sequence data from Roche's 454 platform and later to be compatible with Illumina's instruments.
Currently, bioinformatics pipelines to analyze NGS data for microbial sequences fall into two broad categories, Zhao said, each of which has drawbacks. Pipelines that focus on virus discovery from metagenomic samples, for instance, rely on matching nucleotide sequences to known viruses, so they can miss highly divergent viruses. The tools that seek to analyze the makeup of a mixed sample and provide quantitative estimations of the different viral genomes present in a sample lack a way to determine whether novel viruses are present.
Thus, the group wanted to design a pipeline that could address both of these issues, Zhao said. The team designed two complementary approaches: VirusSeeker-Virome and VirusSeeker-Discovery. VS-Virome was designed to define the type and abundance of viral sequences in a metagenomic dataset. VS-Discovery, by contrast, includes an assembly step, which helps to identify more accurately whether a sequence is from a novel virus, a known virus, or not viral at all, and also increases the sensitivity of detecting highly divergent viruses.
One of the most important steps in developing the tools, said Zhao, was to curate and annotate viral databases, including both viral nucleotide and protein databases. Although the WashU team relied on data from the National Center for Biotechnology Information, Zhao said they spent a lot of time sifting through the entries to curate them and remove misannotated sequences. "Many people don't realize that there are a lot of false positives" when comparing a sequence to the database, she said. "The database only has viral sequences, but if a candidate sequence shares similarity to a virus sequence in the database, it will be called as viral," she said, "when in fact it might be [from] bacteria or fungi."
As part of the curation process, Zhao said, the team focused on eliminating duplicate sequences in the database, including only the longest sequences from each virus, since the longer the sequence, the more likely it would be to match only to a viral sequence and not to bacteria or something else. In addition, she said, the researchers removed misannotated sequences.
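The deduplication idea described here can be illustrated with a minimal Python sketch. It is not the WashU team's code: the record layout `(virus_name, sequence_id, sequence)` and the function name `keep_longest_per_virus` are assumptions made for illustration, and the sketch keeps a single longest entry per virus, which simplifies whatever selection rule the curators actually applied.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (virus_name, sequence_id, sequence).
Record = Tuple[str, str, str]


def keep_longest_per_virus(records: Iterable[Record]) -> Dict[str, Tuple[str, str]]:
    """Return, for each virus name, the (sequence_id, sequence) of its longest entry.

    Ties are resolved in favor of the entry seen first.
    """
    longest: Dict[str, Tuple[str, str]] = {}
    for virus, seq_id, seq in records:
        current = longest.get(virus)
        # Replace the stored entry only when the new sequence is strictly longer.
        if current is None or len(seq) > len(current[1]):
            longest[virus] = (seq_id, seq)
    return longest


if __name__ == "__main__":
    toy_records = [
        ("virus A", "a1", "ACGT"),
        ("virus A", "a2", "ACGTACGT"),
        ("virus B", "b1", "GG"),
    ]
    for name, (seq_id, seq) in keep_longest_per_virus(toy_records).items():
        print(name, seq_id, len(seq))
```

Removing misannotated entries, the other part of the curation, required manual review according to Zhao and is not something a filter this simple could reproduce.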
As of last August, WashU's viral nucleotide and protein databases consisted of 1.2 gigabases and 287 megabases of data, respectively, making them significantly smaller than the 122 gigabases of viral nucleotide data and 40 gigabases of protein data that NCBI stores. An advantage of a smaller database, Zhao said, is that it enables matches to be made much faster. The researchers estimated in the study that alignment against their database was 50- to 150-fold faster.
Another key characteristic of both the VS-Virome and VS-Discovery pipelines is that the first step is to join the reads from Illumina paired-end sequencing because the longer reads create more accurate matches in the database. After that first step, the pipelines differ, with VS-Discovery going on to assemble reads into contigs.
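To show what joining a read pair means in practice, here is a simplified Python sketch. It is an assumption-laden illustration rather than the method VirusSeeker uses: it requires an exact match in the overlap, ignores base qualities, and the names `join_pair` and `min_overlap` are hypothetical. Real read-merging tools tolerate mismatches and weigh quality scores.

```python
from typing import Optional

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_pair(forward: str, reverse: str, min_overlap: int = 10) -> Optional[str]:
    """Join a read pair into one longer sequence if the two reads overlap.

    The reverse read is reverse-complemented so both reads lie on the same
    strand; the longest exact overlap between the end of the forward read and
    the start of that sequence is then used to stitch them together.
    Returns None when no overlap of at least min_overlap bases is found.
    """
    rc = reverse_complement(reverse)
    longest_possible = min(len(forward), len(rc))
    for k in range(longest_possible, min_overlap - 1, -1):
        if forward[-k:] == rc[:k]:
            return forward + rc[k:]
    return None


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTAACCGGTTAGCATCGA"
    read1 = fragment[:20]
    read2 = reverse_complement(fragment[8:])
    print(join_pair(read1, read2))
```

In this toy case the two reads share 12 bases, so the joined sequence recovers the full 30-base fragment, which is longer than either read alone.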
The researchers also worked to boost both the sensitivity and specificity of the pipeline, looking for a good balance between catching potential novel viruses and minimizing false positive calls. One aspect is that reads with significant hits to both viral and non-viral sequences are placed into a so-called ambiguous bin and can be analyzed later.
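The binning rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline's actual logic: each hit is reduced to a hypothetical `(is_viral, evalue)` pair, the `max_evalue` cutoff of 1e-3 is an arbitrary placeholder for whatever significance threshold the authors used, and the bin names other than "ambiguous" are invented for the example.

```python
from enum import Enum
from typing import Iterable, Tuple


class Bin(Enum):
    VIRAL = "viral"
    AMBIGUOUS = "ambiguous"
    NON_VIRAL = "non-viral"
    UNASSIGNED = "unassigned"


def assign_bin(hits: Iterable[Tuple[bool, float]], max_evalue: float = 1e-3) -> Bin:
    """Assign a read to a bin based on its significant database hits.

    hits: (is_viral, evalue) pairs, one per alignment hit (hypothetical layout).
    max_evalue: assumed significance cutoff; hits above it are ignored.
    """
    significant = [is_viral for is_viral, evalue in hits if evalue <= max_evalue]
    has_viral = any(significant)
    has_non_viral = any(not is_viral for is_viral in significant)

    if has_viral and has_non_viral:
        # Significant hits on both sides: set aside for later analysis.
        return Bin.AMBIGUOUS
    if has_viral:
        return Bin.VIRAL
    if has_non_viral:
        return Bin.NON_VIRAL
    return Bin.UNASSIGNED


if __name__ == "__main__":
    print(assign_bin([(True, 1e-20), (False, 1e-8)]))
    print(assign_bin([(True, 1e-20), (False, 0.5)]))
```

Keeping ambiguous reads in their own bin, rather than discarding them or calling them viral, is what lets the trade-off between sensitivity and specificity be revisited later.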
In the study, the researchers demonstrated the tools on stool samples from a monkey infected with SIV that was part of a study on vaccination. They used VS-Virome to analyze the impact of vaccination on the virome. In one sample, from a monkey that had not received a vaccine, the team had detected several novel viruses from the Bunyaviridae, Circoviridae, Picobirnaviridae, and Tombusviridae families.
Using VS-Virome, the researchers detected 2,155 candidate eukaryotic viral sequences, 574 of which were judged to be true viral sequences, including novel viruses. Next, they applied the VS-Discovery pipeline to the data in order to get longer contigs and detected 130 contigs that were deemed to be true viral sequences.
The VS-Virome pipeline identified three reads that shared around 30 percent of their sequence with the Pacui virus, an unclassified virus in the Bunyaviridae family. Using VS-Discovery, however, the researchers found a contig just under 7 kb in length that had high sequence similarity to an RNA polymerase from the Batai virus, which is also from the Bunyaviridae family, and that contig was nearly identical to the three reads identified by VS-Virome. While further studies will be needed to classify the virus, it could "represent a novel genus in the Bunyaviridae family," the authors wrote.
Zhao said the researchers are continuing to use the tools on a number of different projects.
In one project, Zhao said, they are studying the viral composition of stool from children at risk for type 1 diabetes, comparing the viral makeup of samples taken monthly from birth to the age of three. The researchers are looking at whether and how the viral composition changes over time and at differences between the children who develop diabetes and those who do not.
The researchers are also using the pipeline to look for novel viruses in patient samples, including cerebrospinal fluid, tumor tissue, and nasopharyngeal swabs, Zhao said. In addition, the team published a study in Cell Host & Microbe last year that used a beta version of VS-Virome to examine patients infected with HIV, comparing the viral composition of stool from those who had received therapy with that of those who had not, as well as with that of individuals without HIV.
Recently, the group embarked on a collaboration with the National Primate Center to identify viruses that cause diarrhea in monkeys. Zhao said it's a common problem among monkeys involved in medical research, but the etiology is unknown. For that study, the group is using the VS-Discovery tool to identify novel viruses. "But, the question is which ones are associated with disease," Zhao said. "The follow-up is more challenging than actually identifying the viruses."