NEW YORK (GenomeWeb) – Researchers from the Institute of Clinical Molecular Biology (IKMB) at the Christian-Albrechts-University Kiel have developed a method of detecting viral sequences that are integrated into host genomes during infection, an advance that could eliminate or cut down on false positive virus detections.
The tool, called Virus Integration Detection by Paired End Reads (Vy-PER), is an open-source package of Linux command line tools for highly sensitive and specific detection of virus integrations into a host genome from Illumina paired-end whole-genome or whole-transcriptome data. Michael Forster, a scientist in IKMB's genetics and bioinformatics group and one of Vy-PER's developers, presented the method in a talk at the high-throughput sequencing special interest group meeting that was part of this year's Intelligent Systems for Molecular Biology conference, held in Boston July 11-15.
By inserting their genomic code into the host genome, harmful viruses like HIV and HPV ensure that pathogenic proteins are still produced even after infections are treated. In addition, there are ongoing efforts to use viruses as vectors to deliver treatments to patients. It is crucial to monitor these vectors to ensure that they don’t cause additional harm by randomly inserting on or near an oncogene and causing cancerous cells to develop, for instance.
Studies that explore how viral-host genomic integration occurs have to deal with false positive virus detections, a problem that arises when algorithms incorrectly assign reads that are actually from the host genome to viruses, the IKMB researchers explained in their conference abstract. To address this issue, "we identified highly effective filters that increase specificity without compromising sensitivity for virus/host chimera detection after paired-end sequencing and BWA alignment," according to the abstract. It's these filters that are implemented in the Vy-PER software.
Specifically, Vy-PER brings together three existing alignment tools: BWA, BLAT, and Smith-Waterman, as well as the Phobos STR-typing tool, which is used for identifying low-complexity sequences, Forster explained to BioInform. In addition to these, "we provide a wrapper and additional logic, filtering, and [other features] in our tool." Vy-PER is able to run analyses in parallel on multiple compute nodes of a Linux cluster and then merge the individual results into a single final result.
Vy-PER owes its origins to a childhood acute lymphoblastic leukemia (ALL) study at IKMB, funded by the German Office for Radiation Protection, that aims to identify somatic mutations and structural variations associated with the disease, Forster told BioInform following his presentation. This study is part of a larger one called the International BFM Study Group, which promotes both research and clinical care for children and adolescents with leukemia and lymphoma.
For their part of the study, Forster and colleagues have collected and sequenced 20 tumors and matched germline genomes from 10 patients to 80x and 40x coverage, respectively. While mapping these reads to the human reference genome using standard methods such as BWA and Bowtie, the researchers discovered sequences that did not map to the human reference, and they believed that these might be potential viral sequences.
But there was also a possibility that these hits were actually short tandem repeats or homopolymers that were simply mismapped, Forster said. That's because methods like BWA and BLAT are approximate, meaning that occasionally they may not compute the correct match between a sequence and the reference, he explained. "Normally, existing algorithms won't map an entire 100 base-pair read [for instance] to a virus reference. Instead, they'll map maybe 30 or 50 base-pairs, which may be very low complexity … and that's where we found that people may get false positive hits from."
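To illustrate the kind of low-complexity hit Forster describes, the following is a minimal Python sketch of a sequence-complexity check. It is not Vy-PER's filter and not the Phobos algorithm; the function names and the thresholds (`max_run`, `min_entropy`) are hypothetical values chosen only for this example.

```python
from collections import Counter
import math


def longest_homopolymer(seq: str) -> int:
    """Return the length of the longest run of one repeated base."""
    best = run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        prev = base
        best = max(best, run)
    return best


def dinucleotide_entropy(seq: str) -> float:
    """Shannon entropy (bits) of overlapping 2-mers; low values suggest repeats."""
    kmers = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_low_complexity(seq: str, max_run: int = 10, min_entropy: float = 2.0) -> bool:
    """Flag a sequence as low complexity (thresholds are illustrative assumptions)."""
    seq = seq.upper()
    return longest_homopolymer(seq) >= max_run or dinucleotide_entropy(seq) < min_entropy


if __name__ == "__main__":
    for fragment in ("A" * 30, "AC" * 20, "ACGTTGCAAGCTTAGGCATCGATCCGTAAGCTTGACG"):
        print(fragment[:20], looks_low_complexity(fragment))
```

A 30- or 50-base partial match made up of a homopolymer or a short tandem repeat would be flagged by a check of this kind, whereas a fragment with a varied base composition would not.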
Vy-PER, for its part, uses an exact alignment algorithm to align candidate viral sequences to a preset "window" of the human genome, he explained. If the reads don't map to the window — which can be changed if users prefer to use different parameters — then Vy-PER uses the Smith-Waterman algorithm to compare the candidate sequences to the rest of the human genome. This is the most time-consuming part of the process, Forster admitted. For example, if a user has about 1,000 potential viral sequences and a single compute core, they would need a full day to compare the sequences to the entire human genome, he said. The IKMB researchers have a local field-programmable gate array-based supercomputer with Smith-Waterman implemented on it, and so they were able to run the same analysis in under a minute, he said.
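For readers unfamiliar with Smith-Waterman, the sketch below computes the best local alignment score between a candidate sequence and a target using the textbook dynamic-programming recurrence with a linear gap penalty. It is an illustration only, not Vy-PER's or the FPGA implementation; the function name and the scoring values (`match`, `mismatch`, `gap`) are assumptions made for this example.

```python
def smith_waterman_score(query: str, target: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (linear gap penalty; scoring values are illustrative)."""
    prev = [0] * (len(target) + 1)  # previous row of the scoring matrix
    best = 0
    for q in query:
        curr = [0]  # first column is always zero
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap        # gap in the target
            left = curr[j - 1] + gap  # gap in the query
            score = max(0, diag, up, left)  # zero floor makes the alignment local
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "TTACGTTT"))
    print(smith_waterman_score("ACGTACGT", "ACGTTCGT"))
```

Every cell of the query-by-target matrix is filled in, so the work grows with the product of the two sequence lengths. That is why comparing candidates against the roughly three billion bases of the human genome is slow on a single core and benefits from dedicated hardware.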
Forster's team has tested Vy-PER on data from their ALL study. When they used Vy-PER to explore the data, they could find no significant evidence for virus integration in patient samples. In a separate test, the researchers also applied the method to published liver cancer transcriptome data that were known to include sequences from the hepatitis B virus.
In tests that used the liver cancer genomic data, "our method eliminated 6,400 false positives per 40x genome," the conference abstract stated.
Forster and his colleagues are currently revising a manuscript, which they plan to submit to Nucleic Acids Research at the end of the month, that describes Vy-PER and provides the results of benchmarks against other tools, among other details.
They are also looking at ways to get Vy-PER to run faster. Currently, users who don't have access to sufficient compute power can run BLAT as an alternative to Smith-Waterman using special settings that try to compute the alignments exactly, Forster said. "BLAT uses a single core in the current version [of Vy-PER], and in the future version [we'll be able to] run it on as many cores as the user wishes. So, for example, on a 16-core compute node the run time can be cut from more than a day to a couple of hours."
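The run-time figures Forster quotes are consistent with simple back-of-the-envelope arithmetic. The Python sketch below assumes ideal linear scaling, meaning the work divides evenly across cores with no overhead, which real cluster jobs rarely achieve; the function name is hypothetical.

```python
def ideal_runtime_hours(single_core_hours: float, cores: int) -> float:
    """Run time under ideal linear scaling (an optimistic assumption)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return single_core_hours / cores


if __name__ == "__main__":
    # A 24-hour single-core job spread over a 16-core node.
    print(ideal_runtime_hours(24, 16))
```

Under that assumption, a 24-hour single-core job would take 1.5 hours on 16 cores, in line with the "couple of hours" estimate for a job that takes more than a day.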
Longer term, "we want to make our FPGA supercomputer accessible on the web for registered users who wish to use the Smith-Waterman aligner," he said.