Researchers from the Broad Institute's computational research and development group have launched Discover Variants through Assembly, or DISCOVAR, a new software tool for assembling small genomes de novo and calling variants in genomes of any size.
DISCOVAR was developed by the same research team that built ALLPATHS-LG, a short-read assembly algorithm capable of assembling large genomes de novo (BI 1/7/2011), and ARACHNE, a whole-genome shotgun assembler.
David Jaffe, the director of computational research and development in the Broad's genome sequencing and analysis program and one of DISCOVAR's developers, told BioInform that his team developed the tool to analyze data generated using an improved Illumina PCR-free protocol for creating sequencing libraries. The team also wanted to take advantage of the longer reads, 250 base pairs in length, that the company's MiSeq and HiSeq 2500 instruments can generate.
In addition, they wanted to offer a variant caller that could work with assemblies based on the new Illumina data and that performed better than existing variant-calling tools, he said.
DISCOVAR is designed to work with data from a single Illumina paired-end library, generated with the PCR-free protocol, that has fragment sizes of approximately 700 base pairs and yields 250-base paired-end reads.
The tool represents genome assemblies as graphs whose edges represent sequences, its developers explain. Each edge is given as a record in a FASTA file, with graph connectivity information recorded in the header lines.
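To illustrate the general idea of storing an assembly graph as FASTA records, the following is a minimal Python sketch. The header layout used here (`>edge_id source:target`), the sample records, and the function names `parse_edge_fasta` and `find_bubbles` are assumptions made for this example; they are not DISCOVAR's actual file format or code, which the developers do not detail here.

```python
from collections import defaultdict

# Hypothetical header layout, assumed for illustration only
# (NOT DISCOVAR's documented format):
#   >edge_id source_vertex:target_vertex
SAMPLE = """\
>e0 0:1
ACGTAC
>e1 1:2
GGT
TTA
>e2 1:2
GGCTTA
>e3 2:3
CCATG
"""


def parse_edge_fasta(text):
    """Return {edge_id: (source, target, sequence)} from FASTA text."""
    edges = {}
    edge_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The header carries the edge name and its two end vertices.
            edge_id, link = line[1:].split()
            src, dst = (int(v) for v in link.split(":"))
            edges[edge_id] = [src, dst, []]
        elif edge_id is None:
            raise ValueError("sequence line found before any header")
        else:
            # Sequence may be wrapped over several lines.
            edges[edge_id][2].append(line)
    return {eid: (s, d, "".join(parts)) for eid, (s, d, parts) in edges.items()}


def find_bubbles(edges):
    """Group edges that share both end vertices (alternative sequences)."""
    by_endpoints = defaultdict(list)
    for eid, (src, dst, _seq) in edges.items():
        by_endpoints[(src, dst)].append(eid)
    return {ends: ids for ends, ids in by_endpoints.items() if len(ids) > 1}


edges = parse_edge_fasta(SAMPLE)
bubbles = find_bubbles(edges)
```

In this toy graph, edges `e1` and `e2` both run from vertex 1 to vertex 2 but carry different sequences. Parallel edges of this kind are one common way an assembly graph can expose candidate variant sites.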
DISCOVAR reports the variants it finds in a human-readable plain text file, using a format that was developed specifically for the software. The developers are currently working on translating this output to the Variant Call Format (VCF), the standard format for storing sequence variants. As part of that effort, they're also trying to "figure out the best ways to represent complex variations that [DISCOVAR] can detect" using VCF, Iain MacCallum, assistant director of genome assembly at the Broad, told BioInform.
DISCOVAR, which was released earlier this month, is designed for assembling and analyzing single samples rather than populations. This makes it ideal for studying genetic variations underlying Mendelian diseases as well as tumor genomes, according to its developers.
"We are explicitly targeting … problems that focus on the genome of an individual patient," Jaffe said. "Our goal is to get the quality as close to clinical gold standard as we can."
DISCOVAR can also be used to assemble microbial-sized genomes and small portions of larger genomes. The developers are working to incorporate capabilities for assembling larger genomes within the next few months, Jaffe said.
Other development plans include adapting DISCOVAR for use with data from Oxford Nanopore's sequencers, enabling the use of jumping library data, and potentially allowing the use of shorter Illumina reads.
Currently, the team is comparing DISCOVAR's performance with that of existing software such as Cortex, a genome assembler and variant caller developed jointly by researchers at the UK's Genome Analysis Centre and the University of Oxford. Jaffe could not disclose details about the results of these ongoing comparisons, but he said that the findings will be included in a paper describing DISCOVAR's development and applications that will be published at a later date.
The software is free for academic use, but companies will be charged a yet-to-be-determined licensing fee to use the tool.