NEW YORK (GenomeWeb) – Researchers from the Wellcome Trust Center for Human Genetics (WTCHG), the Institute of Molecular Medicine at the University of Oxford, and collaborators at other institutions recently published a paper in Nature Genetics describing Platypus, a software tool they developed that combines several computational approaches to call variants in whole-genome and whole-exome data with high sensitivity and specificity.
The paper explains that Platypus, which can be used for both research and clinical applications, "combines a haplotype-based, multi-sample variant caller with local sequence assembly in a Bayesian statistical framework." It uses local de novo assembly to generate candidate variants and then follows up with local realignment and probabilistic haplotype estimation steps. Besides being able to accurately call SNPs, insertions and deletions, and complex polymorphisms faster than existing methods such as the Broad's Genome Analysis Toolkit (GATK) and SAMtools, Platypus also provides "reference calls and local linkage information between called variants" that can be used to estimate HLA genotypes directly from data, the developers wrote.
Full details of the software are provided in the paper and accompanying supplementary documents. The paper also describes how Platypus performed when it was used to call variants in whole-genome and exome-capture data, and highlights its ability to call de novo mutations in familial trios. It also reports the results of comparisons between the variant calling abilities of Platypus, GATK, and SAMtools.
It's interesting to contrast Platypus' approach with the methods that other programs use to call variants, Gerton Lunter, one of the authors of the paper and head of a statistical and population genetics research group at WTCHG, said in a conversation with BioInform this week.
A common approach is to map reads to a reference and then look for systematic differences between the reference and sample. It's an approach that works well in the vast majority of cases, especially for calling SNPs, but, among other problems, it's not as effective for larger and more complex variants, he said. Reference-free, assembly-based methods avoid the limitations of alignment methods, the paper notes, but high computational requirements, lower sensitivity, and problems with repetitive sequences are some of the issues with this technique.
Platypus combines the strengths of these two methods with those of a third approach, which works by using data from related samples, in a single framework. It divides the variant calling process into "a candidate generation stage, designed to optimize sensitivity; and a haplotype-based calling stage, designed for specificity," according to the paper.
In the first step, Platypus generates a list of candidate variants by aligning reads to a reference genome. Candidate variants are obtained from read alignments, variants identified by assembly, and known variants stored in public resources. At this stage of the analysis, "we only focused on getting very high sensitivity and we are not worried about generating wrong variants," Lunter said. The candidate variants from this step are used to generate a list of candidate haplotypes that serve as the input for the next phase of Platypus' analysis.
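As a rough illustration of how a permissive list of candidate variants can be turned into candidate haplotypes, the Python sketch below applies every small combination of candidates to a reference window. It is a toy under stated assumptions, not Platypus' own code: the function names, the (position, reference allele, alternate allele) tuple format, and the max_variants cap are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def apply_variants(ref, variants):
    """Apply non-overlapping (pos, ref_allele, alt_allele) variants to ref.

    Positions are 0-based offsets into the reference window.
    Raises ValueError if variants overlap or disagree with the reference.
    """
    pieces = []
    cursor = 0
    for pos, ref_allele, alt_allele in sorted(variants):
        if pos < cursor:
            raise ValueError("overlapping variants")
        if ref[pos:pos + len(ref_allele)] != ref_allele:
            raise ValueError("reference allele does not match window")
        pieces.append(ref[cursor:pos])   # unchanged sequence before the variant
        pieces.append(alt_allele)        # substituted allele
        cursor = pos + len(ref_allele)
    pieces.append(ref[cursor:])          # unchanged tail of the window
    return "".join(pieces)


def candidate_haplotypes(ref, candidates, max_variants=3):
    """Enumerate haplotypes carrying up to max_variants candidates each.

    The reference itself is always included. Subsets that cannot be
    applied (for example, overlapping variants) are skipped.
    """
    haplotypes = {ref}
    for k in range(1, min(max_variants, len(candidates)) + 1):
        for subset in combinations(candidates, k):
            try:
                haplotypes.add(apply_variants(ref, subset))
            except ValueError:
                continue
    return sorted(haplotypes)


if __name__ == "__main__":
    window = "ACGTACGT"
    # Hypothetical candidates: one SNP and one single-base deletion.
    cands = [(1, "C", "T"), (4, "AC", "A")]
    for hap in candidate_haplotypes(window, cands):
        print(hap)
```

The number of haplotypes grows quickly with the number of candidates in a window, which is why the sketch caps the number of variants per haplotype; how Platypus itself bounds this search is described in the paper rather than here.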
In the next step, "we take pairs of haplotypes and [ask for each], 'suppose this was the actual genotype in this local region of the [genome]?'" Lunter explained. If that's true then the reads should align correctly to one of the haplotypes in the pair. Platypus aligns the input reads to both haplotypes and computes a likelihood score that summarizes how well each possible explanation fits the data, he said. This is where "we get our high specificity because we are not relying on getting the alignment directly from the mapper, [rather] we realigned all the reads against all the haplotypes."
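The pair-of-haplotypes scoring that Lunter describes can be sketched in a few lines of Python. This is a deliberately simplified model, not the one in the paper: it assumes each read comes from either haplotype of the pair with equal probability, scores a read by its best ungapped placement with a fixed per-base error rate, and ignores priors, base qualities, and indel-aware realignment. All function names and the error_rate parameter are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def read_likelihood(read, haplotype, error_rate=0.01):
    """Toy P(read | haplotype): best ungapped placement, independent base errors."""
    if len(read) > len(haplotype):
        # Read cannot be placed; treat every base as an error.
        return error_rate ** len(read)
    best = 0.0
    for start in range(len(haplotype) - len(read) + 1):
        window = haplotype[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        matches = len(read) - mismatches
        p = (error_rate ** mismatches) * ((1 - error_rate) ** matches)
        best = max(best, p)
    return best


def genotype_log_likelihoods(reads, haplotypes, error_rate=0.01):
    """Log-likelihood of the reads under every unordered haplotype pair."""
    scores = {}
    for h1, h2 in combinations_with_replacement(haplotypes, 2):
        total = 0.0
        for read in reads:
            # Each read is assumed equally likely to come from either haplotype.
            p = 0.5 * read_likelihood(read, h1, error_rate) \
                + 0.5 * read_likelihood(read, h2, error_rate)
            total += math.log(p)
        scores[(h1, h2)] = total
    return scores


def best_genotype(reads, haplotypes, error_rate=0.01):
    """Return the haplotype pair that best explains the reads."""
    scores = genotype_log_likelihoods(reads, haplotypes, error_rate)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    haps = ["ACGTACGT", "ATGTACGT"]
    reads = ["ACGTAC"] * 3 + ["ATGTAC"] * 3
    print(best_genotype(reads, haps))
```

With half the reads matching each haplotype, the heterozygous pair scores highest because every read is well explained by one member of the pair; if all reads matched the reference, the homozygous reference pair would win instead.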
Besides calling variants accurately, Platypus is faster than existing methods, according to its developers. The software "uses no intermediate files, minimizes access to BAM files and has low memory and CPU requirements, resulting in 5-90x faster processing times than with comparable algorithms," they claim in the paper. Its speed, Lunter told BioInform, is largely due to good software practices. "We tried to read the data in and keep it in memory and not access the file again unless we really have to," he said. Also, the researchers used Python to identify and test the best algorithms for Platypus and then recoded everything in C, which is a faster programming language than Python, he said.
Lunter and his colleagues began developing Platypus about three years ago. It is now part of the standard bioinformatics pipeline used at WTCHG and has been used in projects both internally and abroad. Part of its attraction, Lunter said, is that the software does not require large quantities of compute power to run and it's easy to use, which sets it apart from some existing solutions that require expert help to install and run.
According to the developers, the software has been extensively tested on whole-genome, exon-capture, and targeted capture datasets in studies focused on conditions such as epilepsy and familial hypocalciuric hypercalcemia, and on mutations associated with breast and ovarian cancers. It has also been run on large datasets as part of the 1000 Genomes and WGS500 projects. Currently, Platypus is being used in the Mainstreaming Cancer Genetics program, a three-year collaborative effort launched last year with partners from academia and industry that aims to make genetic screening a routine part of cancer care.
For their next steps, the developers intend to develop a module for calling somatic variants in cancer. Platypus has already successfully been used for this purpose, but the developers plan to improve its abilities in this regard, Lunter said. They are also interested in designing a module for analyzing data produced by Oxford Nanopore's newly minted sequencing instrument.