Researchers from the University of California, San Diego; the Harvard School of Public Health; and the Genomics Institute of the Novartis Research Foundation have developed a freely available bioinformatics pipeline called the Plasmodium Typing Utility Software (Platypus) that uses a mix of well-known and novel algorithms to identify genomic variants in short-read sequencing data from pathogenic and non-pathogenic eukaryotes such as Plasmodium.
The researchers wrote in BMC Bioinformatics that they developed Platypus to help researchers take advantage of "opportunities for more comprehensive analysis of the genomes of simpler eukaryotes" that high-throughput whole-genome sequencing methods have made possible. "Full genome sequencing at 30-40X coverage is now readily achieved," they wrote, and "such coverage allows for the identification of recombination events, the description of [single nucleotide variants] in sequences other than in the exomes, and the detection of small structural variants, including short-length insertion or deletion events." Being able to locate genetic mutations in pathogens that enable them to resist drug and vaccine candidates will provide researchers with a "powerful tool to choose the best combination of agents to treat infectious diseases … study pathogen population dynamics and transmission, as well as engineer new [and better] treatments."
Programs like the Genome Analysis Toolkit, which are accepted as the standard for human genome variant calling, are not as effective on smaller organisms because they are "generally designed to be conservative" in their approach to identifying variants, casting a wide net to ensure that no variants are missed. That works well for the large, complex human genome, but pathogens have much smaller and simpler genomes, and broad parameters for variant calling result in rather noisy data, according to Micah Manary, a graduate student in UCSD's medical scientist training program and one of the authors of the study.
Furthermore, many existing tools are tailored to work with data from diploid genomes, while most pathogens have haploid genomes. There are also differences in copy number variant size (these are much larger in humans than in Plasmodium, for example) as well as differences in the way recombination events occur in pathogens versus humans, all of which have to be taken into account when designing the pipelines, he told BioInform.
Platypus uses the Burrows-Wheeler Aligner to align whole-genome sequences in FastA/FastQ or BAM file formats and the Genome Analysis Toolkit to call single nucleotide variants, in both cases using a set of filtration parameters that the team identified as optimal for identifying true SNVs in eukaryote genomes. To come up with these parameters, the researchers obtained over 15,000 variants from Plasmodium falciparum strains — the member of the genus responsible for malaria in humans — that had been generated and validated using methods other than whole-genome sequencing, for example, microarrays and Sanger sequencing, Manary explained. These variants distinguish the multidrug-resistant P. falciparum Dd2 strain from the P. falciparum reference (3D7 strain), according to the paper.
Next, the researchers wrote a machine learning algorithm that compared whole-genome sequence data from Dd2 to data from the Plasmodium reference, trying multiple combinations of parameters to find the optimal approach for calling the known SNVs. The final set of 17 parameters enables the pipeline to detect known variants with 90 percent sensitivity and 85 percent specificity, according to the team.
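As a rough illustration of that kind of parameter search, the sketch below scores combinations of filter thresholds against a validated truth set and keeps the combination with the best sensitivity plus specificity. It is not the Platypus algorithm: the plain grid search stands in for the team's machine learning approach, and the two thresholds (`min_depth`, `min_qual`), the record fields, and the toy data are all assumptions made for the example, since the article does not list the 17 parameters.

```python
from itertools import product


def passes(call, min_depth, min_qual):
    """Hypothetical hard filter: keep a call only if it clears both thresholds."""
    return call["depth"] >= min_depth and call["qual"] >= min_qual


def score(calls, truth, n_sites, min_depth, min_qual):
    """Return (sensitivity, specificity) for one threshold combination.

    truth   -- set of positions of validated variants (must be non-empty)
    n_sites -- number of evaluated sites, variant or not (must exceed len(truth))
    """
    called = {c["pos"] for c in calls if passes(c, min_depth, min_qual)}
    tp = len(called & truth)   # validated variants that were called
    fp = len(called - truth)   # calls with no validated variant
    fn = len(truth - called)   # validated variants that were filtered out
    tn = n_sites - tp - fp - fn
    return tp / (tp + fn), tn / (tn + fp)


def grid_search(calls, truth, n_sites, depths, quals):
    """Try every threshold pair; return (min_depth, min_qual, sens, spec) of the best."""
    best = None
    for d, q in product(depths, quals):
        sens, spec = score(calls, truth, n_sites, d, q)
        objective = sens + spec  # assumed objective; ties keep the first pair seen
        if best is None or objective > best[0]:
            best = (objective, d, q, sens, spec)
    return best[1:]


# Toy data, invented for illustration only.
demo_calls = [
    {"pos": 10, "depth": 30, "qual": 60},
    {"pos": 20, "depth": 25, "qual": 50},
    {"pos": 30, "depth": 5, "qual": 55},
    {"pos": 40, "depth": 28, "qual": 10},
    {"pos": 50, "depth": 8, "qual": 45},
]
demo_truth = {10, 20, 50}
print(grid_search(demo_calls, demo_truth, 10, [1, 10, 20], [1, 20, 40]))
```

An exhaustive grid becomes impractical as the number of parameters grows, which is one reason a search over 17 parameters would call for something smarter than this sketch.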
The pipeline also includes improved algorithms for calling copy number variants from depth-of-coverage information, including improvements to GC bias correction and a new method of "smoothing" depth-of-coverage data. It also includes an approach for identifying recombination events (also implicated in greater pathogenicity) that involves "identify[ing] fragments with mated pairs that had abnormal insert sizes when they were aligned to a reference genome, especially ones with mated pairs that aligned to two different chromosomes or to vastly distant parts of the same chromosome." The paper includes a description of the pipeline validation process using data from 26 P. falciparum samples.
At UCSD, Manary's team is using Platypus to explore the diversity of P. falciparum and how that contributes to infections in populations in Peru and other countries, he said. They are also exploring the genetic genesis of drug resistance in P. falciparum, trying to discover whether it's the result of a drug-resistant strain flourishing and spreading while weaker strains succumb to treatment, or if the parasites evolve over time to resist the drugs, he said.
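The ideas in this paragraph can be sketched in a few lines, with the caveat that the article does not describe how Platypus implements them. In the sketch below, the GC correction (rescaling each window by the median depth of its GC bin) and the running-median smoother are common generic techniques used as stand-ins, not the team's improved methods, and the function names, the bin count, and the `max_insert` cutoff are assumptions chosen for illustration.

```python
from statistics import median


def gc_correct(depth, gc, n_bins=10):
    """Rescale per-window depth so every GC bin has the same median depth.

    depth -- read depth per genomic window
    gc    -- GC fraction (0.0 to 1.0) of the same windows
    """
    def bin_of(g):
        return min(int(g * n_bins), n_bins - 1)

    overall = median(depth)
    by_bin = {}
    for d, g in zip(depth, gc):
        by_bin.setdefault(bin_of(g), []).append(d)
    bin_median = {b: median(v) for b, v in by_bin.items()}
    corrected = []
    for d, g in zip(depth, gc):
        m = bin_median[bin_of(g)]
        corrected.append(d * overall / m if m > 0 else 0.0)
    return corrected


def smooth(values, half_window=2):
    """Running median; the window is truncated at both ends of the list."""
    return [
        median(values[max(0, i - half_window): i + half_window + 1])
        for i in range(len(values))
    ]


def classify_pair(chrom1, pos1, chrom2, pos2, max_insert=1000):
    """Label a mated read pair by where its two ends aligned.

    max_insert is a hypothetical cutoff for a normal insert size.
    """
    if chrom1 != chrom2:
        return "interchromosomal"  # ends aligned to two different chromosomes
    if abs(pos2 - pos1) > max_insert:
        return "distant"           # same chromosome, abnormally far apart
    return "concordant"
```

Windows whose corrected, smoothed depth departs clearly from the genome-wide level are candidate copy number variants, while clusters of "interchromosomal" or "distant" pairs are candidate recombination breakpoints. A real pipeline would read these values from aligned BAM records rather than from plain lists.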
Although the researchers chose P. falciparum as their development and testing ground, the pipeline can be adapted to work with data from other eukaryotes. One group, for instance, is using the tool to explore the
The team is also making improvements to the software. For example, they want to incorporate tools that would provide users with more biological information about observed changes, not just noting that the mutations are present. "We want to be able to say things like … does [this mutation] change the protein folding prediction of that gene, [or] does it make something that's acidic into something that's basic," Manary said. They also plan to provide more detailed instructions and tools that would help potential users adapt the system to work with data from other eukaryotes, he said.