NEW YORK (GenomeWeb) – In a recent study, researchers from the National Institutes of Health's National Human Genome Research Institute compared the performance of five variant detection programs, looking specifically at their ability to identify variants in pooled sequence datasets.
According to a BMC Bioinformatics paper, the researchers selected the Genome Analysis Toolkit's UnifiedGenotyper tool, CRISP, LoFreq, VarScan, and SNVer for their comparison and evaluated each tool's ability to detect variants in synthetically pooled sequence data generated from single-sample sequencing data. They compared the programs on overall run time and memory usage, balanced accuracy, and sensitivity and specificity for detecting true variants in samples.
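For readers unfamiliar with these metrics, the sketch below shows how they are commonly computed from true-positive, false-positive, true-negative, and false-negative counts. It assumes the standard definition of balanced accuracy as the mean of sensitivity and specificity, which the article does not spell out. The function names and the counts in the usage note are hypothetical and are not taken from the paper.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true variants that the caller detected."""
    if tp + fn == 0:
        raise ValueError("no true variants to evaluate")
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-variant sites that the caller correctly left uncalled."""
    if tn + fp == 0:
        raise ValueError("no non-variant sites to evaluate")
    return tn / (tn + fp)


def balanced_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Mean of sensitivity and specificity (assumed standard definition)."""
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2


# Hypothetical counts: same sensitivity, but more false positives in the second case.
few_false_positives = balanced_accuracy(tp=90, fp=100, tn=900, fn=10)
many_false_positives = balanced_accuracy(tp=90, fp=300, tn=700, fn=10)
```

With these made-up counts, sensitivity is unchanged between the two cases, yet the extra false positives lower specificity and therefore balanced accuracy.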
Through the study, the team sought to offer a useful starting point from which researchers might make choices about which software programs are best suited for datasets comprising multiple individual samples. Pooling samples prior to sequencing is one way that researchers try to shave the costs and time that would otherwise be required to sequence samples individually, they wrote. It also increases the number of genomes and exomes being analyzed at a time and "could offer more comprehensive variant detection and better statistical power for variant association studies of genetic diseases," they added.
There are challenges associated with pooling. Researchers risk missing rare variants in datasets containing sequences from a large number of individuals. Existing software methods adopt various statistical modeling techniques to sidestep this particular problem, and analyzing their performance could provide a clearer picture of "the potential benefits and tradeoffs of using pooled sequencing data," the team wrote. Evaluating existing methods helps identify "optimal variant detection programs and the best methods to run them," which could be used for future studies that use pooled sequencing techniques, they said.
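A back-of-the-envelope calculation illustrates why rare variants are easy to miss in large pools. The sketch below assumes diploid samples that contribute equally to the pool and a variant carried as a single heterozygous copy by one individual. These are simplifying assumptions for illustration, not details from the paper, and the function names are hypothetical.

```python
def singleton_allele_fraction(n_samples: int, ploidy: int = 2) -> float:
    """Expected fraction of reads carrying a variant present on one chromosome copy."""
    if n_samples <= 0 or ploidy <= 0:
        raise ValueError("n_samples and ploidy must be positive")
    return 1.0 / (ploidy * n_samples)


def expected_alt_reads(n_samples: int, per_sample_depth: float, ploidy: int = 2) -> float:
    """Expected number of variant-supporting reads at a site in the pooled data."""
    total_depth = n_samples * per_sample_depth
    return total_depth * singleton_allele_fraction(n_samples, ploidy)


# Assumed pool sizes; the variant fraction shrinks as the pool grows.
fractions = {n: singleton_allele_fraction(n) for n in (2, 8, 16, 32)}
```

Under these assumptions, a singleton in a pool of 16 samples is expected in about 3 percent of reads (1/32). That fraction approaches the range of typical sequencing error rates as pools grow, which is why callers need statistical models to separate real low-frequency variants from noise.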
For the study, the researchers simulated pooled BAM files using Illumina read data from two separately generated datasets. They created several distinct pools that featured varying depths of coverage and numbers of samples per pool to see how all five programs behaved under different scenarios.
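The general idea of building a synthetic pool at a chosen per-sample depth can be sketched as follows. This is a toy illustration only: it works on in-memory lists of read identifiers rather than BAM files, and the function names, the uniform random subsampling, and the depth values are assumptions rather than the procedure the researchers used.

```python
import random


def downsample_fraction(original_depth: float, target_depth: float) -> float:
    """Fraction of reads to keep so a sample lands near the target depth."""
    if original_depth <= 0:
        raise ValueError("original_depth must be positive")
    return min(1.0, target_depth / original_depth)


def make_pool(sample_reads: dict, original_depth: float,
              target_depth: float, seed: int = 0) -> list:
    """Subsample each sample's reads independently, then merge them into one pool."""
    rng = random.Random(seed)  # fixed seed keeps the toy pool reproducible
    keep = downsample_fraction(original_depth, target_depth)
    pool = []
    for sample_id, reads in sample_reads.items():
        for read in reads:
            if rng.random() < keep:
                pool.append((sample_id, read))
    return pool


# Hypothetical input: two samples with placeholder read names.
toy_samples = {
    "sampleA": [f"A_read{i}" for i in range(100)],
    "sampleB": [f"B_read{i}" for i in range(100)],
}
toy_pool = make_pool(toy_samples, original_depth=60, target_depth=30)
```

Varying `target_depth` and the number of entries in `toy_samples` mirrors the two axes the study explored: per-sample depth of coverage and samples per pool.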
In tests focused on detecting single nucleotide variants, GATK, CRISP, and LoFreq offered the highest balanced accuracy of all the methods, with values of 80 percent or higher across test datasets of varying per-sample depth of coverage and numbers of samples per pool. The researchers noted that GATK had the best accuracy, but its run time increased as the number of samples per pool went up.
VarScan and SNVer, in contrast, had balanced accuracy percentages lower than 80 percent. The researchers also reported that when the number of individuals per pool increased, all programs except CRISP had higher false-positive rates, which in turn reduced their overall balanced accuracy. When the coverage for each sample was reduced, the sensitivity of each program dropped, though the number of false-positive calls fell as well.
In terms of detecting rare variants, GATK and LoFreq had the highest sensitivity scores, but GATK also called a large number of false positives compared to LoFreq. Moreover, LoFreq did not require users to specify sample ploidy upfront, making it a more straightforward option for analyzing data containing mosaics (somatic variants that are found in only a fraction of cells), the researchers noted. Meanwhile, VarScan and SNVer had generally lower false-positive rates but called variants with significantly lower sensitivity than the other three programs, the researchers wrote.
In terms of speed and memory use, CRISP and LoFreq had the fastest run times of the programs tested, requiring up to four times less computational time and up to 10 times less physical memory than GATK, which struggled to process pools containing 16 or more samples in a reasonable time frame. "Still, users wanting optimal sensitivity for smaller pools may find GATK to be worth the investment of increased time and memory requirements," the researchers concluded.
The GATK is developed and maintained by the Broad Institute. CRISP, short for Comprehensive Read analysis for Identification of SNPs from Pooled sequencing, was developed by a researcher who was at the Scripps Translational Science Institute at the time and is now in the pediatrics department of the school of medicine at the University of California, San Diego.
LoFreq was developed by researchers at the Genome Institute of Singapore; VarScan, whose name refers to variant detection in massively parallel sequencing data, comes from researchers at Washington University in St. Louis; and SNVer was developed by researchers at the New Jersey Institute of Technology, the Hospital for Sick Children, and the Children's Hospital of Philadelphia.