A preliminary comparison of variant-detection algorithms in Illumina's GenomeStudio Software and the Broad Institute's Genome Analysis Toolkit indicates that tools are comparable when it comes to calling SNPs in short-read sequencing data, though GATK outperforms the Illumina software in terms of calling insertions and deletions.
As described in a paper published online in Nature Precedings, Denis Bauer, a University of Queensland researcher, compared the variant-calling capabilities of Illumina's Consensus Assessment of Sequence and Variation, or CASAVA, against the Burrows-Wheeler Aligner pipeline in the Broad's GATK and found that the former did not do as well as latter when used to call indels.
The findings have not yet been peer reviewed.
Bauer told BioInform that she was motivated to compare the two packages following the release earlier this year of Casava 1.8, which includes improvements to its local read realignment capability.
Illumina developed the software to capture summary information for resequencing and counting studies within its GenomeStudio software. The software generates base calls for positions in the genome, calls SNPs, detects indels up to 100 base pairs from paired-end data, and generates lists of counts for reads that align to genomic entities such as genes, exons, and splice junctions.
But in spite of the improvements to CASAVA, academics aren’t including the software in their comparisons with other tools, she told BioInform, arguing that it should be included in more assessments because "it is going to be or already is a solid competitor in terms of SNP calling."
In the paper, the Broad team reported that GATK identified 16,152 known variants in a 1000 Genomes Project sample with a Ti/Tv value of 3.27 while Crossbow identified 16,086 and a Ti/Tv value of 3.26. Additionally GATK found 1039 novel variants while Crossbow found 1806 with Ti/Tv values of 2.57 and 1.94 respectively.
Based on these and other metrics, the researchers concluded that the Crossbow call set "has a lower specificity" than the GATK pipeline.
Even though "CASAVA hasn’t been included in any comparisons yet ... I was really surprised at how good it is and how solid the results are that it's producing," Bauer said. "I think we are currently in transition period where CASAVA will be recognized as a fully fledged software tool."
For her comparison, Bauer used three datasets from chromosome 21 of a Yoruba trio from the 1000 Genomes Project. Each contained about 80 million reads from dbSNP.
She explored whether SNP calls made on BWA-mapped reads with GATK's Bayesian model — which takes multiple samples into account — provided better results than CASAVA's approach, which similarly "realigns reads locally to improve the initial read mapping" but than "calls SNPs for each sample individually."
Comparing calls made by both tools to a set of SNPs from the 1000 Genomes Project for two of the three HapMap samples showed that both methods performed comparably.
In one sample, for instance, NA18507, CASAVA identified 72,709 SNPs while GATK found 74,592 and the 1000 Genomes set had 60,982 SNPs. Calls from GATK and CASAVA disagreed with those in the 1000 Genomes set by 12 percent and 9 percent, respectively.
Bauer reported that GATK also called 15 Mendelian errors — alleles which could not have come from either biological parent by inheritance — while CASAVA reported one. These discrepancies were likely due to erroneous calls made in areas of the genome that have "large fluctuations in coverage," she wrote.
For calling indels, Bauer used a version of Broad's GATK that runs Dindel, a program for calling small indels from short-read sequence data.
Dindel accepts BAM files containing read alignments as input and extracts candidate indels from the file. It then realigns the reads to candidate haplotypes that contain potential indels. Finally, it calls indels if there is enough evidence of an alternative haplotype for the reference.
In comparing the two methods' indel-calling capabilities, Bauer noted that CASAVA adopts a more "conservative" approach, which results in a higher concordance rate with known indels reported in dbSNP.
In one comparison, involving sample NA18507, CASAVA found 12,094 indels of which 8,463 corresponded to known indels while GATK found 15,870 of which 11,280 corresponded to known indels. Additionally, 84 percent of CASAVA's calls were also found in the GATK set.
However, CASAVA predicted a higher percentage of new indels — 38 percent compared to 33 percent by GATK — with more errors that were spread throughout the chromosome rather than localized to areas of fluctuating coverage as in GATK's case. As a result, it is likely that these new indels are errors, Bauer wrote.
She also noted that GATK's Dindel algorithm includes a second realignment step, which improves the software's indel-calling ability because it uses "local realignment for areas that are putative indels."
Additionally, Bauer suspects that GATK called indels better than CASAVA because of its ability to take multiple samples into account, while the latter works on an individual basis.
However she pointed out to BioInform that indel calling is still in its infancy and "there is no silver bullet yet."
She also noted that Dindel is designed for small indels. CASAVA incorporates a method that works for longer indels but Bauer hasn’t compared it to other similar programs and does not plan to at present.
In addition to comparing the variant-calling capabilities, Bauer also compared the mapping abilities of both methods.
Both tools performed comparably, with each mapping about 14 percent of the reads to chromosome 21 with a mean coverage of about 23 for CASAVA and 22 for BWA respectively.
Based on her findings, Bauer concludes that while CASAVA does not "deliver the same quality in the indel-calling set compared to the newly incorporated Dindel-algorithm of GATK," it is useful for SNP calling.
As a result, "it remains the best practice to use CASAVA 1.8 for producing fastq files and switch at this stage to the academic tools for mapping, alignment improvement, and variant calling," she wrote.
[ pagebreak ]
A Biased Comparison?
Dipesh Risal, a senior product manager for informatics at Illumina, told BioInform that incorporating prior information on indels from dbSNP and then comparing the results from both analyses to information in the same database, might have tipped the scales in GATK's favor.
“We are very grateful for the analysis Denis is doing and are currently in discussion with her about fine-tuning the post-analysis," he said in an email. "We would like to see a comparison that is independent of prior information such as dbSNP, rather than realigning around dbSNP indels and afterwards using dbSNP as one of the standards of correctness."
GATK is designed to take prior information on indels into account and to weight calls based on that knowledge. Conversely, CASAVA wasn’t designed to include this information so that it can provide an "unbiased calling method," Risal explained.
He added that databases like dbSNP contain information that’s biased towards well-known genomes and might not take into account novel variants that are present in specific populations that have not yet been sequenced.
Alternative scenarios would include "a comparison to other databases, or capillary sequencing to confirm the indel calls," Risal said. "This would give us a more apples-to-apples comparison."
However, rerunning the analysis without the data from dbSNP "does not change the results very much" and "does not explain the gap in performance between GATK and CASAVA," Bauer said.
She told BioInform that after rerunning the analysis without the dbSNP data, GATK still called 15 percent more indels than CASAVA.
These results are not included in the current paper.
"They are mostly known indels but since the concordance rate went down they are likely to be false positives," she said in an e-mail "However, this can't explain the difference in performance between CASAVA and GATK in general and certainly can't explain the higher number of alternative alleles CASAVA calls ... the difference might indeed come down to GATK using Dindel, which does a second round of local realignment."
Furthermore, Bauer said she ran the analysis using the recommended default settings to generate the results, which most users would likely do. "Users are not interested in how the programs achieve their indel calls and what external knowledge they utilize but which ones give them the best results," she said.
As a next step, Bauer plans next to repeat the CASAVA-GATK comparison on datasets that were sequenced in house at the University of Queensland. She plans to use a new GATK approach that doesn't require hard filtering of variants.
Eric Banks, team leader of the Broad's genome sequencing and analysis group's tool development team, told BioInform that while Bauer did the "right type of analysis" and it was "done correctly," the findings might be biased since GATK was used to produce the 1000 Genomes datasets.
He conceded that the pipeline used in Bauer's analysis isn't the same as the one that was used for the 1000 Genomes project, but noted that "there would never have been a site called in the 1000 Genomes if we hadn’t called it in the first place."
"I might have chosen a less biased base comparison," he added "You can do an analysis of the call set without actually having a reference call set to compare against." For example, "you can ... subset the exome and take a look at some of the calls within the exome ... how many of them are deleterious?"
Additionally, he pointed out that the SNP call sets produced in the analysis are rather "raw" and weren't subjected to filtering, which would usually be done based on many covariates of error.
Similarly, the indel calls seems to have had some filtering but not enough, Banks said.
"The way it works in our pipeline is that the raw calls coming off the Unified Genotyper – GATK's SNP and indel caller — are meaningless," he explained. "We care about post-filtered calls ... the call set [Bauer] produced is pretty raw and isn't one we would have produced in our pipeline, for example."
The Broad has since adopted a new statistical method, other than the Dindel algorithm, for filtering variants that does not require hard filtering, Banks said. Instead it looks at many covariates of error such as grand bias and allele balance, and then "statistically determines which calls look most like true calls."
Banks hasn’t compared the two packages or used CASAVA. However, he pointed out that unlike GATK, CASAVA hasn’t gone through a peer-review process where the research community would have an opportunity to provide feedback.
Have topics you'd like to see covered in BioInform? Contact the editor at uthomas [at] genomeweb [.] com