Developers of the Broad Institute’s Genome Analysis Toolkit have launched a beta release of the newest version of the software package and are inviting users to put GATK 2.0 through its paces ahead of a full launch later this year.
The beta release is free for academic use, but commercial groups will have to request an evaluation license, Mark DePristo, who leads the genome sequencing and analysis group in the Broad’s Medical and Population Genetics program, told BioInform this week.
DePristo said that the Broad plans to put a licensing scheme in place that will govern commercial access to the full version of 2.0 when it is launched.
Pricing details for the commercial license are still being discussed, DePristo said, but he expects that the cost will be “within normal range for software in the bioinformatics space” and that it should take effect in “late 2012.”
Historically, GATK has been freely available to all users under a licensing structure put in place by the Massachusetts Institute of Technology.
The shift toward a commercial license has been “largely driven by requests from commercial users of the tools for much greater support,” DePristo said.
The needs of such users “really can’t be met in our current structure,” he noted. “We have many requests from people for capabilities that as an open source academic project, we simply [don’t] have the resources to provide.”
He explained that although updated versions of GATK come out every six to eight weeks, the Broad’s policy dictates that the developers only support the most recent version of the GATK at any point. This poses a problem for users who run the toolkit in restricted environments, for example in CLIA laboratories, and can’t update their toolkit as frequently. It is also an issue for those who don’t have the “tolerance and capacity” to keep up with the academic cycle, he said.
“In order to meet those needs, you really need to set up a model where you can fund those activities because they are another entirely new level of activity beyond developing source code and distributing it for academic research purposes,” he said.
When it kicks in, the commercial license will offer “long-term support for specific versions” so that these users can “feel confident that the version would be well supported for many years,” he said.
For now, commercial users can apply for an evaluation license for GATK 2.0 or use GATK-lite, which contains a subset of the tools available in the beta release and is still available under the existing MIT license.
The lite tool, which is also available to academics, includes all of the capabilities in the current release of the GATK — version 1.6 — but none of the updated tools that are available in 2.0.
Even after 2.0 is launched and the new licensing scheme is in place, the Broad team will continue to provide GATK-lite for free so that users who don’t want to concern themselves with licensing restrictions have access to the programming framework and tools from previous releases.
A Better Looking GATK
DePristo said the new GATK website presents a clearer picture of the tools available in the suite and provides improved discussion forums for the research community to provide feedback and share information with others.
In terms of the software itself, version 2.0 features new tools for variant calling that are “vastly better” than previous releases of the software, as well as an improved data compression algorithm, he said.
The new tools include an updated version of the GATK’s base quality score recalibrator, which recalibrates the standard base quality scores as well as base insertion and base deletion quality scores.
The application “provides per read, per base, an estimate of the chance that the base was inserted with respect to the reference genome or deleted,” providing an “empirically calibrated indel error model and that’s proven to be enormously important” for calling insertions and deletions, DePristo explained.
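As a rough illustration of what an empirically calibrated quality score means, the sketch below bins observations by their reported quality, counts how often an error was actually seen in each bin, and converts that rate to the Phred scale (Q = -10 log10 of the error probability). It is not the GATK's implementation: the function name `empirical_quality`, the `smoothing` pseudocount, and the input format are assumptions made for this example, and the real recalibrator is not described in this article. The same calculation could in principle be applied to substitution, insertion, or deletion events.

```python
import math
from collections import defaultdict


def empirical_quality(observations, smoothing=1):
    """Map each reported quality to an empirically observed Phred quality.

    observations: iterable of (reported_quality, is_error) pairs, where
    is_error marks a disagreement with a trusted reference (hypothetical input).
    smoothing: pseudocount that keeps the estimated rate away from 0 and 1.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [errors, total]
    for reported_q, is_error in observations:
        counts[reported_q][0] += int(is_error)
        counts[reported_q][1] += 1

    table = {}
    for reported_q, (errors, total) in counts.items():
        rate = (errors + smoothing) / (total + 2 * smoothing)
        table[reported_q] = round(-10 * math.log10(rate))
    return table


# Bases reported as Q30 that are wrong about 1% of the time behave like Q20.
observations = [(30, False)] * 989 + [(30, True)] * 9
recalibrated = empirical_quality(observations)
```

Here `recalibrated` maps the reported quality 30 to an empirical quality of 20, showing how an overconfident reported score would be corrected downward.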
A second component of the pipeline is its haplotype caller, which calls SNPs and indels simultaneously via local de novo assembly of haplotypes in an active region.
DePristo explained that this approach “stops us from mistaking indels as SNPs” and “allows us to find true indel alleles even if they are not present in the alignments of the reads themselves.”
Furthermore, “because it's multi sample … if you have low coverage data … it can use all the data across the samples to determine the segregating alleles,” he said.
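The idea of recovering candidate haplotypes by local assembly, rather than from read alignments, can be sketched with a toy k-mer graph. The article does not say which assembly method the haplotype caller uses, so the de Bruijn-style graph below is an assumption chosen purely for illustration, and `build_graph` and `enumerate_haplotypes` are hypothetical helper names. Reads are broken into overlapping k-mers, and each path between a shared start and end k-mer spells out one candidate haplotype.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Link each k-mer to the k-mers that follow it in any read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def enumerate_haplotypes(graph, start, end, max_len=200):
    """Return every sequence spelled by a path from start k-mer to end k-mer."""
    haplotypes = []
    stack = [(start, start)]  # (current k-mer, sequence spelled so far)
    while stack:
        node, seq = stack.pop()
        if node == end:
            haplotypes.append(seq)
            continue
        if len(seq) >= max_len:  # guard against cycles in repetitive sequence
            continue
        for nxt in graph.get(node, ()):
            stack.append((nxt, seq + nxt[-1]))
    return sorted(haplotypes)


# Toy reads from one region: the second carries a 2-base deletion ("CT").
ref_read = "ACGTACCTGGATTCA"
alt_read = "ACGTACGGATTCA"
graph = build_graph([ref_read, alt_read], k=4)
haplotypes = enumerate_haplotypes(graph, start="ACGT", end="TTCA")
```

In this toy case the graph branches after `GTAC` and rejoins at `GGAT`, so `haplotypes` contains both the reference sequence and the deletion-bearing sequence as whole alleles. Because reads from any number of samples can be added to the same graph, pooling low-coverage samples gives the assembly more evidence for each branch.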
The toolkit also includes a new algorithm that uses read-based compression to shrink BAM files “on the order of 20 to 100 fold” while keeping only the information essential for variant calling, DePristo said.
The developers plan to publish a paper that will “articulate what are the advances, how they improve things exactly and in what areas,” DePristo said.
Also available from the website is a resource bundle that provides a collection of standard files for users working with human resequencing data.