CHICAGO (GenomeWeb) – The Broad Institute this week released GATK4, the much-anticipated version 4 update to the Genome Analysis Toolkit, following a two-year redesign.
"It involved a fairly extensive reengineering of the underlying framework," said Geraldine Van der Auwera, associate director for outreach and communication for the Broad's Data Sciences Platform.
"We pretty much wrote everything from scratch," added Eric Banks, senior director of the Data Sciences Platform.
The changes are meant to address three issues Broad scientists and the 55,000-user GATK community identified. They wanted the update to be fully open-source, to offer better performance, and to add new analytics capabilities.
"We've committed that all software produced by the Data Sciences Platform at the Broad would now be fully open-source," Banks said of a policy that took effect in May 2017.
"In the past, we had focused just on the germline side and non-somatic small variants like SNPs and indels," Banks explained. "Now we have decided to expand and focus on two larger variant types, copy number variations and structural variations, in the somatic world and in RNA." GATK now also covers cancer and targeted genome sequencing assays, in addition to germline sequencing.
"Traditionally, you get one software package developed by a different group for each type of variant, but when you're doing your analysis, you care about all the variants together," Van der Auwera said.
"Being able to provide people with a toolkit that covers all of the use cases means they don't have to deal with a multiplicity of packages that need to be updated, that may not always be compatible in terms of the inputs and outputs," said Van der Auwera, a microbiologist by training. "Being able to provide kind of a one-stop shop for all your variant discovery needs is part of what we're trying to do with GATK4."
The Broad Data Sciences Platform team, working with a variety of academic and commercial partners, including some of the world's largest cloud hosts, has optimized the new GATK to run in multiple settings, including in the cloud, on consumer-grade PCs, and in high-performance computing environments.
"We realize that genomics in general is going to the cloud. We worked really hard to make GATK4 run on a variety of different cloud infrastructures," said Broad Chief Data Officer Anthony Philippakis.
One of Broad's technology partners is Alibaba, the Chinese internet behemoth. During a Facebook Live webcast to introduce GATK4 on Tuesday, Alibaba cloud computing expert Heshan Lin said that his company is working with Intel and the Broad to establish a GATK community in China.
"One of the things that we are trying very hard to do with GATK4 — and this goes back to the open source — [is that] we see GATK4 as truly a community-driven effort," Philippakis said. "It's a resource for the world. We want to see it contributed to and extended by a multiplicity of groups in both commercial and academic settings."
To this end, each GATK4 tool has two implementations, according to Banks: a Spark-based version designed to address scalability, and a traditional non-Spark version. "This allows users to be flexible in how they run it," he said.
"For an individual user, [Spark] comes with the burden of having to set up a Spark cluster, which not necessarily everyone wants to do, and that's why we wrote it to have two modes of operating, one that's Spark-based and one that isn't," Philippakis said.
"We're expanding the scope of possibilities here. We want to make sure that GATK4 runs great in a public cloud, but we want to stress that that didn't come at the expense of being able to run it in on-prem environments," explained Philippakis, a cardiologist and bioinformaticist.
In terms of performance, the Broad claims that GATK4 can conduct germline variant discovery for large cohorts as much as 15 times faster than the previous version, thanks to a new data store that the Intel-Broad Center for Genomic Data Engineering developed to relieve a longtime processing bottleneck.
"What was added there in GATK4 is the ability to scale much better to call large cohorts of genomes," Van der Auwera explained. "[This means] being able to include more samples in your analysis and have that done faster."
GATK4 also incorporates machine learning and neural networks.
"One of the things we're excited about with GATK4 is that there is a whole suite of new and cutting-edge tools, many of which leverage advanced machine-learning capabilities," Philippakis said. These include a copy-number caller and a deep learning-based variant caller, both in prototype form, he added. The Broad plans to release these tools in the near future.
"When it comes to tools to enable precision medicine, we feel that GATK is very well-positioned to drive the next generation of scientific advances around human health and disease," Philippakis said.