The 1000 Genomes Project is not only generating a wealth of data on human genetic variation, but is also the impetus behind a new software toolkit intended to give programmers a flexible framework for quickly writing tools for analyzing next-generation sequencing data.
Stymied by the massive amounts of data generated by the 1000 Genomes Project — which has so far produced nearly five terabases — researchers at the Broad Institute began developing a programming framework “that would structure how programs were written for next-gen sequencing,” Mark DePristo, a researcher at the Broad, told BioInform.
DePristo and his colleagues published a paper on the framework, called the Genome Analysis Toolkit, or GATK, in Genome Research last week.
GATK is based on the idea that many tools for analyzing next-generation sequencing data access data in very similar ways. As a result, it separates data access patterns from analysis algorithms, and provides developers with a set of "data traversals" and "data walkers" that they can string together to provide programmatic access to analytical tools.
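In spirit, that separation works something like the following sketch, in which the engine owns the traversal and the walker owns the analysis. The class and function names here are invented for illustration; GATK's actual API is Java and differs in its details.

```python
# Toy sketch of separating data traversal from analysis logic.
# Names are illustrative only, not GATK's actual interfaces.
class CoverageWalker:
    """Analysis code: what to do at each reference position."""

    def map(self, locus, reads):
        # Per-locus work: count the reads covering this position.
        return len(reads)

    def reduce(self, value, accumulator):
        # Fold per-locus results into a running total.
        return accumulator + value

def traverse_loci(walker, loci):
    """Engine code: how to move over the data. The walker never needs
    to know how loci are fetched, sharded, or parallelized."""
    acc = 0
    for locus, reads in loci:
        acc = walker.reduce(walker.map(locus, reads), acc)
    return acc

# Tiny in-memory stand-in for a stream of (locus, overlapping reads).
loci = [("chr1:100", ["r1", "r2"]), ("chr1:101", ["r2"]), ("chr1:102", [])]
total_coverage = traverse_loci(CoverageWalker(), loci)
print(total_coverage)  # 3
```

A different analysis would supply a different walker, while the traversal code stays untouched.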
The software has been "extensively tested" on data from the Illumina Genome Analyzer, Life Technologies SOLiD, Roche 454, and Complete Genomics platforms, the Broad authors noted.
"About a year and a half ago, it became clear at the Broad that we were having a difficult time writing tools that were able to handle the amount of data, didn’t consume a lot of memory, and were easy to distribute and use," DePristo said.
GATK is based on MapReduce, a framework developed by Google to support distributed computing on large data sets across clusters of machines.
In addition to building the framework, DePristo and his team developed a series of tools on top of it, including a base quality recalibration tool, a tool that handles local realignment of reads, and two tools for calling SNPs, which he said are used quite often by other researchers.
One feature of GATK is that it breaks up terabytes of data into smaller, more manageable kilobase-sized pieces, called "shards," that contain all the relevant information about a particular region of the genome, such as reference data and information about SNPs.
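Sharding amounts to tiling the genome with fixed-size windows. The sketch below invents a shard size and tuple layout for illustration; GATK's internal shard representation is not described here.

```python
# Illustrative sketch of splitting a genomic region into fixed-size
# "shards." The shard size and (chrom, start, end) layout are
# assumptions for the example, not GATK internals.
def make_shards(chrom, start, end, shard_size=10_000):
    """Yield (chrom, shard_start, shard_end) tuples covering [start, end)."""
    pos = start
    while pos < end:
        yield (chrom, pos, min(pos + shard_size, end))
        pos += shard_size

shards = list(make_shards("chr1", 0, 25_000))
print(shards)
# [('chr1', 0, 10000), ('chr1', 10000, 20000), ('chr1', 20000, 25000)]
```

In practice each shard would also carry the reference bases and known variant sites for its window, so a worker can process it without touching the rest of the genome.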
Another feature provides several options for users to parallelize tasks. "With interval processing, users can split tasks by genomic locations and farm out each interval to a GATK instance on a distributed computing system, like the Sun Grid Engine or Load Sharing Facility," the authors write in the Genome Research paper.
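The interval-processing idea can be mimicked in a few lines: each genomic interval becomes an independent work unit handed to its own worker. Here local threads stand in for the separate GATK instances a cluster scheduler would launch, and the per-interval analysis is a placeholder.

```python
# Hedged sketch of interval-level parallelism. Each interval is handled
# by an independent worker, mimicking farming intervals out to separate
# GATK instances on a cluster; analyze_interval is a stand-in for a
# real analysis such as SNP calling.
from concurrent.futures import ThreadPoolExecutor

def analyze_interval(interval):
    chrom, start, end = interval
    # Placeholder "result": just report the interval and its width.
    return (chrom, start, end, end - start)

intervals = [("chr1", 0, 10_000), ("chr1", 10_000, 20_000), ("chr2", 0, 5_000)]

with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(analyze_interval, intervals))
print(results)
```

Because the intervals are independent, the same pattern scales from local threads to hundreds of cluster jobs without changing the analysis code.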
Other features let users merge several BAM files, combine multiple sequencing runs and other input files into a single analysis, and specify active intervals on the command line in common formats or a user-defined format.
GATK is available as open source software from the Broad’s website.
In addition to being used for the 1000 Genomes Project and the Cancer Genome Atlas, GATK has been used in studies of mosquito parasites at the Broad and in chimpanzee genomic variation studies.
“[Researchers] seem to really like [GATK] because the tools are very general and they do a variety of things that are useful for next-gen sequencing analysis,” DePristo said. “Having reliable software that can easily run on certain parts of the genome or genome wide is very important.”
Jan Aerts, a senior bioinformatician at the Wellcome Trust Sanger Institute, is using the GATK for exome resequencing studies.
“We look for single nucleotide polymorphisms that seem to appear only in cases and not in controls as well as for SNPs that only appear in a certain configuration,” he told BioInform via e-mail. “We can then look at the ones that are predicted to have an effect by discarding those that are intergenic or in any other non-coding region, depending on the disease we're working on.”
The first step, said Aerts, is coming up with a list of SNPs by aligning reads to a reference genome. The difficulty, he said, is that this alignment yields many false positives: calls that look like SNPs but really aren't.
“This is where the GATK plays such an important role,” he said. “[The tool] allows you to correct for several artifacts and to normalize the data in different ways.”
Aaron McKenna, a researcher at the Broad and one of the developers of GATK, presented the tool to researchers at the Intelligent Systems for Molecular Biology conference held in Boston earlier this month. He told BioInform via e-mail that the questions and comments he got after the presentation came mostly from new users, with a few from experienced ones.
“I had questions from new users who were interested in finding out how to get the GATK, and a couple of questions about using the GATK with different underlying file systems and execution managers like Hadoop” — an open source implementation of MapReduce, he said.
DePristo added that researchers are also interested in the kinds of constraints the framework has.
“If you are doing local realignment, for instance, this really only can be run chromosome by chromosome, so the nature of the problem you are solving to some degree constrains the parallelization you can adopt,” he explained. “The engine handles that by providing you with access patterns that can or cannot be parallelized, and depending on which ones you use, you get the level of parallelization that the pattern supports.”
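One way an engine can enforce constraints like this is to have each analysis declare the access pattern it needs, and let the engine decide how finely the work may be split. The mechanism below is invented to illustrate that idea, not a description of GATK's implementation.

```python
# Invented sketch: an analysis declares its access pattern, and the
# engine refuses to split work more finely than that pattern allows.
PATTERN_GRANULARITY = {
    "per_locus": "shard",            # independent loci: split freely
    "per_chromosome": "chromosome",  # e.g. local realignment: one unit per chromosome
}

def plan_units(pattern, chromosomes, shards_per_chrom=2):
    """Return the work units the engine is allowed to run in parallel."""
    if PATTERN_GRANULARITY[pattern] == "chromosome":
        return list(chromosomes)  # coarse: one work unit per chromosome
    return [f"{c}:shard{i}" for c in chromosomes for i in range(shards_per_chrom)]

print(plan_units("per_chromosome", ["chr1", "chr2"]))  # ['chr1', 'chr2']
print(plan_units("per_locus", ["chr1"]))  # ['chr1:shard0', 'chr1:shard1']
```

A chromosome-scoped analysis thus caps out at one job per chromosome, while a per-locus analysis can be split as finely as the sharding allows.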
DePristo said his team plans to develop additional tools on the GATK framework.
“At the engine level, we are trying to scale it up. We are certain that we can do tens of thousands of samples simultaneously and this is just a memory and CPU-management challenge,” he said. “We are [also] focused on [developing analysis] tools for really making sure that the 1000 Genomes Project — whose goal it is to do 2,500 individuals — will actually succeed.”
To accomplish that goal, DePristo said his team will have to develop tools that identify insertions and deletions as well as structural variations in the genome. He also said the team plans to develop tools that can support “complex data relationships” in datasets.
“The classic trio study design that most [researchers] use — you sequence a mother, a father, and a child — provides you with lots of information [about] genotypes for figuring out which chromosome each variant lies on,” he said. “We are interested in moving more into that space because that’s a very good way to get more really gold standard datasets.”