Researchers from the Broad Institute have updated their existing Genome Analysis Toolkit to include a pipeline to help improve variant calling for next-generation sequence data.
GATK now includes a data analysis pipeline for calling high-quality variants in next-generation sequencing data produced on multiple platforms using diverse experimental designs.
In a paper that was recently published in Nature Genetics, the Broad team, along with colleagues from Brigham and Women's Hospital, Harvard Medical School, and Massachusetts General Hospital, described how they used the framework to analyze variations in three datasets from the 1000 Genomes Project.
The process uses BWA and MAQ for the initial raw read mapping and then utilizes open-source informatics components to perform local realignment around insertions and deletions; base quality score recalibration; and SNP discovery. It also includes machine learning tools to separate true variants from sequencing artifacts.
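The stages described above can be sketched as a sequence of command-line steps. The tool names below follow GATK 1.x-era conventions from around the time of the paper; the flags and file names are simplified illustrative assumptions, not a verbatim recipe.

```shell
# Illustrative sketch of the pipeline stages (flags and file names are
# simplified assumptions, not a verbatim recipe).

# 1. Raw read mapping with BWA (MAQ was used similarly for older data)
bwa aln ref.fasta reads.fq > reads.sai
bwa samse ref.fasta reads.sai reads.fq > aligned.sam

# 2. Local realignment around insertions and deletions
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
    -R ref.fasta -I aligned.bam -o targets.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner \
    -R ref.fasta -I aligned.bam \
    -targetIntervals targets.intervals -o realigned.bam

# 3. Base quality score recalibration (empirical machine error model)
java -jar GenomeAnalysisTK.jar -T CountCovariates \
    -R ref.fasta -I realigned.bam -recalFile recal.csv
java -jar GenomeAnalysisTK.jar -T TableRecalibration \
    -R ref.fasta -I realigned.bam -recalFile recal.csv -o recal.bam

# 4. SNP discovery
java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
    -R ref.fasta -I recal.bam -o raw_snps.vcf

# 5. Machine-learning separation of true variants from artifacts
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
    -R ref.fasta -input raw_snps.vcf \
    -recalFile snp.recal -tranchesFile snp.tranches
```

In practice several of these steps also take known-variant resources (such as dbSNP) as input, which are omitted here for brevity.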
In the Nature Genetics paper, the team pitted the GATK variant-calling pipeline against Crossbow, a package that uses Bowtie for read mapping and SoapSNP for SNP detection, and found that the Crossbow call set had lower specificity than the GATK pipeline.
They suggested that GATK's local realignment and base quality recalibration components would be "likely to improve" Crossbow's results.
This week, BioInform spoke with Mark DePristo, who heads the genome sequencing and analysis group at the Broad, about the specific components that make up the pipeline and ongoing improvements to GATK. What follows is an edited version of the conversation.
Can you describe the development process? Was it pretty straightforward or did you have to go through multiple rounds of tool testing?
It took years of many people’s work. It was not straightforward. Mostly we were experimenting not only with the actual individual aligners, but more with the tools that did the low-level data processing, the variation calling, and then the filtering down to the variants that are real. All were large-scale experimental projects [over] a very long time.
Did you have to tweak the components in the GATK to make them more flexible to the multiple sources of sequence data?
Actually it was the reverse. The GATK was designed as a framework for working with many types of sequencing data.
What makes this framework and the tools you have developed so flexible?
There are a variety of factors. The GATK is a very powerful framework for writing tools for next-generation sequencing analysis and, as part of that, it provides a lot of infrastructure that can be very difficult to build on your own. We are able to develop pretty sophisticated algorithms that incorporate lots of different types of information to figure out the actual right answer for sequencing.
We also have a good group of people at the Broad so there is actually a lot of work involved in this process. In many ways, that’s the secret sauce — to have very good people.
Can you talk about specific components in the GATK that are part of the pipeline?
Some of the key early tools were the base quality score recalibrator, which empirically builds the low-level error model for the machine. That is followed by an indel realigner that repositions reads that are inconsistent with each other; that’s actually a substantial component. There is the SNP caller and an indel caller, and after that, a lot of the pieces that are really critical: the variant quality score recalibrator, which builds the machine error model across all the SNPs, and also tools to evaluate the quality of the variation that’s been found. Those altogether are really critical.
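The variant quality score recalibrator he mentions can be thought of as learning what the annotations of true variants look like at trusted sites and then scoring each new call against that model. Below is a heavily simplified single-Gaussian sketch of the idea; GATK's actual recalibrator fits a Gaussian mixture over several annotations at once, and the annotation values here are invented for illustration.

```python
import math

def fit_gaussian(values):
    """Fit the mean and variance of one annotation (e.g. strand bias or
    depth) at trusted training sites -- a stand-in for GATK's richer
    Gaussian mixture model over many annotations."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var

def log_likelihood(v, mean, var):
    """Log density of a candidate call's annotation under the trained
    model; higher means the call 'looks like' a true variant."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)

# Toy data: annotation values at known-true sites, then two candidates.
training = [10.2, 9.8, 10.5, 10.1, 9.9]   # hypothetical values
mean, var = fit_gaussian(training)
good, suspect = 10.0, 25.0
# The outlier annotation value scores far worse under the model.
assert log_likelihood(good, mean, var) > log_likelihood(suspect, mean, var)
```

The design point is that nothing here hard-codes what a sequencing error looks like: the model is learned entirely from trusted training data, which matches the empirical philosophy described later in the interview.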
The GATK has been in use for some time now, so why is this paper just being published?
We had a view that we wanted to solve the end-to-end problem. We wanted to do this for a variety of experimental designs and that’s a big goal. It’s taken a long time to get there. We have not been motivated by the smallest publishable unit but really have wanted to describe a unified intellectual framework to do it all.
It’s not that any individual component is not going to be upgraded at some point or replaced by something better, but it’s certainly hard to imagine a radically different approach than the one described in the framework. I think that’s what we will be doing for a long time — we will just get better and better at each component.
Are there any specific areas for improvement?
Every item in there is being actively worked on. To some degree our problem isn’t so much that we got to an end point as that we reached a first milestone that gave reasonably good results; all the pieces are still under active development.
What’s described in the paper is nearly a year old in the way we think about the problems. Publications are just very slow.
Do you expect that as new sequencing technologies emerge, you will have to develop some entirely new tools to include in the analysis pipeline?
We might. The tools are very general, so we have no problem running them on a PacBio machine output or anything else really. It’s all just sequencing data at the end of the day.
Were there any surprises in putting this pipeline together?
I think what we were quite surprised by is that what ultimately matters most is building good error models, and doing that from empirical data. That's really the challenge.
You can try to build an error model out of somebody’s head [by identifying] what the machine does and how it makes mistakes and then encode that in some particular way. I think we have taken a very empirical approach. We defined the parameters, what kind of errors can be made, and learned about the relative [error] rates of the machines by looking at other datasets. This has been a huge opportunity and I think it’s one of the key things that made the whole process work very well.
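The empirical approach he describes can be illustrated with a toy recalibration routine: bin observed errors by the machine's reported quality score and recompute the quality from the error rate actually seen in the data. This is a bare-bones illustration, not GATK's recalibrator, which also conditions on covariates such as machine cycle and dinucleotide context.

```python
import math
from collections import defaultdict

def empirical_qualities(observations):
    """observations: iterable of (reported_q, is_error) pairs, one per base.
    Returns {reported_q: empirical_q}, where empirical_q is derived from
    the observed error rate rather than the machine's claimed rate."""
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [errors, total]
    for q, is_err in observations:
        counts[q][0] += int(is_err)
        counts[q][1] += 1
    recalibrated = {}
    for q, (errors, total) in counts.items():
        # Add-one smoothing avoids log(0) when no errors were observed.
        p = (errors + 1) / (total + 2)
        recalibrated[q] = -10 * math.log10(p)   # Phred scale: Q = -10 log10(p)
    return recalibrated

# A machine reporting Q30 claims a 0.1% error rate; if we actually
# observe ~1% mismatches at known sites, the empirical quality is ~Q20.
obs = [(30, True)] * 10 + [(30, False)] * 990
print(empirical_qualities(obs)[30])
```

The "other datasets" used to learn these rates in practice are sites of known variation, so that a mismatch at a novel site is not automatically counted as a machine error.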
You mentioned that the work described in the paper is nearly a year old. What's different about the framework now?
A variety of things. The indel realigner can accept sites that are known to be insertion and deletion polymorphisms and incorporate that in its modeling. The SNP caller is radically better; the underlying mathematics are radically better; all the annotations used to identify actual errors are massively better than they were; all the underlying likelihood calculations are better; even the sequencing data is much better.
What’s been the response from researchers who have used the framework?
I think people are very pleased. The biggest concern people have is that it doesn’t run on radically different data types. It’s designed to do variation detection in germline DNA, so it’s not really optimized for cancer applications and it does not work with RNA sequencing or anything like that.
In principle, the machinery can run but we haven’t done any of that work. There are whole areas of analysis of different data types that we haven’t addressed.
Do you think you might try to optimize the pipeline for these new areas?
We might. Right now germline variation is still a big issue and I would say among the most important in clinical genetics. So it’s certainly not the case that things are done or that the problem is not an important one.