Researchers Assess Variant Calling Pipelines Using Genome in a Bottle 'Gold Standard'


NEW YORK (GenomeWeb) – Researchers looking to develop an accurate sequencing pipeline for cancer genomes have compared combinations of aligners and variant callers on datasets generated by the National Institute of Standards and Technology (NIST) to determine which combination performs best.

A team from the University of Texas at Austin and Yonsei University in Seoul tested combinations of three aligners and four variant callers on 12 datasets that the NIST Genome in a Bottle Consortium generated for one individual, and recently published its findings in Scientific Reports.

The researchers found that the pipelines have different biases, but in overall performance for SNP calls from Illumina datasets, the combination of the BWA-MEM aligner and the Samtools variant caller outperformed the others.

NIST published Genome in a Bottle reference datasets earlier this year. The data is essentially a set of true calls on one genome that researchers can use to gauge the accuracy of different sequencing and bioinformatics pipelines.

Insuk Lee, an assistant professor of biotechnology at Yonsei University and an author of the recent paper, told GenomeWeb that before NIST's reference datasets, judging the accuracy of NGS technologies and bioinformatics pipelines was difficult, since there was no truth set to compare the results to.

Lee added that the group performed the comparison study because lead author Sohyun Hwang had been trying to develop a cancer analysis pipeline, but there had previously been no way to benchmark the results.

In the study, the researchers compared three aligners (BWA-MEM, Bowtie2, and Novoalign) and four variant callers (GATK-HC, Samtools, Freebayes, and the Torrent Variant Caller, or TVC) on 12 datasets. The datasets included five whole genomes sequenced on Thermo Fisher's Ion Proton, the HiSeq 2000, or the HiSeq 2500, as well as seven exomes sequenced on either the HiSeq 2000 or HiSeq 2500 using one of two exome capture kits, Nimblegen's SeqCap EZ Human Exome or Agilent's SureSelect.

Lee noted that the group only had one sample sequenced by the Proton, and it was at a low sequencing depth, so comparing the two sequencing technologies was not possible.

Overall, Lee said, the study "demonstrated that clinical decisions based on patient genomes could be different depending on what sequence analysis tool is used," which could be a "major hamper in the practical application of clinical genomics."

In the study, the researchers wanted to evaluate variant calling pipelines using multiple datasets, generated with different exome capture methods, coverage, and sequencing technologies, so they could draw a conclusion that could be generalized across multiple genomes.

Other researchers have also used the NIST reference material to compare aligners and variant callers, including a group from the University of Nebraska. But according to the authors of the recent paper, that group looked at just one dataset. In addition, the researchers in the current study used a different performance metric than the Nebraska group: the "area under a precision-recall curve (APR)," which takes into account the trade-off between positive predictive value and sensitivity.
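To make the APR metric concrete, here is a minimal sketch of how an area under a precision-recall curve can be computed from a ranked list of variant calls. It is an illustration only, not the authors' code: the function name, the input layout, and the step-wise summation (each gain in recall weighted by the precision at that point) are assumptions, and the paper may integrate the curve differently. Ties between quality scores are not handled specially.

```python
def area_under_pr_curve(calls, n_truth):
    """Approximate the area under a precision-recall curve.

    calls   -- list of (quality_score, is_in_truth_set) pairs, one per call
    n_truth -- number of variants in the truth set (hypothetical input)
    """
    # Sweep the score threshold from strictest to most lenient.
    ranked = sorted(calls, key=lambda c: c[0], reverse=True)
    true_pos = 0
    false_pos = 0
    area = 0.0
    prev_recall = 0.0
    for _score, is_true in ranked:
        if is_true:
            true_pos += 1
        else:
            false_pos += 1
        precision = true_pos / (true_pos + false_pos)  # positive predictive value
        recall = true_pos / n_truth                    # sensitivity
        # Step-wise sum: recall gained at this threshold times current precision.
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Toy example: four calls, three of them in a truth set of four variants.
toy_calls = [(50, True), (40, True), (30, False), (20, True)]
toy_apr = area_under_pr_curve(toy_calls, n_truth=4)
print(f"APR = {toy_apr:.4f}")
```

A pipeline that ranks every true variant above every false call and misses none would score 1.0 under this definition, so a score such as 0.998 indicates calls that are both nearly complete and nearly free of false positives.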

The team first downloaded sequence datasets of the NA12878 individual that were generated by the Genome in a Bottle consortium on the HiSeq and Proton platforms.

They used seven datasets from the HiSeq 2000, four from the HiSeq 2500, and one from the Ion Proton. At the time of the study, only one whole-genome dataset had been generated on the Proton. 

Next, the team ran each of its pipelines on the datasets and generated APR scores for SNPs and indels.

For SNP calls from Illumina data, the researchers found that the BWA-MEM and Samtools pipeline showed the best overall performance, with an average APR of 0.998. The Freebayes variant caller also had good SNP calling performance for all the different aligners on the Illumina platforms. For SNP calls from the Proton data, Samtools outperformed all the others, including Thermo's TVC method.

Lee said one interesting finding was how much one pipeline's performance could vary across different datasets. Each dataset had a different "best" pipeline, he said, indicating "that we may also need to investigate how sequencing depth and exome-capture protocol affect variant calling for each variant calling pipeline."

In addition, he said, each caller had unique biases for SNP calling. The researchers found three different types of biases: ignoring the reference allele, adding the reference allele, and other SNP calling errors. Ignoring the reference allele results in a homozygous SNP call when the actual SNP call should be heterozygous. Adding the reference allele is essentially the opposite — a heterozygous call that in fact is homozygous in the gold standard.
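The three error types can be expressed as a small classification rule. The sketch below is an assumption-laden illustration, not the study's implementation: genotypes are modeled as pairs of alleles, and the function name and labels are hypothetical.

```python
def classify_snp_call(ref, called, truth):
    """Classify one SNP genotype call against the gold standard.

    ref    -- reference allele, e.g. "A"
    called -- called genotype as a pair of alleles, e.g. ("G", "G")
    truth  -- gold-standard genotype as a pair of alleles, e.g. ("A", "G")
    Returns "correct", "IR" (ignored reference), "AR" (added reference),
    or "other".
    """
    if sorted(called) == sorted(truth):
        return "correct"
    called_set, truth_set = set(called), set(truth)
    # IR: truth is heterozygous with the reference allele, but the call
    # dropped the reference and came out homozygous.
    if ref in truth_set and ref not in called_set and called_set == truth_set - {ref}:
        return "IR"
    # AR: truth is homozygous non-reference, but the call added the
    # reference allele and came out heterozygous.
    if ref not in truth_set and ref in called_set and truth_set == called_set - {ref}:
        return "AR"
    return "other"


print(classify_snp_call("A", ("G", "G"), ("A", "G")))  # homozygous call, heterozygous truth
print(classify_snp_call("A", ("A", "G"), ("G", "G")))  # heterozygous call, homozygous truth
```

Under this rule, a caller that skews toward IR errors produces too many homozygous calls, while one that skews toward AR errors produces too many heterozygous calls.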

In the datasets, there were a total of 19,851 erroneous SNP calls, of which 7,290 were errors that ignored the reference (IR) and 9,917 were errors that added the reference (AR).
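As a quick check on these figures, the short calculation below derives each category's share of the total. It assumes that every erroneous call not counted as IR or AR belongs to the third category of other SNP calling errors; the article does not state that count directly.

```python
total_errors = 19_851
ir_errors = 7_290   # ignored the reference allele
ar_errors = 9_917   # added the reference allele

# Assumption: the remainder is the "other SNP calling errors" category.
other_errors = total_errors - ir_errors - ar_errors


def share(count, total=total_errors):
    """Percentage of all erroneous SNP calls, rounded to one decimal place."""
    return round(100 * count / total, 1)


for label, count in [("IR", ir_errors), ("AR", ar_errors), ("other", other_errors)]:
    print(f"{label}: {count:,} ({share(count)}%)")
```

By this arithmetic, AR errors make up about half of all erroneous SNP calls and IR errors a little over a third.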

The Freebayes caller skewed toward IR errors, while GATK-HC and Samtools both had more AR errors, suggesting that "we need to be more cautious about homozygous SNP calls using Freebayes and heterozygous SNP calls by GATK-HC and Samtools," Lee said.

For indels, the pipelines showed even larger performance differences. The GATK-HC pipeline without an aligner performed the best for the Illumina platform, while Samtools performed best for the Proton.

For both SNP and indel calling, the variant caller had a bigger effect on accuracy than the aligner, the researchers noted.

Looking at concordance between the pipelines, the researchers found that for the Illumina datasets, the pipelines were more concordant than previous reports had indicated. For instance, previous studies have reported only 57 percent and 70 percent concordance levels between different variant calling platforms for Illumina sequence data. However, in this study, the researchers found around 92 percent concordance. The difference could be attributed to different versions of software being used, the authors noted.
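The article does not say how concordance was calculated. One common way to express it, shown here purely as an assumption, is the fraction of all variants called by either pipeline that were called by both. The function name and the tuple layout for a variant are hypothetical.

```python
def concordance(calls_a, calls_b):
    """Fraction of variants called by either pipeline that both pipelines called.

    Each variant is a hashable key such as (chromosome, position, ref, alt).
    Returns 1.0 when neither pipeline called anything, since there is
    nothing to disagree on.
    """
    a, b = set(calls_a), set(calls_b)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Toy example: two pipelines share two of four distinct calls.
pipeline_a = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
pipeline_b = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr3", 10, "T", "C")}
print(f"concordance = {concordance(pipeline_a, pipeline_b):.0%}")
```

Note that this measures only agreement between pipelines; two pipelines can agree with each other and still both differ from the truth set.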

Lee added that having the Genome in a Bottle reference datasets should help improve NGS technology and the various algorithms used for alignment and variant calling. Before a reference dataset was available, researchers still compared various methods, but all that could be concluded was how much the techniques differed from each other, Lee said. "Now, analyzing differences between reference variants and called variants will provide important clues to improve NGS and analysis algorithms for variant calling," he said, which is ultimately a "critical step toward precision medicine."