This article has been edited to clarify Zhongwu Lai's and Jonathan Dry's job titles.
NEW YORK (GenomeWeb) – AstraZeneca scientists recently published details of methods and benchmarks for a variant calling tool that they developed to more accurately identify complex tumor mutations in DNA and RNA sequencing data even when tumor content is low.
The researchers, along with collaborators at the Harvard School of Public Health, described the so-called VarDict software in a paper published earlier this month in Nucleic Acids Research. They have also made free Perl and Java implementations of the software available in a GitHub repository.
According to the paper, the software can call single- and multi-nucleotide variants and insertions and deletions, as well as more complex mutations such as structural variants and combinations of mutations, which are often mishandled by existing tools. The paper also provides details of comparison tests pitting VarDict against standard tools such as FreeBayes and the Genome Analysis Toolkit's UnifiedGenotyper and HaplotypeCaller. In multiple cases, VarDict showed "consistently improved performance and sensitivity, particularly for indel calling," according to the authors.
AstraZeneca researchers told GenomeWeb that they developed VarDict to validate their internal tumor sequencing pipeline. It will also serve as a benchmarking tool for a next-generation sequencing-based cancer assay that AstraZeneca is developing with Illumina and others. The researchers have also used the software in clinical studies focused on Lynparza (olaparib), the company's FDA-approved PARP inhibitor for ovarian cancer patients with BRCA germline mutations, and on its EGFR inhibitor AZD9291, now Tagrisso (osimertinib).
For the Lynparza study, the challenge was to determine whether BRCA 1/2 mutations could be identified in data from formalin-fixed paraffin-embedded (FFPE) tissue samples, Carl Barrett, leader of AstraZeneca's oncology translational sciences group, explained to GenomeWeb in an interview. In the Tagrisso study, the researchers were trying to pick up low allele frequency mutations in circulating tumor DNA (ctDNA) without having to downsample, that is, randomly remove portions of the sequence data. Downsampling improves computational performance but compromises the sensitivity of variant calling algorithms and increases the risk of missing important mutations.
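To make the downsampling trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the paper: it assumes a simple binomial model in which each read covering a site independently carries the variant with probability equal to the allele fraction, and the depth, allele fraction, and minimum-read threshold are hypothetical values chosen only for illustration.

```python
from math import comb


def expected_supporting_reads(depth, allele_fraction, keep_fraction=1.0):
    """Average number of variant-supporting reads left after keeping
    a random `keep_fraction` of the reads at a site."""
    return depth * keep_fraction * allele_fraction


def detection_probability(depth, allele_fraction, min_reads):
    """Probability of seeing at least `min_reads` variant-supporting reads
    among `depth` reads, under the binomial assumption described above."""
    p_too_few = sum(
        comb(depth, k)
        * allele_fraction ** k
        * (1 - allele_fraction) ** (depth - k)
        for k in range(min_reads)
    )
    return 1.0 - p_too_few


# Hypothetical ctDNA site: 10,000x coverage, variant at 0.1% allele fraction,
# and a caller that needs at least 5 supporting reads (illustrative threshold).
full_depth = detection_probability(10_000, 0.001, 5)
downsampled = detection_probability(1_000, 0.001, 5)  # after keeping 10% of reads
```

Under these assumptions, the chance of collecting enough supporting reads drops sharply once most of the reads are discarded, which is the sensitivity cost described above.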
Specifically, the researchers wanted to be able to identify insertions and deletions as well as combinations of insertions and deletions that are involved in loss-of-function mechanisms in BRCA 1/2 in FFPE samples, Zhongwu Lai, a principal cancer informatics scientist in AstraZeneca's oncology bioinformatics group and first author on the VarDict paper, told GenomeWeb.
These complex mutations are common in tumor samples. According to numbers that the researchers gleaned from the Breast Cancer Information Core's databases, about 6 percent of deleterious BRCA1 mutations and 9 percent of deleterious BRCA2 mutations are complex mutations with combinations of insertions and deletions. Large structural variants account for about another 10 percent of BRCA mutations, they said. Complex mutations also show up in EGFR. For example, in 572 samples from the Cancer Genome Atlas' lung adenocarcinoma cohort that the researchers analyzed, 25 percent of EGFR exon 19 deletions were complex mutations, corresponding to about 1.4 percent of all patients.
It is challenging to search for these mutations in data from FFPE samples, Lai explained, because of the varying percentages of tumor and normal cells from sample to sample and disease to disease. FFPE DNA is also fragmented, which makes it unsuitable for Sanger sequencing and associated variant calling pipelines, and damaged, which could lead to artifacts or false positives.
Initial explorations of patients' genomes typically start with profiling biopsies collected at diagnosis and stored as FFPE blocks, Jonathan Dry, leader of AstraZeneca's global oncology bioinformatics group and the NAR paper's senior author, noted. Fresh samples are preferable but would require additional invasive surgery. That's one of the reasons why blood-based DNA testing is attractive, he said, because it offers a way to monitor tumor progression from a blood draw rather than multiple invasive surgeries. "The ability to accurately use the tumor genetics out of a blood test can offer huge value towards precision medicine and getting the right drug towards the right patient," he said. However, the low tumor content makes it difficult to search for variants using standard software tools without downsampling.
These are areas where VarDict has proved helpful and are among the reasons why AstraZeneca has made the software public. Essentially, VarDict does two things differently from other software solutions, Lai explained. First, it uses heuristic algorithms to look through soft-clipped reads, in which the aligner leaves one or both ends of a read unaligned, for evidence of mutations, he said. Some other methods categorize these reads as misalignments, but they could contain information about insertions and deletions, which is why VarDict pays attention to them, he explained.
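As a rough illustration of why soft-clipped reads are informative, the Python sketch below pulls the clipped bases out of aligned reads and counts how many reads are clipped at the same reference position. It is not VarDict's actual heuristic, which the article does not detail. It assumes reads are given as hypothetical (1-based start position, SAM-style CIGAR string, read sequence) tuples, and the function names are invented for this example.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")  # CIGAR operations that consume read bases
REF_OPS = set("MDN=X")    # CIGAR operations that consume reference bases


def soft_clips(pos, cigar, seq):
    """Return (side, reference_breakpoint, clipped_bases) for each soft clip.

    For a left clip the breakpoint is the first aligned reference position;
    for a right clip it is the first reference position after the aligned block.
    """
    clips = []
    query_offset = 0
    ref_pos = pos
    seen_aligned = False
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "S":
            side = "right" if seen_aligned else "left"
            clips.append((side, ref_pos, seq[query_offset:query_offset + length]))
        if op in "M=X":
            seen_aligned = True
        if op in QUERY_OPS:
            query_offset += length
        if op in REF_OPS:
            ref_pos += length
    return clips


def breakpoint_support(reads):
    """Count how many reads are soft-clipped at each (side, breakpoint)."""
    support = Counter()
    for pos, cigar, seq in reads:
        for side, breakpoint, _ in soft_clips(pos, cigar, seq):
            support[(side, breakpoint)] += 1
    return support


# Toy input: two reads clipped at the same reference position, one unclipped.
reads = [
    (100, "10M5S", "A" * 10 + "GGGGG"),
    (103, "7M4S", "A" * 7 + "GGGG"),
    (100, "15M", "A" * 15),
]
support = breakpoint_support(reads)
```

Several reads clipped at the same position with similar clipped bases are the kind of signal that may point to an insertion, deletion, or larger rearrangement rather than a simple misalignment.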
Also, VarDict represents complex variants, such as combinations of indels, as a single variant rather than as multiple variants. For ctDNA sequences, VarDict's alternative to downsampling is to use an in-memory data structure to represent the different kinds of variants in regions of interest. It then sequentially goes through the reads in the sample that map to the region of interest and updates the data structure as it identifies mutations.
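The single-pass idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not VarDict's implementation: the `RegionTally` class, its method names, and the input format (each read summarized as a start, an end, and a list of already-identified `(position, ref, alt)` differences) are all hypothetical. A complex variant is stored under one key with multi-base `ref` and `alt` alleles rather than being split into separate events.

```python
from collections import Counter


class RegionTally:
    """In-memory tally of coverage and variant observations for one region."""

    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.depth = Counter()     # reference position -> reads covering it
        self.variants = Counter()  # (position, ref, alt) -> supporting reads

    def add_read(self, read_start, read_end, observed):
        """Update the tally with one read; `observed` lists (pos, ref, alt)."""
        first = max(read_start, self.start)
        last = min(read_end, self.end)
        for pos in range(first, last + 1):
            self.depth[pos] += 1
        for pos, ref, alt in observed:
            if self.start <= pos <= self.end:
                self.variants[(pos, ref, alt)] += 1

    def calls(self, min_allele_fraction):
        """Return variants whose allele fraction meets the threshold."""
        result = {}
        for key, count in self.variants.items():
            depth = self.depth[key[0]]
            if depth and count / depth >= min_allele_fraction:
                result[key] = count / depth
        return result


# Toy example: 1,000 reads over a region, 5 of which carry one complex variant
# (a deletion of ACGT combined with an insertion of TG at the same site).
tally = RegionTally(100, 200)
for i in range(1000):
    observed = [(120, "ACGT", "TG")] if i < 5 else []
    tally.add_read(100, 150, observed)
calls = tally.calls(min_allele_fraction=0.001)
```

Because every read is visited once and only the counters stay in memory, a structure like this can in principle use all reads at a deeply sequenced site instead of discarding most of them.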
According to the NAR paper, the researchers were able to call more actionable mutations in lung cancer with VarDict than with other methods. This is based on their analysis of TCGA exome and whole genome data collected from 230 patient samples covering 208 cancer genes. They compared VarDict's calls to those reported in published literature on the TCGA. Their results showed that VarDict called known driver mutations in lung cancer oncogenes such as EGFR and KRAS in 16 percent more cases. It also called more activating indels in EGFR than previous methods had done.
In addition to comparisons with tools like the Broad's GATK, the study also highlights results from a mutation calling challenge organized by the DREAM initiative, which showed that VarDict is as sensitive as Lumpy, a competing tool for calling structural variation.
Dry told GenomeWeb that VarDict also ranked highly in the indel calling challenge of the DREAM initiative. He further noted that the highest ranked method for that particular challenge, which came from Bina Technologies (now owned by Roche), uses VarDict along with several other methods to call variants. The Bina method, called SomaticSeq, combines calls from five algorithms including VarDict. SomaticSeq's developers noted in a Genome Biology paper published last year, which described the method and the results of the DREAM challenge, that VarDict was the "best indel detector" when variant allele frequency was below 50 percent. "That was a really good independent benchmark," Dry said.
VarDict also does well with circulating tumor DNA analysis, according to the developers. They published a study last year in Nature Medicine that used VarDict among other tools to explore resistance mechanisms to AstraZeneca's EGFR inhibitor Tagrisso in non-small cell lung cancer. According to the researchers, VarDict was the only variant caller used for that study "that successfully completed in time — mostly overnight — with the desired sensitivity." Other callers either "failed to run" or were not sensitive enough to detect the resistance mutations at low frequency, they wrote.
This is not a criticism of existing algorithms, Dry stressed. The issue with many of those methods is that they were designed to handle large numbers of individual genomes, he said, but that's different from clinical settings where answers to questions rely on deep sequencing of specific regions of the genome. "A lot of the algorithms designed for those broad genomes can't handle that data," he said.
VarDict is freely available to the scientific community, and AstraZeneca hopes to build an active community around the tool. According to the GitHub pages, since the software was added to the repository, researchers from places like Merck and St. Petersburg Algorithmic Biology Lab have accessed the software. The researchers have also heard positive comments at conferences from colleagues who have used the software, including researchers at Personalis. "We think by releasing it other people will identify issues with it and hopefully lead to even further refinements," Barrett said.