NEW YORK — An international team of researchers has examined how variations in sequencing approaches can influence the ability to accurately detect cancer mutations, providing guidance for the wider community. The team additionally developed a set of reference samples for benchmarking efforts.
Next-generation sequencing is increasingly being adopted to analyze clinical samples, and implementing precision oncology depends on accurately detecting somatic mutations and distinguishing cancer-specific variants from analytical errors. But most previous analyses have focused on particular aspects of somatic variant calling in isolation, rather than the entire process.
Instead, researchers led by Leming Shi at Fudan University and Charles Wang at Loma Linda University School of Medicine examined the process in full. They found that read coverage and variant callers influence whole-genome sequencing reproducibility, while those and other factors affect whole-exome sequencing reproducibility. In a related paper, the researchers also established a set of reference call sets that can be used to benchmark somatic mutation calling approaches.
"We observed that each component of the sequencing and analysis process can affect the final outcome," Shi, Wang, and their colleagues wrote in their paper.
As they reported Thursday in Nature Biotechnology, the researchers used a range of different, real-world approaches to call somatic variants in a matched pair of breast cancer and normal cell lines. They sequenced the cell lines at six centers using three different sequencing platforms for whole-genome sequencing and three Illumina HiSeq models for whole-exome sequencing.
They noted some differences between the centers due to variations in strategy. Choice of library preparation kit, for instance, influenced the average percentage of mapped reads.
They additionally applied three different mutation callers to alignments generated by three different aligners. While there were no noticeable differences between the aligners on exome data, that was not the case for whole-genome data. Bowtie2, they noted, appeared to be more conservative than the Burrows–Wheeler Aligner or Novoalign. After fixing the alignment to BWA, the researchers examined the callers using an O-score that measures the reproducibility of repeated runs. Through this, they found Strelka2 had the best reproducibility in whole-genome sequencing runs but was the worst in exome runs, where MuTect2 instead had the best reproducibility.
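The article does not give the O-score formula, so the following is only a rough illustration of what scoring the reproducibility of repeated runs can look like. This sketch swaps in a mean pairwise Jaccard overlap, which is a stand-in and not the researchers' metric; the function name and the variant tuples are hypothetical.

```python
from itertools import combinations


def mean_pairwise_jaccard(call_sets):
    """Average Jaccard overlap over all pairs of replicate call sets.

    Stand-in metric for illustration only; this is NOT the O-score
    described in the study.
    """
    pairs = list(combinations(call_sets, 2))
    if not pairs:
        raise ValueError("need at least two replicate call sets")
    scores = []
    for a, b in pairs:
        union = a | b
        # Two empty call sets are treated as fully concordant.
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Hypothetical somatic calls as (chromosome, position, ref, alt) tuples.
run1 = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr2", 50, "C", "T")}
run2 = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C")}
run3 = {("chr1", 100, "A", "T"), ("chr2", 50, "C", "T")}

score = mean_pairwise_jaccard([run1, run2, run3])
print(f"mean pairwise overlap: {score:.3f}")
```

A value near 1 would mean the repeated runs report nearly the same variants, while lower values point to calls that appear in only some runs.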
They further examined which variables within the analysis process (not just caller choice, but also machine model, read coverage, G/C content, and more) contributed to O-score variation. Overall, read coverage and callers had the greatest effect on whole-genome sequencing run reproducibility, while additional factors like insert fragment size, G/C content, and the global imbalance value, or GIV, score (a measure of DNA damage), as well as their interactions, affected exome run reproducibility.
In a related study, also appearing in Nature Biotechnology, the researchers generated reference samples and call sets specifically for benchmarking somatic variant calls, noting that existing sets are for germline mutation detection. For this, they sequenced the genomes of a breast cancer cell line and a matched lymphoblastoid cell line using a range of short-read sequencing platforms at seven centers.
They generated calls from 21 replicates using three aligners and six mutation callers, then combined the calls and assigned each a confidence level. In all, the somatic reference call set includes 2.48 billion base pairs.
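As a loose illustration of combining calls from many replicates and labeling them by confidence, the sketch below uses a simple vote fraction. The article does not describe how the researchers actually combined calls or set confidence levels, so the thresholds, the function name, and the variant tuples here are all assumptions.

```python
from collections import Counter


def assign_confidence(call_sets, high=0.9, medium=0.5):
    """Label each variant by the fraction of call sets that report it.

    The vote-fraction rule and both thresholds are illustrative
    assumptions, not the procedure used in the study.
    """
    n = len(call_sets)
    if n == 0:
        raise ValueError("need at least one call set")
    # set() ensures a variant is counted at most once per call set.
    counts = Counter(v for calls in call_sets for v in set(calls))
    labels = {}
    for variant, count in counts.items():
        fraction = count / n
        if fraction >= high:
            labels[variant] = "high"
        elif fraction >= medium:
            labels[variant] = "medium"
        else:
            labels[variant] = "low"
    return labels


# Hypothetical calls from four replicate/aligner/caller combinations.
a = ("chr1", 100, "A", "T")
b = ("chr1", 200, "G", "C")
c = ("chr2", 50, "C", "T")
labels = assign_confidence([{a, b, c}, {a, b}, {a}, {a}])
for variant, label in sorted(labels.items()):
    print(variant, label)
```

In this toy input, the first variant is reported by all four call sets, the second by two, and the third by one, so they fall into the high, medium, and low bins respectively.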
They then validated the call set by mixing tumor and normal cell lines at different proportions prior to sequencing. Overall, they achieved 99.93 percent and 97.5 percent validation rates for SNVs or indels in the high-confidence and medium-confidence sets, respectively. PacBio long-read sequencing further confirmed 99.3 percent of the SNVs and 98.5 percent of the indels in the high-confidence set.
The researchers said this call set can be used to evaluate next-generation sequencing pipelines. They added that the genomic DNA samples they used have been preserved and could be used to develop standard reference materials.
In their reproducibility study, Shi, Wang, and their colleagues additionally made a set of recommendations for detecting cancer mutations via next-generation sequencing. For example, they suggested certain read coverage levels for varying levels of tumor content, both for whole-genome and whole-exome analyses, and indicated when different mutation callers might best serve users.
"Detection of cancer mutations is an integrated process," they added. "No individual component can be singled out in isolation as being more important than any other, and specific components affect and interact with each other."