NEW YORK (GenomeWeb) – Investigators from sequencing firm Personal Genome Diagnostics, with collaborators at Johns Hopkins and Memorial Sloan Kettering, have described a novel machine-learning tool they developed to automate and improve NGS variant calling.
They claim that the new caller enables a hands-off and extremely precise assessment of whether variants are legitimate or merely artifacts of the sequencing process.
In a study published last week in Science Translational Medicine, the authors described the tool and reported data from a variety of tests, including comparisons with other variant assessment tools and with commercial sequencing results, and a reanalysis of TCGA data from nearly 1,400 individuals.
John Simmons, director of translational science and diagnostics at PGDx, said that the development of Cerebro was a necessary step in the company's plans to develop FDA-approved sample-to-answer kits that will allow it to globally disseminate its various NGS cancer assays.
This includes tests run on both tumor tissue and blood samples, covering both individual mutation detection and analysis of the number of mutations in a patient's sample to calculate tumor mutational burden, or load, which has been rapidly expanding as a clinical biomarker for predicting immunotherapy response.
"This is one of the steps we had to take in order to take the PGDx expertise and put it in a box to send to labs around the world," Simmons said. "A big surprise to me in this work has been discovering how often humans are still used in many clinical labs to actually manually curate variants after they are called by some sort of bioinformatics tool."
In other words, even if the sequencing technology, the chemistry, and the variant calling bioinformatics are standardized into a kit, that still leaves a final step of double-checking those variant calls, which introduces variability — either between various commercial players offering NGS services in the field today, or, in PGDx's case, from user to user of its planned products.
It's a clear roadblock for bringing a kit through the FDA, Simmons said. "Going through variants individually and looking up artifact patterns to pick out the bona fide mutations from artifactual calls … that's just really hard to regulate."
Enter Cerebro, which replaces these kinds of manual final checks with random forest classification, an approach that assesses the very large spectrum of potential artifact patterns, examining all the ways a particular variant call could be wrong and generating a confidence score for each candidate.
Investigators wrote that they trained the model using a normal peripheral blood DNA sample, for which exome regions were captured and sequenced twice. More than 30,000 somatic variants at mutant allele fractions from 1.5 to 100 percent were then introduced in silico into one of the NGS data sets to provide the classifier with a training set of "tumor-specific" mutations. The model was also fed a "real-world representative" set of more than 2 million NGS errors and artifacts that might otherwise be mislabeled as variants.
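As a rough illustration of the random forest idea described above — and not PGDx's actual implementation — a classifier can be trained to separate true variants from sequencing artifacts and to emit a per-candidate confidence score. The features, distributions, and thresholds below are entirely hypothetical:

```python
# Sketch of a random-forest variant classifier. All features and data are
# invented for illustration; Cerebro's real feature set is not shown here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Each candidate variant is summarized by hypothetical quality features:
# [mutant allele fraction, mean base quality, distinct supporting read pairs,
#  local alignment mismatch rate]
n = 2000
true_vars = np.column_stack([
    rng.uniform(0.015, 1.0, n),   # MAF from 1.5 to 100 percent
    rng.normal(35, 2, n),         # high base quality
    rng.integers(5, 60, n),       # many distinct supporting read pairs
    rng.normal(0.01, 0.005, n),   # low mismatch rate
])
artifacts = np.column_stack([
    rng.uniform(0.005, 0.1, n),   # artifacts cluster at low MAF
    rng.normal(22, 4, n),         # poorer base quality
    rng.integers(1, 5, n),        # few distinct supporting read pairs
    rng.normal(0.08, 0.02, n),    # higher mismatch rate
])
X = np.vstack([true_vars, artifacts])
y = np.array([1] * n + [0] * n)   # 1 = somatic variant, 0 = artifact

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# predict_proba yields a confidence score per candidate, analogous to the
# per-variant score described in the article.
candidate = np.array([[0.25, 34.0, 20, 0.01]])  # resembles a real variant
score = clf.predict_proba(candidate)[0, 1]
print(round(score, 2))
```

In this toy version, a candidate whose features fall in the artifact regime (low allele fraction, few supporting read pairs, poor base quality) would receive a score near zero and be filtered out rather than passed to a human reviewer.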
The hoped-for end result is that "when you integrate the Cerebro tool in the suite of tools we have put together as software, you get an analytical report at the end that has called these variants in a highly accurate and reproducible way," Simmons said. "It's something we realized we had to do to have an FDA-cleared product, especially if it is going to include something like" tumor mutational burden.
In their study in STM last week, investigators compared Cerebro results to the output of other existing methods for somatic mutation identification — using both simulated data and experimentally validated whole-exome and targeted gene sequencing data from clinical cancer specimens.
The team performed some initial studies using NGS exome data from a set of six normal cell lines with in silico somatic mutations spiked in. This allowed them to study how various methods performed in simulated low-purity tumors, to calculate false-positive rates for the various tools, and to examine each tool's sensitivity for specific mutation types and allele frequencies. "In all cases, the location, type, and level of in silico alterations were different from those used in the training of the Cerebro algorithm," the authors wrote.
They also analyzed five matched tumor and normal specimens for which somatic mutations had been previously identified and validated through independent whole-exome sequencing.
Overall, the authors wrote, "Cerebro had the highest overall accuracy compared to other methods," represented by a positive predictive value of 98 percent versus 34 percent to 92 percent for the other tools.
The team also looked at whether using Cerebro could increase the accuracy of mutation calling in large-scale cancer genome sequencing efforts like TCGA, reanalyzing a set of 1,368 TCGA paired tumor-normal exomes, with a focus on tumors relevant to targeted and immunotherapies.
"The total number of somatic mutations measured across the various tumor types was largely similar to previous analyses of TCGA exomes," the team wrote. However, about 10 percent of the calls detected by Cerebro were apparently missed by TCGA, whereas 16 percent of alterations identified by TCGA were not considered somatic alterations by Cerebro.
Importantly, in regard to the growing excitement around tumor mutational burden, or load, as a biomarker for immunotherapy response, the PGDx analyses found that when Cerebro was used to reanalyze individual TCGA tumors, mutational loads differed by as many as 390 fewer or 729 more alterations compared to the original calls.
Focusing further on TMB, investigators also studied paired tumor-normal exome data from two recent drug studies — one of response to anti–PD-1 therapy in 34 NSCLC patients and another of anti–CTLA-4 therapy in 64 melanoma patients.
Across the NSCLC cohort, 9,049 mutations were identified originally compared to 6,385 using Cerebro. In the melanoma cohort, Cerebro yielded 32,092 mutations compared to 25,753 in the initial study.
Among all mutations in the NSCLC set, only 48 percent were concordant between Cerebro and the original report, and in the melanoma group 62 percent of mutations matched.
The researchers performed an "in-depth characterization of mutations that were identified in the original publications that would be considered false positives using Cerebro" and found that the vast majority of such calls "could be attributed to systematic issues such as limited observations of the mutation in distinct read pairs, poor base quality at the mutation position, and inaccurate alignment."
Most interestingly, when the team looked at patient outcomes, calculating TMB based on the Cerebro results resulted in improved response prediction compared to the initial studies. Cerebro results would have reclassified four NSCLC patients from high TMB in the original study to low TMB, authors added, and these four patients had an average progression-free survival of 3.25 months.
For melanoma, the machine-learning approach would have reclassified nine low-TMB patients as high-TMB. These nine patients had an average overall survival of 40 months.
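To see why a TMB call is so sensitive to the underlying variant list, consider a toy calculation. The 10 mutations-per-megabase threshold, exome size, and mutation counts below are illustrative assumptions, not values from the study:

```python
# Illustrative sketch of how a high/low TMB classification can flip when a
# stricter variant caller removes artifactual calls. All numbers hypothetical.

def tmb_class(n_mutations, exome_mb=38.0, threshold=10.0):
    """Classify TMB as 'high' or 'low' from mutations per megabase."""
    per_mb = n_mutations / exome_mb
    return ("high" if per_mb >= threshold else "low", round(per_mb, 1))

# A sample whose original pipeline reported 500 mutations, but whose
# reanalysis keeps only 300 after artifact filtering, crosses the threshold:
print(tmb_class(500))  # ('high', 13.2)
print(tmb_class(300))  # ('low', 7.9)
```

Because the score aggregates calls across hundreds of genes, even a modest per-gene artifact rate can shift a patient from one side of the threshold to the other, which is the reclassification effect the study reports.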
The study authors also performed head-to-head comparisons of PGDx's own clinical sequencing pipeline with and without the Cerebro machine-learning approach, and compared PGDx mutation detection using Cerebro to three other assays: the Thermo Fisher Oncomine Comprehensive Assay, the Illumina TruSeq Amplicon Cancer Panel, and Memorial Sloan Kettering's MSK-IMPACT. They concluded that the Cerebro-based PGDx calls were more accurate than those of the other clinical NGS platforms.
Although Simmons highlighted the Cerebro study in light of PGDx's individual commercial goals, the various platform comparison results also reflect ongoing questions in the field about the comparability or concordance of various commercial assays.
The last year has been a boom period for broad sequencing panels in the clinic, with multiple commercial firms now selling comprehensive NGS tests in addition to numerous academic laboratories, and with the landmark decision by the Centers for Medicare & Medicaid Services this March extending coverage for FDA-cleared or -approved NGS companion diagnostics for certain advanced cancer patients.
But as commercial competition has intensified, debates have emerged in the field regarding variation or discordance between different available tests, including a recent example in which investigators from Johns Hopkins compared results of blood tests from PGDx and its competitor Guardant Health, finding "very low congruence for same patient-paired samples."
Stakeholders calling for standardization or harmonization efforts have highlighted a range of factors that can play a role in discordances between different available assays — everything from pre-analytical sample preparation steps to the final step of post-sequencing variant calling.
Simmons said that there isn't transparency about which tools commercial labs and companies are using to recheck called variants, although the publicly available tools that the study authors compared Cerebro to are most likely incorporated somewhere in the field.
Because of this, it's hard to know how much or how little the double-checking of variants may contribute to discordance like that seen in the Hopkins study, for example.
"The fact that we don't know is perhaps the bigger issue," Simmons said. "As we think about NGS becoming more mainstream in oncology, we have a need for standardization [and] there are some kitted solutions, but we also see clinical labs taking a grocery basket of various RUO chemistries and sequencers, and algorithms and pipelines, and duct taping it all together, so it's very hard to know where those discordances arise."
One aspect of the Cerebro study results that highlights the issue of concordance particularly strongly, Simmons said, is the calculation of TMB or mutational load.
"As we see this becoming increasingly more relevant as a clinical biomarker, we see the difficulty of developing and validating assays increase. It's not a single hotspot. Your 'call' around a TMB threshold is derived from variants from hundreds of genes, so that clinical application has really upped the ante, if you will," he explained.
"That's why we focus on that in the publication. Because what were previously perceived as small allowances of artifact in non-hotspot genes … when you multiply that across 500 genes and make it into a score that you are depending clinically on, you can't make those allowances anymore."