NEW YORK (GenomeWeb) – Researchers from the Informatics and Biocomputing arm of the Ontario Institute for Cancer Research and elsewhere have published an algorithm that they claim can distinguish between somatic and germline single nucleotide variants in next-generation sequencing data from tumor tissue in the absence of normal controls.
According to the OICR researchers, when presented with data from roughly 1,600 samples across six different cancer types, the so-called Identification of Somatic Mutations Without Matching Normal Tissues, or ISOWN, software correctly classified between 95 and 98 percent of somatic mutations, with F1-measures ranging from 75.9 to 98.6 percent. They published their results in Genome Medicine last week.
Irina Kalatskaya, a project manager and computational biologist in OICR's Informatics and Biocomputing arm and the lead author of the paper, and her colleagues began developing ISOWN roughly four years ago. At the time, they were looking to analyze 1,500 samples from the Tamoxifen and Exemestane Adjuvant Multinational (TEAM) clinical trial, which compared the effects of the drugs in women with hormone-sensitive early breast cancer in order to identify a biomarker that could predict which patients would respond well to the treatment.
"One of the challenges of this project was that we got access to the patient FFPE tissues but matched normals were not available due to a very strict IRB [so] basically we had to find a way to call somatic mutations without a matched normal," she explained. "Initially we thought we [could] use publicly available datasets from dbSNP and COSMIC … but we realized that the publicly available resources are not sufficient to just call somatic mutations because the false-positive rate is still really high. So that's how we ended up designing ISOWN."
ISOWN uses supervised machine learning methods to classify mutations in tumor data as germline or somatic. "It's based on 10 features," Kalatskaya said. "Half of the features are from publicly available resources like COSMIC, dbSNP, ExAC, Mutation Assessor, and Polyphen, and half of the features are based on the internal properties of the data like variant allele frequency, sample frequency, sequence context, and some others."
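As a rough illustration of what "features" means in this context, the sketch below encodes a single variant as a numeric vector of the kind a supervised classifier could consume. The class name, field names, encodings, and example values are all hypothetical; the article names only some of the 10 features and does not describe how ISOWN represents them internally.

```python
from dataclasses import dataclass, astuple
from typing import List


@dataclass
class VariantFeatures:
    # Hypothetical fields, loosely based on the feature sources named in the article.
    in_cosmic: bool              # variant is listed in COSMIC
    in_dbsnp: bool               # variant is listed in dbSNP
    exac_allele_freq: float      # population allele frequency from ExAC (0.0 if absent)
    variant_allele_freq: float   # fraction of tumor reads supporting the variant
    sample_freq: float           # fraction of cohort samples carrying the variant

    def to_vector(self) -> List[float]:
        # Booleans become 1.0 / 0.0 so every feature is numeric.
        return [float(x) for x in astuple(self)]


# Made-up example values, for illustration only.
example = VariantFeatures(True, False, 0.0, 0.31, 0.02)
print(example.to_vector())  # [1.0, 0.0, 0.0, 0.31, 0.02]
```

A training set would pair each such vector with a known label, germline or somatic, taken from samples where a matched normal was available.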
To train and validate the ISOWN algorithm, the researchers downloaded several VCF datasets from The Cancer Genome Atlas from patients with different forms of cancer, including kidney renal clear cell carcinoma, pancreatic carcinoma, and breast invasive carcinoma. They also downloaded BAM files from patients with esophageal adenocarcinoma from dbGaP, extracted the raw reads, and called and annotated single nucleotide somatic variants from the files.
The most significant finding is that classifying tumor mutations in the absence of data from a matched normal is doable, Kalatskaya said. "If you are in a situation [where] you are using FFPE samples from bioarchives or pathology archives where the normal tissue is not available … it's doable," she said. "And the accuracy is comparable to the traditional gold standard for tumor-normal somatic mutation pipelines."
Specifically, seven ISOWN classifiers were tested on 1,000 training datasets, each containing 700 random somatic mutations and 700 germline variants, and every classifier had a false-positive rate of less than seven percent for most of the cancers studied, according to the paper. In another test, ISOWN correctly classified silent coding mutations in tumor types with high and moderate mutational loads, though its error rates were high in tumors with low mutational loads.
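The balanced sampling and the false-positive rate described above can be sketched as follows. This is a minimal illustration, not the paper's code: the function names are hypothetical, integer IDs stand in for real variants, and it assumes that a "false positive" means a germline variant classified as somatic, which the article does not spell out.

```python
import random


def balanced_training_set(somatic, germline, n_per_class=700, seed=None):
    """Draw n_per_class variants from each class without replacement and label them."""
    rng = random.Random(seed)
    picked = [(v, "somatic") for v in rng.sample(somatic, n_per_class)]
    picked += [(v, "germline") for v in rng.sample(germline, n_per_class)]
    rng.shuffle(picked)
    return picked


def false_positive_rate(true_labels, predicted_labels, positive="somatic"):
    """FP / (FP + TN): the share of non-positive variants predicted as positive."""
    fp = tn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth != positive:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return fp / (fp + tn) if (fp + tn) else 0.0


# Toy pools: integer IDs stand in for real variants.
somatic_pool = list(range(0, 5000))
germline_pool = list(range(5000, 20000))
train = balanced_training_set(somatic_pool, germline_pool, seed=0)
print(len(train))  # 1400

# Made-up predictions: 6 of 100 germline variants are miscalled as somatic.
truth = ["germline"] * 100
preds = ["somatic"] * 6 + ["germline"] * 94
print(false_positive_rate(truth, preds))  # 0.06
```

Repeating the draw with different seeds would yield many such balanced sets, in the spirit of the 1,000 datasets mentioned above.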
The researchers also reported that when they tested ISOWN on cancer cell line data, it had an average recall across datasets of 85 percent and a precision of 63 percent. In another test using data from the TEAM trial, they found no significant differences in gene mutation frequencies between ISOWN-processed samples and previously published breast cancer mutation frequencies, according to the paper.
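For readers comparing the cell line figures with the F1-measures quoted earlier for tumor data, the F1-measure is the harmonic mean of precision and recall. The snippet below applies that standard formula to the two cell line numbers. The result is simple arithmetic for illustration, not a figure reported in the article, and an F1 computed from averaged precision and recall generally differs from the average of per-dataset F1 values.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Cell line figures from the article: 63 percent precision, 85 percent recall.
print(round(f1_score(0.63, 0.85), 3))  # 0.724
```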
Moving forward, ISOWN's developers plan to develop a second iteration of the software. The current iteration of ISOWN only classifies single nucleotide variants, so the next big step for the group will be to develop a version of the pipeline that can classify indels. Details of that pipeline will likely be published in a second paper.
Meanwhile, the current iteration of ISOWN is ready for general use. "It is ready for fresh frozen sample, FFPE sample, [and] we tested it on cell lines," Kalatskaya said. Currently at OICR, "for projects where we don't have a matching normal or we don't have a matching normal for all samples, we do use this pipeline." Potential use cases include retrospective studies using data from past clinical trials or pathology archives that did not collect matched normal tissue from patients at the outset, or data from cancer cell lines, which may have no information on donors' normal genes.
Her group also hopes to partner with hospitals to test ISOWN in diagnostic settings. At least one study has argued that sequencing both tumor and normal tissue from the same patient is essential for accurately identifying clinically actionable tumor mutations. That study, published two years ago by researchers at Johns Hopkins in Science Translational Medicine, found that three quarters of patients whose tumors they sequenced had alterations in actionable genes, and three percent carried previously undetected cancer predisposition mutations. But analyzing only the tumor data without a matched normal resulted in many false-positive alterations, including in actionable genes.
Kalatskaya and her colleagues do not disagree. They only recommend using ISOWN in cases where it is simply not possible to obtain a matched normal sample for fiscal or other reasons. "If you do have access to normal tissue and you do have resources and finances to sequence normal tissue, please use the conventional method," she said. However, "if you are in a situation where it's not available, then you have a solution."