NEW YORK — Researchers have begun to tease out previously hidden somatic mutations in non-unique parts of the genome of cancer samples, including coding regions and regulatory elements.
Over the course of evolution, parts of the human genome have been duplicated and rearranged, leaving different sections highly similar to one another. Because many current sequencing tools rely on short-read technology, telling those similar sections apart in order to call mutations can be tricky. According to Maxime Tarabichi, a postdoc at the Francis Crick Institute, about 10 percent of the human genome is non-unique at the scale of a short read.
"To be able to assign the mutation to a specific locus — for example, a cancer gene's coding sequence — [common mutation calling algorithms] scan the sequences after they have been aligned to the genome, position by position at each of the three billion genome loci. And they discard any short sequence that aligns ambiguously at any given locus, together with all potential mutations they might carry," she wrote in an email. "This means non-unique regions are recurrent blind spots for the identification of mutations."
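The blind spot Tarabichi describes can be illustrated with a toy filter. This is a minimal sketch, not the behavior of any specific mutation caller: the `Read` class, the locus names, and the `min_mapq` threshold are all hypothetical, and the only convention assumed is that many aligners report a mapping quality of 0 for reads that fit equally well in more than one place.

```python
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    locus: str         # where the aligner placed the read (hypothetical labels)
    mapq: int          # mapping quality; 0 is assumed to mean an ambiguous placement
    has_variant: bool  # whether the read carries a candidate mutation


def callable_reads(reads, min_mapq=1):
    """Keep only reads whose placement is confident enough to call mutations from."""
    return [r for r in reads if r.mapq >= min_mapq]


reads = [
    Read("r1", "unique_gene", 60, True),
    Read("r2", "duplicated_gene_copy_A", 0, True),  # ambiguous: fits copy A or copy B
    Read("r3", "unique_gene", 60, False),
]

kept = callable_reads(reads)

# Variants carried by discarded reads never reach the caller at all.
discarded_variants = [r.name for r in reads if r not in kept and r.has_variant]
print(discarded_variants)
```

In this toy case the variant on `r2` is lost along with the read itself, which is the recurring loss in non-unique regions that the quote refers to.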
As they reported on Monday in Nature Biotechnology, she and her colleagues developed a list of regions that are known to have high sequence similarity, a so-called "genetic thesaurus," and an algorithm that used that thesaurus to uncover mutations within those non-unique regions. When they applied their approach to a set of pan-cancer genomes, the researchers uncovered hidden mutations in about 1,700 coding sequences and in thousands of regulatory elements. These mutations affected known cancer genes as well as immunoglobulins and other highly mutated gene families.
The researchers trained a machine learning approach to use their genetic thesaurus to annotate mutations carried by short reads that map ambiguously. Tarabichi said that for most mutations, non-ambiguous anchor points in the data can be used to map the mutation back to its location. Even when the exact location remains unknown, the researchers can still begin to characterize the mutation.
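The idea of a thesaurus lookup combined with an anchor can be sketched as follows. This is a simplified illustration under stated assumptions, not the researchers' algorithm: the `THESAURUS` contents, the region strings, the `max_distance` window, and the choice of a uniquely mapped position as the "anchor" are all hypothetical.

```python
# Hypothetical thesaurus: each region maps to the other regions it closely resembles.
THESAURUS = {
    "chr1:1000-1200": {"chr5:7000-7200"},
    "chr5:7000-7200": {"chr1:1000-1200"},
}


def candidate_regions(region):
    """All places a mutation seen in `region` could actually come from."""
    return {region} | THESAURUS.get(region, set())


def parse(region):
    """Split 'chrom:start-end' into (chrom, start, end)."""
    chrom, span = region.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def resolve(region, anchor=None, max_distance=1000):
    """Narrow the candidates using an anchor (chrom, pos) that maps uniquely.

    Without a usable anchor, the full candidate set is returned, so the
    mutation can still be annotated even though its exact locus is unknown.
    """
    candidates = candidate_regions(region)
    if anchor is None:
        return sorted(candidates)
    a_chrom, a_pos = anchor
    near = []
    for c in candidates:
        chrom, start, end = parse(c)
        if chrom == a_chrom and start - max_distance <= a_pos <= end + max_distance:
            near.append(c)
    return sorted(near) if near else sorted(candidates)


print(resolve("chr1:1000-1200", anchor=("chr5", 7500)))
print(resolve("chr1:1000-1200"))
```

With the anchor near the chr5 copy, only that candidate survives; with no anchor, both similar regions are reported together.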
They applied this approach to a set of 2,658 cancers from the Pan-Cancer Analysis of Whole Genomes dataset, uncovering mutations in 1,744 coding sequences and thousands more in regulatory elements. The researchers estimated that their approach had a median false discovery rate per sample of 7 percent and a median false negative rate per sample of 9 percent. Using an orthogonal short-read and linked-read sequencing approach on an additional cancer sample, they reported a validation rate of more than 80 percent.
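For readers unfamiliar with the two error rates, the sketch below shows how per-sample values and their medians are conventionally computed. It uses the standard definitions (false discoveries as a share of all calls, false negatives as a share of all true mutations); the counts are invented for illustration and are not taken from the study.

```python
from statistics import median

# Illustrative per-sample counts only: (true positives, false positives, false negatives).
samples = [
    (80, 20, 10),
    (90, 10, 10),
    (95, 5, 5),
]


def false_discovery_rate(tp, fp):
    """Share of called mutations that are not real."""
    return fp / (tp + fp)


def false_negative_rate(tp, fn):
    """Share of real mutations that were missed."""
    return fn / (tp + fn)


fdrs = [false_discovery_rate(tp, fp) for tp, fp, _ in samples]
fnrs = [false_negative_rate(tp, fn) for tp, _, fn in samples]

median_fdr = median(fdrs)
median_fnr = median(fnrs)
print(f"median FDR per sample: {median_fdr:.2%}")
print(f"median FNR per sample: {median_fnr:.2%}")
```

Reporting the median across samples, rather than a pooled rate, keeps a handful of heavily mutated tumors from dominating the summary.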
In cancers, most mutations are passenger mutations that have no effect on tumor growth or disease progression, and the researchers noted that this is also the case for most of the mutations they identified through the genetic thesaurus approach. But some mutations they identified appear to affect the protein-coding sequences of known cancer genes.
"Intriguingly, we found many mutations affecting the protein sequence of bona fide cancer genes. We also detected excess of protein-changing mutations in new candidate cancer genes, many in members of gene families with high sequence similarity," Tarabichi said. "Some of these genes had already been implicated in cancer, but according to classical mutation callers, their coding sequences seemed to never mutate."
They uncovered, for instance, recurrent mutations in PIK3CA and KMT2C, as well as mutations affecting the breast cancer-linked gene ANKRD30A and the TPTE gene, which is linked to the PTEN pathway. Other mutations affected regulatory regions, including the promoters of immunoglobulin family members.
Tarabichi noted that long-read sequencing approaches will also begin to address the mapping issues but added that most of the large genomic databases to date have been built from short-read sequences, to which the researchers can now apply their tool.