NEW YORK (GenomeWeb) – A new computational tool developed by an international team of scientists offers an improved approach for identifying pathogenic variants in non-coding regions, and could help the biomedical community detect these mutations faster and with greater accuracy, according to its developers.
The method, called Genomiser, builds on the Exomiser software, which was developed by researchers in the computational biology and bioinformatics group of the Institute for Medical Genetics and Human Genetics at Charité for identifying, annotating, filtering, and prioritizing likely disease-causing variants in coding regions. As described in a recent paper in the American Journal of Human Genetics, Genomiser uses some of the same methods as its predecessor to score small non-coding variants of less than 25 nucleotides, and combines those scores with allele frequency, regulatory sequence, and phenotypic relevance data, among other information, to predict pathogenicity.
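As a rough illustration of the kind of pre-filtering described above — keeping small variants that are rare in reference populations — here is a minimal sketch. The record fields and the frequency cutoff are invented for illustration and do not reflect Genomiser's actual defaults:

```python
# Toy variant records; "af" stands in for a population allele frequency.
variants = [
    {"id": "v1", "ref": "A", "alt": "G", "af": 0.0001},
    {"id": "v2", "ref": "A", "alt": "G" * 30, "af": 0.0001},  # indel too large
    {"id": "v3", "ref": "A", "alt": "T", "af": 0.12},         # too common
]

MAX_LEN, MAX_AF = 25, 0.01  # illustrative thresholds only

def passes(variant):
    """Keep variants under 25 nucleotides that are rare in the population."""
    size = max(len(variant["ref"]), len(variant["alt"]))
    return size < MAX_LEN and variant["af"] < MAX_AF

kept = [v["id"] for v in variants if passes(v)]
print(kept)  # only v1 survives both filters
```

In a real pipeline this filtering step happens before any pathogenicity scoring, since most of a genome's millions of variants can be excluded on frequency alone.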
According to its developers, it fills a gap for more accurate tools for identifying non-coding pathogenic mutations, particularly in the context of Mendelian disorders. Of more than 100,000 Mendelian disease-causing variants that have been added to the ClinVar database, "the vast majority affect coding sequences or conserved splice sites," the researchers wrote in AJHG. Meanwhile, genome-wide association studies have identified more than 10,000 associations between diseases and single-nucleotide variants, many of which are in non-coding portions of the genome. However, non-coding mutations represent only a "tiny minority" of published mutations in Mendelian disease cases, they wrote.
Yet gaining a better understanding of what's going on in the non-coding regions is crucial, Peter Robinson, a professor of computational biology at the Jackson Laboratory for Genomic Medicine and one of the authors on the paper, told GenomeWeb during a recent conversation. Currently, only about 25 to 40 percent of patients with suspected Mendelian diseases receive a diagnosis, possibly because current diagnostic screening methods focus largely on the coding portion of the genome and may be missing important variants located elsewhere.
Furthermore, most bioinformatics tools developed to date are designed to predict potentially pathogenic variants in coding portions of the genome. "These methods implicitly take effects on the protein into account and [so] obviously cannot work for non-coding variation," Robinson said. Some existing methods evaluate the potential for single-nucleotide variants to cause disease or affect genetic regulation, but these methods are not designed specifically for detecting non-coding variants associated with Mendelian disease, according to the developers. Programs like PolyPhen, for example, score the likelihood that a given protein-coding sequence alteration will cause disease, but there is no comparable solution for scoring non-coding mutations, Robinson told GenomeWeb.
There are existing methods that identify functional non-coding variants in general but these do not identify these variants in the context of Mendelian disease, he and his co-authors noted in AJHG. For example, two methods developed in 2014 by separate research teams from the Wellcome Trust Sanger Institute and the European Bioinformatics Institute, and from the University of Washington and HudsonAlpha Institute for Biotechnology — GWAVA and CADD, respectively — offer complementary approaches for predicting pathogenicity of variants in both coding and non-coding parts of the genome.
However, both methods are designed to "just assess deleteriousness of variants, and do not have a framework for embedding the variant assessment into an assessment of phenotype (clinical relevance) or genetic regulation [such as] predicted enhancers [or] promoters," Robinson explained to GenomeWeb in an email. "It is the combination of variant analysis and phenotype analysis that in our opinion really makes the Genomiser work well for practical cases."
Furthermore, these methods are "generic" in the sense that they do not identify specific classes of variation, according to Robinson. In contrast, Genomiser includes a machine learning method called the Regulatory Mendelian mutation (ReMM) framework, which is similar in intent to CADD and GWAVA but was trained on a specially curated collection of 453 non-coding variants known to be linked to Mendelian diseases, he said. Also, "our method used some machine learning 'tricks' to help overcome the imbalance problem of this kind of dataset," Robinson said.
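The "imbalance problem" Robinson mentions arises because a few hundred known pathogenic variants must be learned against millions of benign background positions. One common remedy is to train an ensemble in which each member sees a balanced subsample of the data; the sketch below uses toy data, a deliberately trivial classifier, and invented class sizes, and is not the actual ReMM training procedure:

```python
import random

random.seed(0)

# Toy data: each example is (feature_value, label). Positives are rare,
# mimicking a few hundred known regulatory Mendelian variants against a
# much larger benign background (all numbers here are illustrative).
positives = [(random.gauss(0.8, 0.1), 1) for _ in range(40)]
negatives = [(random.gauss(0.3, 0.1), 0) for _ in range(4000)]

def train_threshold(sample):
    """A deliberately simple 'classifier': midpoint between class means."""
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

# Ensemble idea: partition the large negative set so each member trains
# on a balanced subsample, then average the members' votes.
n_members = 10
chunk = len(negatives) // n_members
members = []
for i in range(n_members):
    neg_part = negatives[i * chunk:(i + 1) * chunk]
    balanced = positives + random.sample(neg_part, len(positives))
    members.append(train_threshold(balanced))

def score(x):
    """Fraction of ensemble members that call the position pathogenic."""
    return sum(x > t for t in members) / n_members

print(score(0.85), score(0.25))  # high-feature position vs. background
```

Training each member on a balanced view prevents the majority (benign) class from swamping the decision boundary, which is the core of most imbalance "tricks."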
The ReMM algorithm scores each position of the non-coding genome based on its predicted pathogenicity in Mendelian diseases and then ranks mutations accounting for information on phenotypes, variants in coding and non-coding regions, and existing published gene-phenotype associations. Basically, "you take the data from a genome sequence, which is about 4.5 million variants, enter that, and you enter a description of the phenotype using Human Phenotype Ontology terms ... [and] the program searches through these variants and gives you a list of candidates," Robinson explained.
According to the results of comparison studies reported in the paper, Genomiser outperformed more general variant pathogenicity scoring methods in terms of identifying Mendelian disease-associated variants. Benchmarking experiments for the software performed using more than 10,000 simulated rare disease genomes — over four million variants — showed that overall Genomiser was able to correctly prioritize the causal regulatory variants as the top candidate in 77 percent of cases when it had access to the full phenotypic profile of the disease. Its performance dipped to 68 percent when the experiment was repeated with the sort of phenotypic profile that would more likely be available in actual clinical settings, which is "pretty remarkable" for non-coding variants, according to Robinson.
Those numbers were still significantly better than those from Phen-Gen, a comparable tool for calling non-coding variants in Mendelian disease, which was able to prioritize the right variants in only 19 percent of cases with the full phenotype and only 14 percent of cases with the limited profile. "Even when looking at the top 100 variants returned by Phen-Gen, the causative one was identified in only 31 to 34 percent of samples," the researchers wrote.
Genomiser also performed slightly better than the CADD software when each had access to the full and reduced phenotypes: CADD accurately prioritized variants 71 percent of the time with the full phenotype and 61 percent of the time with the reduced phenotype, according to the paper. The researchers also tested both tools with no phenotype information at all. In that setting, Genomiser's performance dropped significantly to 23 percent, while CADD failed to rank any of the causative variants as the top-scoring hit in any sample.
Another experiment reported in the paper demonstrates Genomiser's ability to use information about variants in coding regions of the genome in its pathogenicity predictions for non-coding variants. In this case, the researchers tested Genomiser on 22 published cases where samples contained both a regulatory causal mutation and a coding or splice-site variant. According to their results, Genomiser was able to accurately rank the causal gene in 84 percent of the samples.
Genomiser can process a whole genome in around 10 minutes on a standard desktop computer, according to the paper. It is also modular, so researchers can swap in different pathogenicity prediction methodologies in place of ReMM if they choose to. It is freely available as part of the Exomiser software suite. Genomiser is one of the tools that will be used to analyze rare disease data from the UK's 100,000 Genomes Project, and it is also being used by researchers in the National Institutes of Health's Undiagnosed Diseases Program, Robinson said.
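That kind of modularity can be sketched as a narrow interface: any scoring function with the same signature can be substituted for another. The function names, signatures, and stub values below are hypothetical and do not correspond to the real software's internals:

```python
from typing import Callable, Tuple

# Hypothetical plug-in point: any function mapping a genomic position to a
# pathogenicity score in [0, 1] can be swapped in (names invented here).
Scorer = Callable[[str, int], float]

def remm_like(chrom: str, pos: int) -> float:
    return 0.9  # stub: would look up a precomputed genome-wide score

def cadd_like(chrom: str, pos: int) -> float:
    return 0.7  # stub: an alternative scorer with the same interface

def prioritize(variant: Tuple[str, int], scorer: Scorer) -> float:
    """Score one variant with whichever scorer the caller plugged in."""
    chrom, pos = variant
    return scorer(chrom, pos)

print(prioritize(("chr1", 12345), remm_like))
print(prioritize(("chr1", 12345), cadd_like))
```

Keeping the scorer behind a single function signature is what lets the rest of a prioritization pipeline stay unchanged when one method is swapped for another.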