NEW YORK (GenomeWeb) – A team led by researchers from the University of Toronto has published a paper in Science describing a computational model to predict how much of an impact genetic variants have on RNA splicing and demonstrating how they used the method to identify variant-driven splice alterations involved in neurological disorders and cancer.
According to the paper, the researchers used machine learning — deep learning techniques, specifically — to derive a computational model that "extracts sequence features, or cis-elements" from input sequences and uses these to predict which transcripts will be generated in a given cell type.
"Regulatory cis-elements comprise a significant portion of the human genome, and form the ‘regulatory code’ that directs gene expression, depending on cellular conditions," the researchers wrote. Thus, computational models that can "read the code for any gene and predict relative concentrations of transcripts" could conceivably be used to "identify variants that lead to misregulated gene expression and human disease."
"In order to understand how mutation causes disease you need to understand how the mutation causes cellular biochemistry to go wrong," Brendan Frey, a professor in UT's departments of electrical and computer engineering, medical research, and computer science and one of the paper's authors, explained to GenomeWeb.
Instead of trying to accurately model the biochemical state of the cell, "[we] used machine learning to essentially learn a computational model" that provides an output that "mimics the biochemistry of the cell," Frey added. In other words, when the model is fed a bit of DNA sequence, it spits out a prediction for which transcripts will be generated from that sequence, he said, and it does so in a cell type-specific manner, meaning it can predict what transcripts would be seen specifically in brain tissue versus in heart tissue, for example.
Frey and his students used their method to explore alternative splicing in polyadenylated mRNA. For each exon, Frey explained, they measured the fraction of transcripts generated from the gene that contained the exon in question, a quantity the paper refers to as the percent inclusion level, or PSI value. It's an important biochemical measurement to look at, he explained, because alternative splicing can include or exclude an exon in the transcript, which could significantly change the final translated protein.
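As a rough illustration of the measurement Frey describes, the PSI value for an exon can be sketched as the share of a gene's transcripts that include that exon. The function name and the transcript counts below are hypothetical and are not taken from the paper; real pipelines estimate these quantities from sequencing data in more involved ways.

```python
def percent_inclusion(transcripts_with_exon: int, transcripts_without_exon: int) -> float:
    """Return the percentage of a gene's transcripts that contain the exon.

    Hypothetical helper for illustration only; the counts are assumed to be
    already-tallied transcripts that include or skip the exon in question.
    """
    total = transcripts_with_exon + transcripts_without_exon
    if total == 0:
        raise ValueError("no transcripts observed for this gene")
    return 100.0 * transcripts_with_exon / total


# Made-up counts: 80 transcripts include the exon, 20 skip it.
print(percent_inclusion(80, 20))  # 80.0
```

An exon that is always included would score 100, and one that is always skipped would score 0.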
The input to the machine learning algorithm is the actual exon sequence, proximal DNA sequences, and information about source tissue, Frey said. To explore how mutations affect splicing, researchers feed the relevant reference genome sequence and the mutated version of the sequence into the model and then look at the difference in the predicted PSI values for both cases. "So if our computational model says for the reference genome DNA sequence [that] the PSI level should be 80 percent, but for this mutated DNA sequence it's going to drop down to 20 percent, then that gives us an indication that that mutation could be problematic," Frey said.
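The comparison Frey describes can be sketched as follows: score the reference and mutated sequences with the same predictor, for the same tissue, and take the difference. This is a minimal sketch under stated assumptions, not the researchers' implementation. The predictor here is a hypothetical lookup table standing in for the trained model, the sequences are made up, and the 5-point threshold is an arbitrary illustrative choice.

```python
from typing import Callable

# A predictor maps (DNA sequence, tissue name) to a predicted PSI in percent.
PsiPredictor = Callable[[str, str], float]


def delta_psi(predict: PsiPredictor, reference_seq: str,
              mutated_seq: str, tissue: str) -> float:
    """Predicted PSI of the mutated sequence minus that of the reference."""
    return predict(mutated_seq, tissue) - predict(reference_seq, tissue)


def is_candidate(predict: PsiPredictor, reference_seq: str, mutated_seq: str,
                 tissue: str, threshold: float = 5.0) -> bool:
    """Flag a variant whose predicted PSI shift meets an assumed threshold."""
    return abs(delta_psi(predict, reference_seq, mutated_seq, tissue)) >= threshold


# Hypothetical stand-in for the trained model: a fixed lookup table.
# It ignores the tissue argument, which a real model would not.
_TOY_PSI = {"ACGTAG": 80.0, "ACGTAA": 20.0}


def toy_predict(sequence: str, tissue: str) -> float:
    return _TOY_PSI[sequence]


change = delta_psi(toy_predict, "ACGTAG", "ACGTAA", "brain")
print(change)  # -60.0
print(is_candidate(toy_predict, "ACGTAG", "ACGTAA", "brain"))  # True
```

The toy numbers mirror Frey's example of a drop from 80 percent to 20 percent, which gives a shift of 60 points.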
According to the Science paper, the researchers trained their model on more than 1,000 sequence features that were extracted from each of more than 10,000 exons with evidence of alternative splicing. They used the method to score the effects on splicing regulation of more than 650,000 single nucleotide variants — at least 100,000 of which had been linked to disease by previous studies — that they mapped to intronic and exonic sequences which contained regulatory code for about 120,000 exons in approximately 16,000 genes.
The model, they said, revealed "widespread processes" by which these variants cause abnormal splicing. They found, for example, that intronic disease mutations that were more than 30 nucleotides from a splice site alter splicing nine times more often than common SNVs in the same regions do. They also found, according to the paper, that missense SNVs that have some effect on protein function are nearly six times more likely to alter splicing. The team also reports that their method scored disease-causing variants with "strong experimental evidence … substantially higher than those with weak or indirect evidence."
The paper also details the researchers' efforts to apply their computational model to spinal muscular atrophy, colorectal cancer, and autism spectrum disorder, providing proof of the wide applicability of the method. In fact, a key finding of the study came from applying the model to whole-genome sequence data generated from brain samples from five individuals with ASD and four controls. The researchers looked specifically at genes that are highly expressed in the brain, which are more frequently implicated in the disorder. They found that the genes that the model predicted would be misregulated in ASD cases did indeed have higher expression in the data from the ASD individuals than in the controls. Further analysis highlighted 19 genes as likely culprits in the development of ASD, and of this subset of genes, at least six were novel candidates.
The researchers also claim that their method does better than existing tools and approaches. It improves on computational methods that, for example, rely on functional annotation data, and those that are trained using data on existing disease annotations. That latter comparison is especially significant, Frey noted, because his team trained its system on data from a reference genome and healthy human tissues "and yet we can use it to find [harmful] mutations in autism patients and cancer patients and other patients."
The researchers included in the paper the results of tests that show that their method is 25 times more sensitive than functional annotation-based methods, and detects 35.9 percent of disease variants compared to 1.4 percent using those methods. The researchers also claim that their method is nearly 10 times more sensitive in each of several sequence regions compared to methods trained using existing disease annotation. Moreover, their approach is able to detect variants without relying directly on allele frequencies, unlike methods such as genome-wide association studies and expression-based quantitative trait loci, the researchers wrote.
The technique opens the door to large-scale examination of mutations in regions of the genome that researchers haven’t really been able to look at before, Frey said. Deep learning approaches support a more in-depth look at data than standard support vector machines, which have been used in many genomics and biology studies that employ machine learning. Deep learning uses neural networks, which support "multiple layers of feature detectors, non-linear combinations of features, [and] logical combinations of features," he said. "These multiple levels of analysis enable you to discover complex relationships between inputs and then relate them to the output, [and] that’s turning out to be crucial for doing a good job, so deep learning has really pushed our work forward."
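To make the idea of layered, non-linear feature detectors concrete, here is a toy forward pass through a network with one hidden layer. It is only a sketch of the general technique. The layer sizes, weights, and input features are arbitrary assumptions, and nothing here reflects the architecture or parameters of the model in the paper.

```python
import math


def relu(x: float) -> float:
    """Non-linear activation: passes positive values, zeroes out the rest."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One layer: each output is an activated weighted sum of all inputs."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def predict_inclusion(features, w1, b1, w2, b2) -> float:
    """Hidden layer of feature detectors, then a single output in (0, 1)."""
    hidden = dense(features, w1, b1, relu)
    (output,) = dense(hidden, w2, b2, sigmoid)
    return output


# Arbitrary illustrative numbers: three input features, two hidden units.
features = [1.0, 0.0, 0.5]
w1 = [[0.5, -1.0, 2.0], [-1.5, 0.5, 1.0]]
b1 = [0.0, 0.1]
w2 = [[1.0, -2.0]]
b2 = [0.0]

fraction = predict_inclusion(features, w1, b1, w2, b2)
print(round(100 * fraction, 1))  # 81.8
```

In this toy setup the second hidden unit is switched off by the ReLU, so only the first detector contributes to the output. That kind of input-dependent combination of features is what a single linear scoring function cannot express.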
The researchers believe that their approach could be applied to any number of diseases to improve diagnostics and even to identify non-disease traits that differ between individuals. Frey told GenomeWeb that his team is partnering with several hospitals to apply the method to look for mutations that affect splicing in a number of disease areas. For example, they are working with researchers at the Ontario Institute for Cancer Research to use the method to identify new genetic markers of breast cancer, he said. They also plan to scale up the autism study they reported in the paper to look at data from more than 10,000 autism genomes, he added.
Frey's team is also looking to use its method to explore mutations that affect other important regulatory processes such as transcription, polyadenylation, and mRNA stabilization, he said. "These processes influence transcript levels in a highly integrated manner within the cell, so modeling them jointly should lead to more accurate predictions," the researchers wrote.