New York (GenomeWeb) – Researchers from Princeton University and the Flatiron Institute’s Center for Computational Biology have developed a deep learning approach that they say can predict the effects of genetic variants in noncoding regions on gene expression in specific tissues as well as on disease risk.
Using the method, dubbed ExPecto, the researchers predicted the effects of more than 140 million mutations in different tissues. They also identified mutations that are potentially responsible for increasing the risk of several immune diseases, including chronic hepatitis B infections and Crohn's disease. They believe the method could someday be used to help researchers identify clinically relevant disease-linked mutations in noncoding parts of the genome and develop improved therapies that can treat associated conditions. It could also provide insights into the evolutionary constraints on gene expression, which could be valuable for understanding genetic diseases and could ultimately factor into personalized medicine efforts.
Details of the method and the effort to identify variants associated with disease risk in the context of immune-related conditions were published this week in Nature Genetics. According to the paper, ExPecto uses machine learning techniques to predict tissue-specific expression from a wide regulatory region of 40-kb sequences close to promoter regions of the genome. The motivation for the study was to try to decipher the precise regulatory code found in the noncoding portion of the genome, according to the researchers.
Specifically, "the goal was to see if we could, based only on the genomic sequence, be able to predict the tissue-specific gene expression as well as the effects of any possible mutation," Olga Troyanskaya, deputy director of genomics at the Flatiron Institute’s Center for Computational Biology, a professor at Princeton University, and one of the study's co-authors, explained in an interview. This is critical "both from the perspective of evolution and understanding the evolutionary constraints on gene expression and from a perspective of being able to really in the long run be able to enable personalized medicine," she said.
Much of the work done so far to understand the regulatory code has focused on understanding the activities of specific variants, which requires access to both the mutations themselves and the corresponding gene expression information, according to Troyanskaya. But the problem with this approach is that "the vast majority of variants that are impactful or functional especially ones that are likely to be disease-causing are going to be rare … so you would not see enough examples to actually learn about them," she said. Moreover, much of the work exploring the effects of noncoding variants has been done in model organisms. And given the size of the noncoding regions in these organisms, these findings do not translate well to the much larger human genome, she added.
ExPecto builds on existing work on an epigenetic effect prediction method called DeepSEA that was described in a separate paper published in Nature Methods in 2015 by two of the authors involved in the current Nature Genetics study. As explained in that paper, DeepSEA provides a deep learning-based framework for predicting the chromatin effects of sequence alterations. Specifically, that method predicts the epigenetic state of sequences, including transcription factor binding and histone marks, and uses this information to predict the chromatin effects of sequence variants and to prioritize functional variants, including expression quantitative trait loci and disease-associated variants.
ExPecto expands on its predecessor to include a redesigned architecture and wider sequence context, among other updates, according to the Nature Genetics paper. To make predictions using the approach, the researchers first generated a series of potential regulatory sequence representations that were predictive of the epigenetic effects of variants from sequence only. Next, they integrated the predicted sequence-based epigenetic effect across 40-kb regions to create a single reference genome. Finally, they used the integrated epigenomic information to predict gene expression in 218 tissues and cell types.
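The three steps described above can be sketched in a few lines of Python. This is only an illustrative toy, not the authors' implementation: the feature predictor is a hypothetical stand-in based on GC content, and the 200-bp bin size, the exponential distance weighting, and the linear tissue model are assumptions made for the example.

```python
import math

BIN_SIZE = 200  # assumed bin width in base pairs (illustrative choice)


def predict_epigenomic_features(sequence, n_features=3):
    """Hypothetical stand-in for a sequence-to-chromatin model.

    Returns one toy feature vector per bin, derived from GC content.
    """
    bins = []
    for start in range(0, len(sequence), BIN_SIZE):
        window = sequence[start:start + BIN_SIZE]
        gc = (window.count("G") + window.count("C")) / max(len(window), 1)
        bins.append([gc ** (k + 1) for k in range(n_features)])
    return bins


def aggregate_over_region(bins, tss_bin, decay=0.05):
    """Collapse per-bin features into one vector for the whole region.

    Bins closer to the assumed transcription start site get more weight.
    """
    summary = [0.0] * len(bins[0])
    for i, features in enumerate(bins):
        weight = math.exp(-decay * abs(i - tss_bin))
        for k, value in enumerate(features):
            summary[k] += weight * value
    return summary


def predict_expression(summary, tissue_weights, bias=0.0):
    """Toy linear model: one weight vector per tissue (hypothetical values)."""
    return bias + sum(w * x for w, x in zip(tissue_weights, summary))


# A 40-kb toy sequence centered on an assumed transcription start site.
sequence = "ACGT" * 10000
bins = predict_epigenomic_features(sequence)
summary = aggregate_over_region(bins, tss_bin=len(bins) // 2)
expression = predict_expression(summary, tissue_weights=[1.0, 0.0, 0.0])
```

In this sketch, predicting expression in a different tissue would only mean swapping in a different weight vector in the last step; the sequence-derived features stay the same.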
A key feature of the method is that it does not rely on existing variant information for training; instead, it learns from chromatin patterns. This makes it possible to predict the expression effects of both common and rare variants, including ones that haven’t been observed previously. From the previous DeepSEA study, "[the] fundamental insight was that you can do this in silico chromatin prediction where … the [models] don't learn based on examples of variants; they just use one example reference genome sequence … but they are essentially learning, from across the genome, the patterns of how the regulation is encoded in the sequence," Troyanskaya said.
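One way to picture how a model trained only on reference sequence can still score an unseen variant is to run it twice, once on the reference and once on the mutated sequence, and take the difference. The sketch below assumes that scheme; the motif-counting "model" is a hypothetical placeholder, not anything from the paper.

```python
def toy_expression_model(sequence):
    """Hypothetical stand-in for a trained sequence-to-expression model.

    Scores a sequence by counting occurrences of a TATA motif.
    """
    return float(sequence.count("TATA"))


def variant_effect(model, ref_sequence, position, alt_base):
    """Predicted effect of a single-base substitution: model(alt) - model(ref)."""
    if not 0 <= position < len(ref_sequence):
        raise IndexError("position outside the sequence")
    alt_sequence = ref_sequence[:position] + alt_base + ref_sequence[position + 1:]
    return model(alt_sequence) - model(ref_sequence)


reference = "GGCTATAAGGC"
# Substituting the A at index 4 with G destroys the only TATA motif.
effect = variant_effect(toy_expression_model, reference, 4, "G")
print(effect)  # -1.0
```

Because no observed variant data enters the calculation, the same function applies equally to common variants, rare variants, and substitutions that have never been seen in a population.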
As part of their study, the researchers evaluated the accuracy of ExPecto's predictions of tissue-specific effects of variations by comparing its predictions to the eQTL data gleaned from the Genotype-Tissue Expression (GTEx) Project, which offers access to gene expression and quantitative trait loci data from 53 human tissues. Their results showed that the variants ExPecto predicted to have a marked effect on gene expression had also been identified by the GTEx studies. In fact, ExPecto correctly predicted the direction of expression change for 92 percent of the top 500 variants with the strongest effects, according to the paper.
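The direction-of-change comparison can be illustrated with a short calculation. The sketch below assumes that variants are ranked by the magnitude of their predicted effect and that agreement means the predicted and observed effects share a sign; the function name and the numbers are made up for illustration and are not taken from the paper or from GTEx.

```python
def direction_concordance(predicted, observed, top_n):
    """Fraction of the top_n variants (by |predicted effect|) whose
    predicted and observed effects have the same sign."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    ranked = sorted(range(len(predicted)),
                    key=lambda i: abs(predicted[i]),
                    reverse=True)[:top_n]
    if not ranked:
        raise ValueError("no variants to evaluate")
    agree = sum(1 for i in ranked if predicted[i] * observed[i] > 0)
    return agree / len(ranked)


# Made-up effect sizes for four variants.
predicted = [2.0, -1.5, 0.1, -3.0]
observed = [1.0, 0.5, -0.2, -0.7]
concordance = direction_concordance(predicted, observed, top_n=3)
```

Here the three strongest predictions are the fourth, first, and second variants, and two of those three match the observed direction.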
In another analysis described in the paper, the researchers used ExPecto to prioritize novel, potentially causal variants associated with four immune-related diseases: Crohn’s disease, ulcerative colitis, Behçet’s disease, and hepatitis B virus (HBV) infection. ExPecto predicted the effects of new mutations that in some cases have not been reported in existing studies. For example, the researchers prioritized and validated three SNPs, based on their effects on the expression of genes involved in the immune response, that they believe are more promising potential contributors to Crohn’s disease, chronic HBV infection, and Behçet's disease than variants proposed by previous genome-wide association studies. In all seven GWAS analyzed as part of the study, none of the lead SNPs identified showed “significant differences in transcriptional regulatory activity,” they wrote.
The developers have made ExPecto's predictions freely available in an online resource called HumanBase, which they describe as a one-stop-shop for biological and biomedical researchers interested in data-driven predictions of gene expression, function, regulation, and interactions in humans, particularly in the context of specific cell types or tissues and human disease. Users can type in a gene and pull up a list of the potential mutations that could affect that gene’s expression in any of 218 tissues and cell types.
ExPecto's developers believe that besides targeted therapy development, the tool can be useful for studying the evolutionary consequences of mutations. For example, they found that mutations were less likely to affect genes expressed throughout the human body than genes specialized for one tissue type. They suspect that this may be related to the robustness of genes expressed around the body – given their widespread effect, mutations in these types of genes could be detrimental to the organism. However, additional studies are needed, they said.
Moving forward, the team plans to make improvements to ExPecto to increase the quality of its predictions. The team will also continue to use the software in studies focused on the evolutionary consequences of mutations, as well as in assessing disease risk.