NEW YORK (GenomeWeb) – Johns Hopkins University Medical Center researchers have developed a sequence-based computational method that they say is better able to predict the effects of variants in regulatory regions than current approaches.
As they reported today in Nature Genetics, the researchers developed a tool called deltaSVM to predict the effect that SNPs have on regulatory elements such as DNase I–hypersensitive sites, and in turn on gene expression.
Genome-wide association studies, the researchers noted, have linked a number of variants in non-coding regions to disease, but the functional impact of such variants can be difficult to tease out.
"Our computer program can comb through the genetic information from a specific cell type and predict which 'dimmer switch' mutations are most likely to alter the cell's gene activity, and therefore its function," Hopkins' Michael Beer said in a statement.
For their approach, the researchers first train their gapped k-mer support vector machine (gkm-SVM) on a set of putative regulatory regions in a specific cell type. Their deltaSVM measure then gauges how single-site sequence changes within those regulatory regions would affect gene expression in that cell type.
For instance, Beer and his colleagues note that if the gkm-SVM were trained on a set of DNase I–hypersensitive sites, then the deltaSVM measure would quantify how various sequence features affect chromatin accessibility.
Indeed, they trained a gkm-SVM on the top DNase I–hypersensitive sites in a set of human lymphoblastoid cell lines and calculated deltaSVMs across nearly a dozen 10-mers that encompass the SNVs to produce a score.
A SNP that disrupts the NF-κB binding site, they noted, reduced the strong positive contribution of a number of 10-mers to the SVM score; two nearby SNPs also affected the score, though to a lesser extent.
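The scoring step described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: it assumes that a trained gkm-SVM has already been reduced to a table of per-k-mer weights and that the deltaSVM score is the summed weight difference between the alternate and reference alleles over every k-mer spanning the variant. The function names, the toy weights, and the use of k = 3 rather than 10 are all hypothetical choices made to keep the example short.

```python
from typing import Dict


def snp_window(sequence: str, pos: int, allele: str, k: int) -> str:
    """Return the stretch of sequence covered by every k-mer that overlaps
    position `pos` (0-based), with `allele` substituted at that position."""
    start = max(0, pos - k + 1)
    end = min(len(sequence), pos + k)
    return sequence[start:pos] + allele + sequence[pos + 1:end]


def delta_svm(ref_window: str, alt_window: str,
              weights: Dict[str, float], k: int) -> float:
    """Sum weight(alt k-mer) - weight(ref k-mer) over the window.

    `weights` is a hypothetical lookup of per-k-mer SVM weights;
    k-mers missing from the table are treated as contributing 0.
    """
    if len(ref_window) != len(alt_window):
        raise ValueError("reference and alternate windows must match in length")
    total = 0.0
    for i in range(len(ref_window) - k + 1):
        ref_kmer = ref_window[i:i + k]
        alt_kmer = alt_window[i:i + k]
        total += weights.get(alt_kmer, 0.0) - weights.get(ref_kmer, 0.0)
    return total


# Toy example with made-up weights and k = 3 (the article describes 10-mers).
toy_sequence = "ACGTACG"
toy_weights = {"GTA": 1.5, "TAC": 0.5, "GAA": -0.25}
ref = snp_window(toy_sequence, 3, "T", 3)  # reference allele T at position 3
alt = snp_window(toy_sequence, 3, "A", 3)  # alternate allele A at position 3
score = delta_svm(ref, alt, toy_weights, 3)
print(ref, alt, score)
```

In this toy case the alternate allele removes two positively weighted k-mers and introduces a negatively weighted one, so the score is negative, which mirrors the article's description of a SNP reducing the positive contribution of several 10-mers.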
For the 579 SNPs within 100 basepairs of a DNase I–hypersensitivity site, the researchers found deltaSVM scores correlated with effect size of those SNPs, though correlation dropped off with increasing distance from the DNase I–hypersensitive site.
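As a rough illustration of the comparison described above, the sketch below filters SNPs by their distance from a DNase I–hypersensitive site and then correlates deltaSVM scores with measured effect sizes. The article does not name the statistic used, so the Pearson correlation here is an assumption, and the record layout, function names, and numbers are hypothetical placeholders rather than values from the study.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def correlation_within(records: List[Tuple[int, float, float]],
                       max_distance_bp: int) -> float:
    """Correlate deltaSVM with effect size for SNPs within a distance cutoff.

    Each record is (distance_to_site_bp, delta_svm, effect_size);
    this layout is a hypothetical one chosen for the example.
    """
    kept = [(d_svm, effect) for dist, d_svm, effect in records
            if dist <= max_distance_bp]
    return pearson([d for d, _ in kept], [e for _, e in kept])


# Made-up illustrative records, not data from the paper.
toy_records = [(10, 1.0, 2.0), (50, 2.0, 4.0), (90, 3.0, 6.0), (500, 4.0, -1.0)]
r_near = correlation_within(toy_records, 100)
print(r_near)
```

Raising the distance cutoff in a sketch like this is one way to see the reported pattern, in which the correlation weakens as SNPs farther from the hypersensitive site are included.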
This approach, the researchers reported, was 55.9 percent accurate, or about 10 times more accurate than other programs.
The researchers attributed the increased accuracy of their method to three of its features: training on a set of regulatory elements whose activity is specific to the cell type at hand, training on both positive and negative elements, and identifying a catalog of both positive and negative sequence features.
Beer and his colleagues also applied this approach to a large set of putative melanocyte enhancers. Using luciferase assays, they found that their deltaSVM score was strongly correlated with the differences in luciferase reporter activity between mutant and wild-type enhancer constructs.
Similarly, in mouse and human liver cells, they noted a high correspondence between deltaSVM score and the output seen in a functional assay.
The approach works best, the researchers reported, when it is trained using an appropriate cell type. For instance, they trained three separate gkm-SVMs with DNase I–hypersensitive sites in three cell types. They then compared deltaSVM values for three SNPs linked to prostate cancer, fetal hemoglobin levels, and cholesterol levels, respectively. For each, they noted that the validated SNP scored higher than other nearby ones only when deltaSVM was trained using the appropriate cell type.
"By training the computer program with the right cellular material, we can now predict the consequences of previously undecipherable regulatory sequence mutations," said Andrew McCallion, also from Hopkins.
The researchers also used their approach to uncover new potential causal SNPs. Beer and his colleagues examined some 413 SNPs linked with 11 autoimmune diseases that affect T-helper cells. After training a gkm-SVM on T-helper cell DNase I–hypersensitive sites, they scored each disease-associated locus as well as others in strong linkage disequilibrium and a set of random control SNPs. They uncovered SNPs with high deltaSVM scores for 17 disease associations, and many of these SNPs were not the lead SNPs.
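The prioritization described above can be sketched as a simple ranking. The example assumes that each SNP at a locus, the lead SNP and those in strong linkage disequilibrium with it, already has a deltaSVM score, and that candidates are ordered by the absolute value of that score. The ranking rule, the function names, and the SNP identifiers are hypothetical, and the sketch omits the comparison against random control SNPs that the researchers used.

```python
from typing import Dict, List, Tuple


def rank_locus(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order SNPs at one locus by absolute deltaSVM score, largest first."""
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


def top_candidate(scores: Dict[str, float], lead_snp: str) -> Tuple[str, bool]:
    """Return the top-ranked SNP and whether it differs from the lead SNP."""
    if not scores:
        raise ValueError("no scored SNPs at this locus")
    best_id, _ = rank_locus(scores)[0]
    return best_id, best_id != lead_snp


# Made-up identifiers and scores for a single hypothetical locus.
toy_locus = {"snp_lead": 0.4, "snp_ld1": -3.2, "snp_ld2": 1.1}
candidate, differs_from_lead = top_candidate(toy_locus, "snp_lead")
print(candidate, differs_from_lead)
```

Here the highest-scoring SNP is one in linkage disequilibrium with the lead SNP rather than the lead SNP itself, the situation the researchers reported for many of the disease associations.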
The next step, the researchers said in a statement, is to collect samples from patients with these autoimmune disorders to test whether their predictions were correct.
"If so, it should help us determine how the activity is being perturbed and how we can fix it," Beer said.