COLD SPRING HARBOR, NY (GenomeWeb) – Johns Hopkins University's Alexis Battle has developed a model that draws on both whole-genome sequencing and RNA-sequencing data to identify and prioritize rare non-coding variants that likely have functional effects.
Battle and her team turned to data from the Genotype-Tissue Expression (GTEx) Project to search for cellular-level disruptions that might indicate regulatory variation, as she told the audience Wednesday at the Biology of Genomes meeting here. In particular, they focused on instances of extreme changes in gene expression to home in on the effects of regulatory variation.
Predicting the impact of non-coding variants is still a challenge despite their abundance and importance, Battle said. "It's still a really hard problem," she added.
The GTEx cohort Battle drew upon includes whole-genome sequencing data from nearly 150 donors as well as RNA-seq data from 54 tissues from some 520 donors. She noted, though, that while the sample set included multiple tissue samples per person, not all 54 tissue types were available for each donor.
From this, they searched for transcriptional outliers. The availability of data from multiple tissues enabled them to be more confident that they had indeed uncovered extreme expression outliers, she noted. Related tissues, such as ones all derived from the brain or all from the digestive system, shared a number of such outliers.
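To illustrate how data from several tissues can make an outlier call more robust, here is a minimal sketch for a single gene. It is not Battle's actual pipeline: the median Z-score summary, the function name median_z_outliers, and the threshold and min_tissues values are all assumptions chosen for illustration.

```python
import numpy as np

def median_z_outliers(expr, threshold=2.0, min_tissues=5):
    """Flag individuals whose expression of one gene is extreme across tissues.

    expr: array of shape (individuals, tissues); NaN marks a tissue that was
    not sampled for that donor. threshold and min_tissues are illustrative
    defaults, not values reported in the talk.
    """
    expr = np.asarray(expr, dtype=float)
    # Standardize each tissue across individuals, ignoring missing samples.
    mean = np.nanmean(expr, axis=0)
    std = np.nanstd(expr, axis=0, ddof=1)
    z = (expr - mean) / std
    # Summarize each individual by the median Z-score over observed tissues.
    n_observed = np.sum(~np.isnan(z), axis=1)
    median_z = np.nanmedian(z, axis=1)
    # Require enough tissues so a single noisy sample cannot drive the call.
    is_outlier = (np.abs(median_z) >= threshold) & (n_observed >= min_tissues)
    return median_z, is_outlier
```

Requiring a minimum number of observed tissues reflects the point that not every tissue type was available for every donor. A tissue with zero variance would need separate handling, which this sketch omits.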
But the question, she said, is whether that is driven by genetic variation.
Individuals with rare variants were enriched for outlier gene expression, she said, especially at genes located within some 10 kilobases of the variant. The enrichment was strong for rare SNVs, indels, and structural variants alike. The researchers also noted enrichment near transcription start sites and splice sites.
This enrichment effect became stronger when the researchers were even more stringent about what they called an outlier.
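One simple way to express such an enrichment is as a ratio: the fraction of outliers carrying a nearby rare variant divided by the same fraction among non-outliers, recomputed at increasingly strict cutoffs. The sketch below assumes that definition; the function name rare_variant_enrichment and the cutoff values are hypothetical, and the talk did not specify the statistic used.

```python
import numpy as np

def rare_variant_enrichment(abs_z, has_rare_variant, thresholds=(2.0, 3.0, 4.0)):
    """Ratio of the rare-variant carrier rate in outliers vs. non-outliers.

    abs_z: absolute expression Z-score for each individual-gene pair.
    has_rare_variant: True where that individual carries a rare variant
    near that gene. thresholds are illustrative cutoffs only.
    """
    abs_z = np.asarray(abs_z, dtype=float)
    carrier = np.asarray(has_rare_variant, dtype=bool)
    result = {}
    for t in thresholds:
        outlier = abs_z >= t
        # The ratio is undefined if either group is empty.
        if outlier.sum() == 0 or (~outlier).sum() == 0:
            result[t] = float("nan")
            continue
        rate_outliers = carrier[outlier].mean()
        rate_others = carrier[~outlier].mean()
        result[t] = float(rate_outliers / rate_others) if rate_others > 0 else float("inf")
    return result
```

A ratio that rises as the cutoff tightens would match the pattern Battle described, though a real analysis would also need confidence intervals, since strict cutoffs leave few outliers.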
Battle noted that rare variants are more enriched among outliers with unusually low expression than among those with unusually high expression. She suggested that could be because it is easier to disrupt a gene than to induce its overexpression.
She and her team also developed a probabilistic model that incorporates both whole-genome sequencing data and RNA-seq data from the same person to sniff out likely functional rare regulatory variants. This integrated model prioritizes variants that are supported by both expression and genomic features. It doesn't rely on labeled variants or training, she noted.
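The general idea of combining the two data types probabilistically can be sketched with Bayes' rule: genomic features set a prior probability that a variant is functional, and the carrier's expression status updates it. This is a toy illustration, not the team's model. The function posterior_functional and every parameter value in it are assumptions, fixed by hand here rather than estimated from data.

```python
import math

def posterior_functional(genomic_score, is_expression_outlier,
                         weight=1.5, bias=-3.0,
                         p_outlier_if_functional=0.6,
                         p_outlier_if_not=0.02):
    """Posterior probability that a rare variant is functional.

    genomic_score: a single summary of genomic annotations (hypothetical).
    All numeric parameters are made-up illustrative values.
    """
    # Prior from genomic features alone, via a logistic function.
    prior = 1.0 / (1.0 + math.exp(-(weight * genomic_score + bias)))
    # Likelihood of the observed expression status under each hypothesis.
    if is_expression_outlier:
        like_functional = p_outlier_if_functional
        like_not = p_outlier_if_not
    else:
        like_functional = 1.0 - p_outlier_if_functional
        like_not = 1.0 - p_outlier_if_not
    # Bayes' rule: combine the prior with the expression evidence.
    numerator = like_functional * prior
    return numerator / (numerator + like_not * (1.0 - prior))
```

With these made-up numbers, a variant with a modest genomic score gets a much higher posterior when its carrier is an expression outlier, and a lower one when the carrier's expression is unremarkable, which is the sense in which both sources of evidence must agree.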
The researchers evaluated the accuracy of the approach by examining variants that were found in only two people from the cohort. For each such variant, one person was treated as the observed case and the other as the held-out label, so that predictions made from whole-genome sequencing and expression data in the first person could be evaluated against the data from the second.
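An evaluation of this kind could be scored as follows, assuming each shared variant yields a pair: a score computed from the first carrier and a yes-or-no outlier label taken from the second. The area under the ROC curve used here is a common summary metric but is an assumption, as the talk did not name one, and held_out_auc is a hypothetical helper.

```python
def held_out_auc(pairs):
    """Area under the ROC curve for held-out carrier pairs.

    pairs: list of (score_from_first_person, second_person_is_outlier).
    """
    positives = [score for score, label in pairs if label]
    negatives = [score for score, label in pairs if not label]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    # AUC equals the probability that a random positive outranks a random
    # negative; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

Running the same function on scores from a genome-only model and from an integrated model would give two numbers that can be compared directly.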
The integrated model gives a large and significant boost in performance over a genome-only model, Battle said. "It's doing what you might do by hand on your own, but doing it in a probabilistic framework," she added.