Members of the functional interpretation team of the 1000 Genomes Project have developed an informatics workflow tool dubbed Function-based Prioritization of Sequence Variants, or FunSeq, that functionally annotates cancer-associated variants found in non-coding portions of the human genome and prioritizes them in terms of the strength of their impact on the disease.
FunSeq's workflow — available in web-based and downloadable software formats — is based on a framework used by the researchers to annotate functional variation in both coding and non-coding genomic regions. However, it was developed specifically to handle deleterious variants that occur in non-coding regions, setting it apart from existing annotation tools such as SIFT, ANNOVAR, and the Variant Annotation Tool (BI 7/13/2012), which are geared more towards finding mutations in coding parts of the genome, according to Mark Gerstein, a Yale University professor of biomedical informatics and a senior author on a Science paper published last week. That paper includes a description of FunSeq and potential applications to cancer and personal genomes data annotation.
The Science paper describes the larger functional annotation framework that the team used to annotate genetic variants from the first phase of the 1000 Genomes Project. Specifically, they used structural variants, single nucleotide polymorphisms, and insertions and deletions from 1,092 individuals, along with functional data generated by the Encyclopedia of DNA Elements project, to "study patterns of selection in various functional categories, especially non-coding regulatory regions" and identify "genomic regions where variants are more likely to have strong phenotypic impact."
"This study was really an attempt to begin sub-dividing this vast non-coding fraction of the genome into different functional categories and identifying the most relevant and important ones that may harbor disease-causing variants or other variants of interest," Chris Tyler-Smith, head of the human evolution team at the Wellcome Trust Sanger Institute and a senior author on the paper, told BioInform. "We know about protein-coding genes, and we are reasonably good at predicting variants that change amino acids or splice sites and so on, but most of the genome is not made of these protein-coding regions, it’s the non-coding DNA."
To find functionally important parts of the non-coding regions, the researchers overlapped data from the 1000 Genomes Project and from ENCODE, then looked at the functional categories identified by ENCODE and tried to classify them by importance, Tyler-Smith explained.
"The criterion that we used there was whether variation accumulates in [these genomic regions] in the general population," he said. "Essentially, the regions where there's not much variation are likely to be the functionally important ones." The reasoning is that purifying, or negative, selection removes deleterious variants at the population level, so genomic regions that show little variation are likely to be ones where harmful mutations arose and were selected against over time. Looking for "signatures" of purifying selection can therefore help researchers identify which variants are likely to be functionally important.
Using this approach, the researchers were able to identify non-coding genomic regions that were particularly sensitive to genetic mutations, so-called "ultrasensitive" regions, as well as variants that were harmful because of "mechanistic effects" they had on transcription-factor binding sites; these variants are referred to as "motif-breakers." Other deleterious variants occurred in regions of the genome that had high network connectivity.
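The selection-based reasoning behind this approach can be illustrated with a small sketch. It is not the study's method: it simply counts population variants that fall inside each annotated category and ranks the categories by variants per kilobase, treating the lowest density as the most constrained. That density measure is a simplified stand-in for whatever statistic the researchers actually used, and the category names, coordinates, variant positions, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical annotated regions: (category, start, end), half-open
# coordinates on a single chromosome. Real annotation sets are far larger.
REGIONS = [
    ("tf_binding_site", 100, 600),
    ("tf_binding_site", 700, 1200),
    ("enhancer", 2000, 3000),
    ("pseudogene", 4000, 5000),
]

# Hypothetical positions of variants observed in the general population.
VARIANT_POSITIONS = [150, 2100, 2500, 2900, 4050, 4100, 4300, 4500, 4700, 4900, 6000]


def variant_density_per_kb(regions, positions):
    """Return {category: variants per kilobase} over the annotated regions."""
    total_length = defaultdict(int)
    variant_count = defaultdict(int)
    for category, start, end in regions:
        total_length[category] += end - start
        variant_count[category] += sum(1 for p in positions if start <= p < end)
    return {c: 1000.0 * variant_count[c] / total_length[c] for c in total_length}


def rank_by_constraint(densities):
    """Order categories from least to most variation (most constrained first)."""
    return sorted(densities, key=lambda c: densities[c])


densities = variant_density_per_kb(REGIONS, VARIANT_POSITIONS)
ranking = rank_by_constraint(densities)
print(ranking)
```

In this toy data the transcription-factor binding sites carry the fewest variants per kilobase and so rank first. A real analysis would also need to account for factors such as local mutation rate and sequencing coverage, which this sketch ignores.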
FunSeq was developed to provide a computational means of running this analysis and could be applied specifically to different kinds of cancers but also more broadly in cases such as rare disease mutation studies. It can also be used to analyze personal genomes data.
Having software that can analyze non-coding elements is useful because although "the deleterious effects of rare inherited variants and somatic cancer mutations in non-coding regions have not been explored in a genome-wide fashion," some studies show that these parts of the genome may contain relevant disease mutations, the researchers wrote. As an example, they highlight three recent studies, including one by researchers in Germany, Spain, and Sweden that was also published in Science earlier this year, that found non-coding driver mutations in the telomerase reverse transcriptase, or TERT, promoter in tumor types such as melanomas and gliomas.
"In light of these studies and the growing availability of whole-genome cancer sequencing, an integrated framework facilitating functional interpretation of non-coding variants would be useful," the researchers wrote.
FunSeq works by first filtering out known variants that have been identified by the 1000 Genomes Project. It then prioritizes the rest based on whether they appear in regions "under strong selection (sensitive and ultrasensitive)," whether they break TF motifs, and whether they lie in regions of high network connectivity. It scores each variant on a scale of 0 to 6, with a score of 6 indicating that the variant has "maximum deleterious effect." It can analyze multiple samples at a time and identify recurring non-coding cancer mutations.
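The filter-then-prioritize steps described above can be sketched roughly as follows. This is not FunSeq's code or its scoring scheme. The sketch assumes each variant arrives with precomputed annotation flags and awards one point per flag, so its maximum is 4 rather than FunSeq's 6, because only these four properties are named here and the actual weighting is not described. The Variant class, its field names, and the sample data are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A non-coding variant with hypothetical precomputed annotation flags."""
    chrom: str
    pos: int
    alt: str
    in_sensitive_region: bool = False
    in_ultrasensitive_region: bool = False
    breaks_tf_motif: bool = False
    in_network_hub: bool = False

    @property
    def key(self):
        return (self.chrom, self.pos, self.alt)


def filter_known(variants, known_keys):
    """Drop variants already catalogued in the population set."""
    return [v for v in variants if v.key not in known_keys]


def score(variant):
    """Toy score: one point per satisfied criterion (an assumed weighting)."""
    return sum([
        variant.in_sensitive_region,
        variant.in_ultrasensitive_region,
        variant.breaks_tf_motif,
        variant.in_network_hub,
    ])


def prioritize(variants, known_keys):
    """Filter known variants, then sort the rest by descending toy score."""
    return sorted(filter_known(variants, known_keys), key=score, reverse=True)


def recurrent(samples, known_keys):
    """Return variant keys that survive filtering in more than one sample."""
    counts = Counter()
    for variants in samples.values():
        counts.update({v.key for v in filter_known(variants, known_keys)})
    return {key for key, n in counts.items() if n > 1}


# Hypothetical inputs: one known population variant and two tumor samples.
KNOWN = {("chr1", 100, "T")}
SAMPLES = {
    "tumor_a": [
        Variant("chr1", 100, "T", breaks_tf_motif=True),
        Variant("chr5", 500, "A", in_sensitive_region=True,
                in_ultrasensitive_region=True, breaks_tf_motif=True),
        Variant("chr2", 250, "G", in_network_hub=True),
    ],
    "tumor_b": [
        Variant("chr5", 500, "A", in_sensitive_region=True,
                in_ultrasensitive_region=True, breaks_tf_motif=True),
        Variant("chr9", 900, "C"),
    ],
}

ranked_a = prioritize(SAMPLES["tumor_a"], KNOWN)
recurrent_keys = recurrent(SAMPLES, KNOWN)
print([(v.key, score(v)) for v in ranked_a])
print(recurrent_keys)
```

In this example the known variant is dropped despite its motif-breaking flag, the remaining variants in the first sample are ordered by how many criteria they meet, and the one variant shared by both samples is reported as recurrent.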
To demonstrate its efficacy, the researchers included the results of FunSeq's application to three cancer datasets: three medulloblastoma cases, 21 breast cancer cases, and 64 prostate cancer cases. In total, they identified 106 potentially harmful non-coding mutations, including 54 that occurred in sensitive regions, 65 that broke TF motifs, and 98 that targeted network hubs.
Although it was developed with cancer data analysis in mind, the researchers believe that FunSeq could be used in studies that aim to find potentially harmful mutations in the genome more broadly — Mendelian disease studies would be one example, Gerstein said. As proof of this point, the researchers used FunSeq to analyze variants from four personal genomes. "Out of 3 million SNVs we were able to identify [approximately] 15 [potentially deleterious] non-coding SNVs per individual," they wrote.
Moving forward, the developers plan to update FunSeq on a regular basis. Gerstein said that his team has begun adding more cancer-specific features to FunSeq's workflow such as "a more expanded list of non-coding annotation."