NEW YORK (GenomeWeb News) – By considering sequence data for individuals assessed through the 1000 Genomes Project, a team led by researchers from Yale University and the Wellcome Trust Sanger Institute came up with a computational method for prioritizing potential disease culprits, including those in non-protein-coding parts of the genome.
As they reported online today in Science, the researchers sifted through SNP profiles in coding and non-coding sequences in 1,092 genomes, focusing on functionally annotated areas. With the help of information from the ENCODE project, mutation databases, and other data sources, they homed in on sequences that seem especially sensitive to change.
The group tapped these mutation-sensitive sites to develop an approach called FunSeq, which proved useful for uncovering new apparent driver mutations using sequences from around 90 cancer genomes. These included almost 100 driver candidates in non-coding sequences, according to the study's authors, who noted that FunSeq is expected to help in tracking down crucial non-coding variants in other disease types as well.
"Our technique allows scientists to focus in on the most functionally important parts of the non-coding regions of the genome," co-senior author Mark Gerstein, a computational biology and bioinformatics researcher at Yale University, said in a statement. "This is not just beneficial for cancer research, but can be extended to other genetic diseases, too."
"Although we see that the first effective use of our tool is for cancer genomes, this method can be applied to find any potential disease-causing variant in the non-coding regions of the genome," the Sanger Institute's Chris Tyler-Smith, co-senior author on the study, said in a statement.
The ability to discern functionally important variants is critical for interpreting information in the human genome and finding changes that can produce disease, the researchers noted. But the consequences of many variants are unknown and tricky to define, especially those occurring outside of protein-coding sequences.
Conservation across multiple mammalian species can offer some clues to the importance of various sequences. For the current study, though, investigators turned to available human population data, reasoning that "signatures of purifying selection identified by using population-variation data could provide better insights into the importance of a genomic region in humans than evolutionary conservation."
Using polymorphism patterns determined for the 1,092 genomes profiled for the 1000 Genomes Project, the team searched for sequences that appear to be sensitive to alterations and subject to purifying selection.
"As expected," the researchers wrote, "we found that having variants from 1,092 individuals allowed us to detect specific functional categories under strong purifying selection with greater power than previously possible."
To further tease apart functionally important variants, researchers incorporated other types of data, too, including mutation information from the Human Gene Mutation Database, patterns found in one individual's genome sequence, interaction network data, and results from the ENCODE project.
By applying this type of analysis across hundreds of sequence categories, the team got a sense of the relative strength of selection working in different coding and non-coding sequences. For instance, some apparent transcription factor binding sites appeared especially sensitive to mutation, as did genes at the heart of interaction networks.
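As a rough illustration of how sequence categories might be compared for selection strength, the Python sketch below ranks categories by the share of their variants that are rare in the population, on the reasoning that purifying selection keeps damaging alleles at low frequency. The article does not spell out the statistic the team used, so the rare-variant fraction, the 0.5 percent frequency cutoff, the function names, and the toy data are all assumptions made for this example, not details of FunSeq.

```python
from collections import defaultdict

# Assumed cutoff: a variant counts as "rare" below this allele frequency.
# The value is illustrative and is not taken from the article.
RARE_FREQUENCY_CUTOFF = 0.005


def rare_fraction_by_category(variants, cutoff=RARE_FREQUENCY_CUTOFF):
    """Return {category: fraction of its variants that are rare}.

    `variants` is an iterable of (category, allele_frequency) pairs.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [rare, total]
    for category, frequency in variants:
        counts[category][1] += 1
        if frequency < cutoff:
            counts[category][0] += 1
    return {cat: rare / total for cat, (rare, total) in counts.items()}


def rank_categories(variants, cutoff=RARE_FREQUENCY_CUTOFF):
    """Sort categories from most to least enriched for rare variants."""
    fractions = rare_fraction_by_category(variants, cutoff)
    return sorted(fractions.items(), key=lambda item: item[1], reverse=True)


# Invented toy data for demonstration only; not results from the study.
toy_variants = [
    ("tf_binding_site", 0.001),
    ("tf_binding_site", 0.002),
    ("tf_binding_site", 0.200),
    ("intergenic", 0.300),
    ("intergenic", 0.001),
    ("intergenic", 0.400),
    ("intergenic", 0.100),
]

ranking = rank_categories(toy_variants)
```

In this toy setup, a category whose variants are mostly rare lands at the top of the ranking, which is the sort of signal that could flag it as sensitive to mutation. A real analysis would also need to account for differences in mutation rate, sequence context, and sample size across categories.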
The investigators formalized their functional variant and mutation predictions into a computational tool known as FunSeq, which focuses on regions of the genome that seem especially sensitive to change. Using information from around 90 tumor genomes, they demonstrated that the tool could identify almost 100 candidate non-coding driver mutations. The tumor set included 21 breast cancer genomes, three medulloblastoma samples, and dozens of prostate cancers.
"This allows us to take a systematic approach to cancer genomics," Gerstein said. "Now we do not need to limit ourselves to the roughly [1 percent] of the genome that codes for proteins but can explore the rest of our DNA."
The study's authors noted that it should be feasible to gain insights into other types of disease risk by scrutinizing the same sorts of mutation-sensitive non-coding elements defined in the current analysis.
"Because they cover a small fraction of the entire genome (comparable to the exome), these regions can be probed alongside exome sequences in clinical study," they wrote, explaining that the variant sorting scheme may be further refined in the future by folding in additional population profiles and other types of genomic data.