NEW YORK (GenomeWeb) – A team led by Human Longevity's Craig Venter and the Scripps Research Institute's Amalio Telenti, the former CSO of Human Longevity, has built a map of sequence constraints for the human genome that it says will help with the interpretation of noncoding regions and genetic variants.
The researchers, who describe their findings today in Nature Genetics, built the map from 11,257 whole-genome sequences and the 16,384 possible heptamers, or seven-nucleotide sequences. They noted that the map differs from traditional maps of interspecies conservation and that it identified regulatory elements among the most constrained regions of the genome.
"Using new Hi-C experimental data, we describe a strong pattern of coordination over 2 megabases where the most constrained regulatory elements associate with the most essential genes," the authors wrote. "Constrained regions of the noncoding genome are up to 52-fold enriched for known pathogenic variants as compared to unconstrained regions (21-fold when compared to the genome average). This map of sequence constraint across thousands of individuals is an asset to help interpret noncoding elements in the human genome, prioritize variants, and reconsider gene units at a larger scale."
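As a rough illustration of what a fold-enrichment figure like these expresses, the sketch below compares the density of pathogenic variants in two sets of regions. The function name and the toy counts are hypothetical, and the article does not say whether the authors computed enrichment in exactly this way.

```python
def fold_enrichment(variants_a, bases_a, variants_b, bases_b):
    """Ratio of variant density (variants per base) in region set A to set B.

    Hypothetical helper for illustration only; not taken from the study.
    """
    if bases_a <= 0 or bases_b <= 0 or variants_b <= 0:
        raise ValueError("bases must be positive and set B must contain variants")
    # Cross-multiplied form of (variants_a / bases_a) / (variants_b / bases_b).
    return (variants_a * bases_b) / (variants_b * bases_a)


# Toy counts, not data from the study: same number of bases in each set.
print(fold_enrichment(52, 1_000_000, 1, 1_000_000))
```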
The researchers used genomic metaprofiles, which integrate sequence variation and frequency across genomic landmarks that share the same sequence, structure, or function. They generated massive alignments of k-mers to determine the probability of variation for each nucleotide in the entire genome, given its surrounding nucleotides. Specifically, they noted that heptamers proved useful for the analysis because recent studies have shown that the heptanucleotide context explains more than 81 percent of variability in substitution probabilities.
To capture how widely variation differs among heptamers, the researchers computed the rate and frequency of variation at the fourth, or middle, nucleotide of each heptamer and found that it varied 95-fold across heptamers. They then used these values to define the expected variation for each nucleotide in the genome.
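The heptamer step can be sketched roughly as follows. This is a simplified illustration rather than the authors' pipeline: it counts only whether a position is variant and ignores allele frequency, and the function names and toy inputs are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product

K = 7          # heptamer length
MID = K // 2   # 0-based index 3, i.e. the fourth nucleotide


def all_heptamers():
    """Enumerate every possible heptamer over A, C, G, T (4**7 = 16,384)."""
    return ["".join(p) for p in product("ACGT", repeat=K)]


def heptamer_rates(sequence, variant_positions):
    """Fraction of each heptamer's occurrences whose middle base is variant.

    `variant_positions` is a set of 0-based positions observed to vary.
    """
    seen = defaultdict(int)
    varied = defaultdict(int)
    for start in range(len(sequence) - K + 1):
        kmer = sequence[start:start + K]
        if set(kmer) - set("ACGT"):   # skip windows with ambiguous bases
            continue
        seen[kmer] += 1
        if start + MID in variant_positions:
            varied[kmer] += 1
    return {kmer: varied[kmer] / seen[kmer] for kmer in seen}


def expected_per_position(sequence, rates):
    """Assign each position the rate of the heptamer centered on it.

    Positions without a full, unambiguous heptamer context stay None.
    """
    expected = [None] * len(sequence)
    for start in range(len(sequence) - K + 1):
        kmer = sequence[start:start + K]
        if kmer in rates:
            expected[start + MID] = rates[kmer]
    return expected
```

In this sketch the genome-wide rate for a heptamer becomes the expectation for every nucleotide that sits at the center of that heptamer, which is the sense in which the expectation is context dependent.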
"A given heptamer or region may have rates of observed variation that are higher or lower than the rates estimated across the genome. We defined the context-dependent tolerance score (CDTS) as the absolute difference of the observed variation from the expected variation. Thereafter, we divided the genome into equally sized regions using a sliding window of 550 base pairs to study context-dependent constraint without consideration of existing annotation," the authors wrote.
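A minimal sketch of a windowed score along these lines is shown below. It reads "absolute difference" as a plain rather than relative difference, that is, observed minus expected, so that negative values mean less variation than expected; that reading is an assumption. The step size is not given in the article, so `step` is a hypothetical parameter, as are the function name and inputs.

```python
def cdts_windows(observed, expected, window=550, step=1):
    """Score each sliding window as sum(observed) - sum(expected).

    `observed` and `expected` are equal-length per-position lists of
    variation values. Returns a list of (window_start, score) tuples.
    Assumption: a lower score means a more constrained window.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    scores = []
    for start in range(0, len(observed) - window + 1, step):
        obs = sum(observed[start:start + window])
        exp = sum(expected[start:start + window])
        scores.append((start, obs - exp))
    return scores


# Toy example with a 4-position window instead of 550 base pairs.
print(cdts_windows([0, 0, 0, 0, 1, 1], [0.5] * 6, window=4, step=2))
```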
They ranked the genome on the basis of this score, from the most to the least context-dependent constrained, and identified patterns of enrichment and depletion for specific genomic elements. Certain findings were expected, the investigators said. For example, protein-coding exons were strongly context-dependent constrained. They also found that some chromosomes had a larger share of constrained sequence.
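The ranking step could look something like the sketch below, which orders window scores and assigns each window a percentile. Binning into percentiles and treating the lowest score as the most constrained are assumptions made for illustration; the function name is hypothetical.

```python
def rank_percentiles(scores):
    """Return a percentile (1..100) for each window score.

    Percentile 1 holds the lowest scores, taken here to be the most
    constrained windows. Ties are broken by input order.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    percentiles = [0] * n
    for rank, idx in enumerate(order):
        percentiles[idx] = rank * 100 // n + 1
    return percentiles


# Toy scores for four windows.
print(rank_percentiles([0.0, -2.0, 1.5, -0.5]))
```

Annotated elements such as exons or promoters could then be tallied per percentile to look for enrichment or depletion.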
They noted in particular that the distribution of genomic elements was robust to changes in the study population, and they conducted further analyses based on CDTS values computed from a subset of 7,794 unrelated individuals.
They found that constrained noncoding, regulatory regions in human populations can be identified by the CDTS, noting that "a large proportion of the constrained human noncoding genome is associated with regulatory elements such as promoters, enhancers, transcription factor binding sites, and regions associated with active chromatin marks."
They further speculated that the most constrained regulatory regions regulate the most functionally important genes, and found that their data supported the concept of constrained, coordinated regulatory and coding units that span large genomic distances. When they assessed whether CDTS ranking was a good proxy for functionality and for the consequences of mutations, the researchers found that CDTS captured the highest proportion of variants uniquely detected by a single metric. They also found that CDTS requires no prior knowledge and so captures a very specific set of pathogenic variants that other metrics do not detect.
"In summary, we assessed constraint of the human genome based solely on human variation. Its clinical relevance is manifested by the enrichment of known pathogenic variants in the constrained genome. A practical implementation of this observation is the targeting of sequencing efforts beyond the exome," the authors wrote. "Many exons could possibly be eliminated from targeted analysis while including an equivalent amount of sequence that represents the most constrained regions of the noncoding genome."