NEW YORK (GenomeWeb) – DNA secondary structures known as G-quadruplexes have been studied for years for their potential roles in gene function, gene regulation, genome stability, and cancer progression, and as potential drug targets. However, determining where such structures form in the genome has proven challenging, and methods have mostly been based on computational predictions.
Now, researchers from the University of Cambridge have developed a next-generation sequencing-based approach to detect G-quadruplex (G4) structures. They published the method, along with its use to map G4 structures in the human genome, this week in Nature Biotechnology.
The method, known as G4-seq and developed in Shankar Balasubramanian's lab, involves a couple of tweaks to the standard Illumina sequencing protocol that enable researchers to detect G4 structures by measuring "perturbations" in the polymerase, Balasubramanian explained.
The team used the method to map the locations of G4 structures in the human genome and plans to deposit the map in a public database for others in the field to mine. Balasubramanian also said that he plans to apply the method to other genomes in order to study the role of G4 structures in human disease, with the eventual goal of advancing clinical trials of small molecules that target the structures.
The study "provides a new scope of what the G quadruplex is," Balasubramanian told GenomeWeb. "Some of these structures would have been impossible to predict using any of the previous prediction tools." The study also presents the "first detailed, experimentally derived map of G4s in the genome."
G4s only form in certain conditions. The idea behind G4-seq is that the researchers first devise conditions in which the structures form in the DNA and then sequence the DNA according to a normal protocol. The structure itself will cause a detectable perturbation in the polymerase.
In the study, the researchers found that G4s are very stable in potassium but not in sodium or lithium.
To test how G4 structures and buffer conditions affect sequencing, they sequenced a DNA library spiked with four known control sequences: two containing stable G4 structures, one mutated to prevent G4 formation, and one that cannot fold into a G4.
Next, they sequenced the libraries, supplementing Illumina buffers with either lithium, sodium, or potassium salts.
Overall sequencing quality scores were not affected under any of the three buffer conditions. However, when the buffer was supplemented with potassium salt, quality scores dropped at specific locations. For the two control sequences known to contain G4s, the mismatch rates were 34 percent and 46 percent, respectively.
By contrast, the control sequences known to not have G4 structures maintained high quality for all conditions.
To identify the presence of G4 structures while maintaining high accuracy in base calling and alignment, the researchers sequenced each template twice: once using the sodium buffer, so that the G4 structures would not form, and once using the potassium buffer, so that they would.
Typically, "if your Q scores drop during a read, what normally happens is computationally, you get rid of those reads and treat them as bad sequences," Balasubramanian said. "In our method, we don't do that. We look to see where there is discontinuity as compared to the sodium buffer, for each read," he said. Then, "we map the differences, and that's how we find out where the quadruplexes start."
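The read-by-read comparison Balasubramanian describes can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the window size, and the minimum mismatch count are all assumptions made for illustration, and the two reads are assumed to be the same length and already paired by template.

```python
from typing import List, Optional


def mismatch_profile(na_read: str, k_read: str) -> List[bool]:
    """Flag each position where the potassium read differs from the sodium read."""
    if len(na_read) != len(k_read):
        raise ValueError("reads must cover the same template positions")
    return [a != b for a, b in zip(na_read, k_read)]


def mismatch_rate(na_read: str, k_read: str) -> float:
    """Percentage of positions that differ between the two reads."""
    profile = mismatch_profile(na_read, k_read)
    return 100.0 * sum(profile) / len(profile)


def first_discontinuity(
    na_read: str,
    k_read: str,
    window: int = 5,          # hypothetical window length
    min_mismatches: int = 3,  # hypothetical cutoff within that window
) -> Optional[int]:
    """Return the 0-based index of the first mismatch that opens a run of
    errors (at least `min_mismatches` within `window` bases), or None."""
    profile = mismatch_profile(na_read, k_read)
    for i, is_mismatch in enumerate(profile):
        if is_mismatch and sum(profile[i:i + window]) >= min_mismatches:
            return i
    return None


# Toy reads (invented): the potassium read agrees with the sodium read
# for ten bases and then degrades.
sodium_read = "ACGTACGTACGGGTTAGGGT"
potassium_read = "ACGTACGTACTTACCATCCA"
start = first_discontinuity(sodium_read, potassium_read)
```

Requiring several mismatches within a short window is one simple way to ignore an isolated base-calling error and respond only to the sustained run of errors that a stalled polymerase would be expected to produce.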
Aside from potassium, the team also tested pyridostatin (PDS), a G4-stabilizing ligand. For the two control sequences with known G4s, mismatch rates were 45 percent and 66 percent, respectively. In addition, the researchers noted that the sequencing errors accumulated after the G4 start sites.
"When the polymerase encounters a stable G4 in the DNA template, a pause is induced, which can effectively truncate the reading of the template sequence," the authors wrote. "When this happens, the sequencer will continue to generate what appears to be a scrambled sequence beyond this point."
Next, the team analyzed 32 million reads comprising around 110,000 predicted G4s. In both potassium and PDS buffers, mismatch rates were higher in the sequences predicted to have G4s, with a median of 20 percent for potassium and 35 percent for PDS. However, there was also a small fraction of sequences not predicted to have G4 structures that showed a similarly high mismatch rate, "suggesting that the number and nature of human genomic G4s is substantially broader than previously predicted," the authors wrote.
Finally, the researchers applied G4-seq genome-wide to a reference cell line, NA18507, using the Illumina HiSeq. They performed sequencing under sodium buffer conditions for the first read and either potassium or PDS conditions for the second read. Each experiment was performed in duplicate and generated at least 285 million reads.
To determine potential G4 structure start sites, the researchers set mismatch thresholds of 25 percent for PDS and 14 percent for potassium. Under those criteria, the team called 716,310 G4s under the PDS condition and 525,890 G4s under the potassium condition. Of the 361,424 G4s that had been previously predicted computationally, 73 percent were detected using the PDS buffer and 60 percent were detected using the potassium buffer. Ninety percent of the G4s detected in the potassium buffer were also found in the PDS condition, and of the total number of G4s called, 383,984 were common to both conditions.
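As a rough illustration of threshold-based calling, the Python sketch below applies the two cutoffs reported in the study to a handful of invented sites and then intersects the calls. Only the 25 percent and 14 percent thresholds come from the article; the site names, the mismatch values, the function name, and the choice to treat a value equal to the threshold as a call are assumptions.

```python
from typing import Dict, Set

# Mismatch thresholds (percent) as reported for each stabilizing condition.
THRESHOLDS = {"PDS": 25.0, "K+": 14.0}


def call_g4_sites(mismatch_by_site: Dict[str, float], condition: str) -> Set[str]:
    """Return the sites whose mismatch percentage meets the condition's threshold.

    Using >= rather than > is an assumption; the article does not say.
    """
    threshold = THRESHOLDS[condition]
    return {site for site, pct in mismatch_by_site.items() if pct >= threshold}


# Invented per-site mismatch percentages, for illustration only.
pds_mismatch = {"site_a": 40.0, "site_b": 26.0, "site_c": 10.0, "site_d": 30.0}
k_mismatch = {"site_a": 20.0, "site_b": 12.0, "site_c": 5.0, "site_d": 15.0}

pds_calls = call_g4_sites(pds_mismatch, "PDS")
k_calls = call_g4_sites(k_mismatch, "K+")

# Sites called under both conditions, and the share of potassium calls
# that are also seen with PDS.
shared_calls = pds_calls & k_calls
k_also_in_pds = len(shared_calls) / len(k_calls)
```

Intersecting the two call sets mirrors the kind of cross-condition comparison the researchers report, in which agreement between two independent stabilizing conditions lends support to each call.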
"The high overlap between distinct G4-stabilizing conditions provides independent validation of the assignment of [observed quadruplexes]," the authors wrote.
The researchers also noticed a couple of trends with regard to the genomic locations of the G4s. There was a high density in promoter regions, 5' and 3' untranslated regions, and repetitive elements, Balasubramanian said. In addition, "for certain genes, we found they were densely populated in parts of the gene body." The structures were also significantly associated with oncogenes, tumor suppressors, and somatic copy number alterations related to cancer development.
For instance, they observed a high density of G4s in the genes MYC, TERT, AKT1, FGFR3, and BCL2L1, which all relate to somatic copy number amplifications. "This is consistent with a mechanistic link between G4s and the sites of genomic instability, a hallmark of cancer," the authors wrote.
"These observations seem to concur with some of the previous ideas and some experimental support with regards to function of G4s," Balasubramanian said.
Aside from identifying predicted G4s, the team also found G4s in a number of cancer-related genes not previously thought to contain such structures, including BRCA1, BRCA2, and MAP3K8, suggesting that those genes may be "particularly sensitive to treatment with G4-stabilizing ligands," the authors wrote.
Further studies mapping G4s and studying their functional impact could have implications for cancer treatment, Balasubramanian said. "Already, there have been a couple of molecules that target G-quadruplexes that have made it into phase II trials," he said. "But I think now, the data we have, I think takes us to a new level in this field compared to where we were five years ago."
The goal now will be to uncover "mechanistic evidence of what these structures do to genomes as well as to figure out the pathways associated with them," and to "really understand whether and how we can exploit them as therapeutic targets," he said.