NYU Researchers Use Long Sequence Reads to Study Mutation Origins


NEW YORK – Researchers from New York University have developed a method that uses long-read sequencing to analyze how mutations arise in the genome. Along the way, they reached an estimated sequencing quality score of Q140, or one error in every 100 trillion base pairs, for detecting base substitutions.
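For readers unfamiliar with quality scores, the Q140 figure follows from the standard Phred convention, in which a score Q corresponds to an error probability of 10^(-Q/10). The short sketch below shows only that arithmetic; the function names are illustrative and are not taken from the preprint or from any sequencing software.

```python
import math


def phred_to_error_rate(q: float) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def error_rate_to_phred(p: float) -> float:
    """Convert an error probability to a Phred-scaled quality score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    rate = phred_to_error_rate(140)
    # Q140 corresponds to roughly one error per 1e14 (100 trillion) bases.
    print(f"Q140 error rate: {rate:.1e}")
    print(f"Bases per expected error: {1 / rate:.3g}")
```

By the same formula, Q30 corresponds to one error in 1,000 bases, so each additional 10 points on the scale means a tenfold drop in the error rate.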

"Our goal was to understand and identify the lesions in DNA that precede mutations," said Gilad Evrony, a faculty member at NYU Langone's Center for Human Genetics and Genomics and an assistant professor of pediatrics. To do that, his lab needed a way to analyze single molecules at single-base resolution. That would require a sequencing method that didn't use amplification, "so we can believe the things we see, even if we only see [them] once," he said.

The solution was to use long-read sequencing from Pacific Biosciences, specifically the HiFi protocol, which sequences the same molecule several times in a loop so that errors in any one pass can be corrected by the others.

First, they tried it with off-the-shelf reagents, but that wasn't enough. "We had to spend about two years of development to push the PacBio sequencer to its limits," Evrony said. "We didn't know that we would ever be successful, but it turns out it is capable of reaching this."

In February, his team posted a preprint describing HiDEF-seq (hairpin duplex enhanced fidelity sequencing) on BioRxiv and expects publication in a journal later this year. The analysis included "the first single-stranded DNA signatures … of defective polymerase epsilon proofreading with and without functional mismatch repair," they wrote.

"With our technology we can see, for the first time, processes at the mutational level before they even become mutations," Evrony said. "It shows us the very first steps by which mutations arise." Evrony has filed a provisional patent application on the method, and the preprint disclosed that he owns stock in Pacific Biosciences, as well as in Illumina and Oxford Nanopore Technologies.

Methods like HiDEF-seq could be key to the pharmaceutical industry's efforts to ditch animal-based in vivo mutagenicity and toxicity screening methods that are currently required by regulators prior to human trials. Broadly speaking, error-corrected sequencing "can give you the same information that traditional tests give you," but in much less time, said Connie Chen, a program manager at the Health and Environmental Sciences Institute (HESI), a nonprofit organization in Washington, D.C. In vivo studies can take years and cost millions of dollars, she said, while sequencing experiments may be much quicker. So far, sequencing methods under consideration, such as TwinStrand Biosciences' duplex sequencing, rely on short-read sequencing, but long reads are up next. "We're just starting to get into that discussion," Chen said.

HiDEF-seq builds upon PacBio's HiFi sequencing, which is technically an error-correction method in its own right, and is one of several so-called duplex sequencing methods that link opposing strands to analyze them together. Oxford Nanopore also offers a duplex sequencing chemistry to boost accuracy.

The first modification Evrony's group made to PacBio sequencing was to use smaller molecules that would get read more times than usual. HiFi usually reads a molecule between three and five times; with HiDEF-seq, this increases to more than 20 times per strand, with a median of 32. "We got unbelievable sequencing accuracy, but that wasn't enough," Evrony said. "It revealed artifacts of the library preparation process that had never been seen before."
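To see why more passes per strand help, consider a toy consensus in which each position is decided by majority vote across passes. This is only an illustration of the general idea: it assumes the passes are already aligned and of equal length, and it is not the consensus algorithm used by PacBio or by the HiDEF-seq pipeline, which the article does not describe. The function name is hypothetical.

```python
from collections import Counter


def majority_consensus(passes: list[str]) -> str:
    """Return a per-position majority-vote consensus.

    Assumes every pass is pre-aligned and has the same length; real
    consensus callers align the passes and use probabilistic models.
    """
    if not passes:
        raise ValueError("at least one pass is required")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("all passes must have the same length")
    # zip(*passes) yields one tuple of bases per position (column).
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*passes)
    )


if __name__ == "__main__":
    # The third pass carries a single error (G read as T at position 3).
    reads = ["ACGT", "ACGT", "ACTT"]
    print(majority_consensus(reads))
```

With only a few passes, a chance error repeated in two of them can win the vote; with dozens of passes per strand, random errors are far less likely to form a majority at any position.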

Based on an initial analysis, they concluded that restriction enzymes used to fragment DNA were likely causing single-strand nicks, which led to mismatched deoxyadenosines being introduced during the A-tailing step of library preparation. They eliminated this artifact by adding a nick ligase to "seal" the damage and by adopting a method of DNA end repair that was first introduced by researchers at the Wellcome Sanger Institute as part of the NanoSeq protocol and published in 2021. For DNA samples with a higher level of fragmentation, they removed the A-tailing step altogether. "Eventually, we reached a stage where single-strand [lesion] calls of DNA were down to zero for healthy blood DNA," Evrony said. "Then we were ready to see whether we could actually detect double- and single-strand changes at the same time."

HiDEF-seq is more expensive than HiFi sequencing because it analyzes smaller molecules. Moreover, the computational analysis applies stringent filters, meaning it may take more raw data to generate enough reads that pass quality control. Evrony estimated that the final cost per high-fidelity gigabase (Gb) is about $550. That decreases to $140 per Gb on PacBio's newest instrument, the Revio.

A version of HiDEF-seq with larger DNA fragments further reduces that to $81 per Gb, compared to about $50 per Gb with NanoSeq. "If only double-strand mutations are of interest, and not single-strand changes, then standard HiFi sequencing fragment sizes may also be able to achieve single-molecule fidelity, which would further reduce the cost," Evrony added.

One downside to the method is that it doesn't reach single-molecule fidelity for indels, though the lab is trying to achieve this.

The researchers also built a computational analysis pipeline to call both types of changes and applied it to a sample from an individual with a genetic predisposition to cancer, where they expected to see higher levels of both lesion types. "We saw remarkable correspondence between single- and double-strand lesions," Evrony said. "We're really, truly seeing the precursor lesions."

Those lesions may not actually be at the same location, he noted, but the patterns of double-stranded lesions following single-stranded ones are there in the same sample. "It shows us the underlying mutational process," he said.

These patterns corresponding to mutational processes are important for potential applications of the technology. "There are many things that cause mutations, from mutagens like ultraviolet rays and smoking to endogenous processes," he said, and each cause has a "fingerprint" of single-strand lesions.

Pharmaceutical firms may want to know whether mutations in animals match signatures of their drug or other exposures, and error-corrected sequencing is also useful in other situations, such as environmental monitoring and chemical exposures, Chen said.

Evrony's lab is also using the technology to look at somatic mosaicism in human tissues as part of a program that recently won $140 million in funding from the National Institutes of Health. He is in touch with other labs about implementing HiDEF-seq but declined to name them at this time.

"Our lab's long-term vision is to create a catalog of single-stranded DNA damage and mismatch signatures" that can be related to double-strand mutations as well as DNA repair signatures, he said. "That will really give a full picture of how our genome mutates."