NEW YORK (GenomeWeb) – Next-generation sequencing can analyze massive amounts of DNA in parallel, but it also produces a fair number of errors. Inaccurate base calls can be caused, for example, by the sequencing chemistry, PCR mistakes, or damage to the DNA, and researchers have devised various strategies to weed out these errors.
Scientists at New England Biolabs have now found that one particular source of sequencing artifacts — DNA damage — appears to play a greater role than previously appreciated and might lead to incorrect low-frequency variant calls if not accounted for. In a paper appearing in Science today, the researchers, led by Laurence Ettwiller, a staff scientist at NEB, and Thomas Evans, scientific director of the DNA enzymes division at NEB, showed that much of the sequencing data from the 1000 Genomes Project and the Cancer Genome Atlas, both widely used resources, came from samples with damaged DNA.
They also proposed a new metric, called the Global Imbalance Value (GIV) score, to estimate the damage in DNA samples from sequencing data, allowing scientists to check their own samples. In addition, they found that repairing DNA prior to sequencing may reduce the sequencing error rate.
That damaged DNA leads to sequencing errors has been known for a long time, particularly for ancient DNA and DNA from formalin-fixed paraffin-embedded (FFPE) samples. In addition, researchers from the Broad Institute reported in 2013 that acoustic shearing of DNA during sample preparation causes oxidative damage, which leads to artifactual DNA mutations, in particular G to T changes.
According to Max Diehn, an assistant professor of radiation oncology at Stanford University who was not involved in the NEB study, experts in the field have known about the problem for a long time. "However, probably it's not as widely recognized by groups that don't focus on NGS applications, and given how common NGS applications are becoming and how many researchers are using these tools, that may be the biggest contribution of the study, to make others aware," he said in an interview. Diehn's group published a study in Nature Biotechnology last year that described a two-step error correction method for circulating tumor DNA sequencing.
"This is an important paper that emphasizes how careful one has to be to interpret sequencing data in a reliable fashion," said Bert Vogelstein, director of the Ludwig Center at Johns Hopkins and a prominent cancer researcher, in an email. When genome-wide sequencing first started, he said, every variant was validated, a practice that has fallen to the wayside because it takes more work to validate mutations than to find them. "The new paper will hopefully inspire better methods to validate mutations, particularly those that are of clinical interest," he said.
The original goal of the NEB researchers was to improve sequencing accuracy for damaged DNA, such as FFPE samples, Ettwiller said in an interview. But before she and her colleagues embarked on this, they wanted to have a measure of DNA damage, leading them to devise the GIV score. The score — a separate one for each of the 12 possible mutation types — relies on the fact that DNA damage results in an imbalance between variants detected in read 1 and read 2 of a paired-end sequencing run. A GIV score of 1 indicates no DNA damage, while samples with a score above 1.5 are considered damaged. Researchers interested in checking their own samples can download the NEB team's "damage estimator" algorithm from GitHub.
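To illustrate the principle, here is a minimal sketch that computes a GIV-style score from an aligned BAM file. It is not the NEB team's published damage estimator, which applies additional quality filters; the function name is hypothetical, and the pysam dependency and the requirement for MD tags are assumptions of this sketch.

```python
import collections

import pysam  # assumed dependency; BAM must be indexed and carry MD tags


def giv_scores(bam_path):
    """Sketch of a GIV-style score: for each mismatch type, compare its
    frequency in read 1 versus read 2 of each pair. True variants appear
    equally in both reads, while damage shows up as an imbalance."""
    mismatches = {1: collections.Counter(), 2: collections.Counter()}
    covered = {1: collections.Counter(), 2: collections.Counter()}
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch():
            if (read.is_unmapped or read.is_secondary
                    or read.is_supplementary or read.is_duplicate):
                continue
            mate = 1 if read.is_read1 else 2
            seq = read.query_sequence
            # with_seq=True needs MD tags; the reference base comes back
            # in lowercase wherever the read disagrees with the reference
            for qpos, _, ref in read.get_aligned_pairs(matches_only=True,
                                                       with_seq=True):
                ref_base, read_base = ref.upper(), seq[qpos]
                if ref_base not in "ACGT" or read_base not in "ACGT":
                    continue
                covered[mate][ref_base] += 1
                if read_base != ref_base:
                    mismatches[mate][ref_base + ">" + read_base] += 1
    scores = {}
    for variant in set(mismatches[1]) | set(mismatches[2]):
        ref_base = variant[0]
        freq1 = mismatches[1][variant] / max(covered[1][ref_base], 1)
        freq2 = mismatches[2][variant] / max(covered[2][ref_base], 1)
        if freq2 > 0:
            scores[variant] = freq1 / freq2  # ~1 undamaged, >1.5 flagged
    return scores
```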
With the GIV score in hand, the scientists started to look at damage in different types of DNA samples, and to their surprise, not only DNA from FFPE tissue but also from fresh frozen samples contained damage, although it was a different type, mostly G to T changes. "When we saw that, we thought, is this just NEB, is there something that we do wrong with the data?" Ettwiller recalled.
To check how pervasive this type of damage is, they went on to reanalyze raw sequence reads from the 1000 Genomes Project and found the same result — some of the samples seemed to have been heavily damaged. A second reanalysis, of sequence data from the Cancer Genome Atlas, yielded similar results.
Library prep seems to be one cause of DNA damage: for the 1000 Genomes data, the researchers were able to correlate the level of damage with the type of library prep used. However, because they did not have access to details of the library prep protocols used in that project, they could not pinpoint the damage to specific steps. They noticed, though, that the mutation profile — predominantly G to T — was similar to what the Broad's study had observed, so it might result from oxidative damage introduced during DNA shearing.
"I think the general sequencing community was not really aware of this and certainly had no way to determine the extent of damage in their samples, which is why the GIV, we are hoping, will be very helpful to the community," Evans said. For example, researchers can use it to flag and possibly eliminate data from samples with too much damage.
Knowing about DNA damage will be especially important for clinical laboratories sequencing cancer samples. "They definitely need to be aware of this," Ettwiller said, and determining a GIV score for every sample could help them decide how much they can trust their data.
But knowing about the DNA damage is only the first step — researchers also need to know how to deal with the resulting errors.
"The [sequencing] datasets are good, you just have to be aware of this as a potential problem," Evans said. "Like a PCR error – just because PCR makes mistakes, it doesn't mean PCR is bad, it just means you need to be aware of it in the analysis."
One way to deal with damage-induced sequencing errors is to filter out affected reads, which can be flagged because the G to T mutations occur in one read direction but not the other. This effectively removes most erroneous mutations, Ettwiller said, but it may also get rid of reads that contain true mutations present at a low frequency, so it will decrease the sensitivity for such mutations.
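As a rough illustration of that trade-off, the sketch below flags a candidate variant whose alternate-allele support comes almost entirely from one read of the pair. The function name and thresholds are illustrative assumptions, not part of any published pipeline, and real filters would also weigh base and mapping quality before discarding reads.

```python
def looks_like_damage(alt_read1, alt_read2, min_support=3,
                      max_minority_frac=0.1):
    """Flag a candidate variant whose ALT support is confined to one read
    of the pair, the one-directional pattern expected of damage artifacts.
    Thresholds are illustrative assumptions, not published values."""
    total = alt_read1 + alt_read2
    if total < min_support:
        return True  # too little evidence to trust either way
    minority = min(alt_read1, alt_read2)
    # genuine low-frequency variants with lopsided sampling can also
    # trip this filter, which is the sensitivity cost noted above
    return minority / total < max_minority_frac


# e.g. 12 ALT-supporting read 1s and zero read 2s looks like damage,
# while a 7-to-5 split across both reads does not
assert looks_like_damage(12, 0) and not looks_like_damage(7, 5)
```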
According to Diehn, these types of filtering strategies are already used by many groups, although some erroneous mutations may still sneak through, depending on how stringent the filtering is. Clinical laboratories that sequence patient samples, in particular, are already very vigilant in how they filter their data, he said. "Most labs use very stringent filtering and very strict criteria that, if anything, err on the side of under-calling rather than over-calling," he said. "But, obviously, I don't know every lab, and it is not to say that there could not be clinical labs that are not aware of this, and in those kind of situations, if they filter wrongly, you could have problems."
Another possibility to reduce sequence errors caused by DNA damage is to repair the damage prior to sequencing. In their paper, the NEB researchers tested this by preparing Illumina sequencing libraries from sheared DNA with and without in vitro DNA repair using the NEBNext FFPE DNA Repair Mix, which contains a cocktail of DNA glycosylases, endonuclease, polymerase, and ligase. Using this repair mix helped to alleviate the sequencing errors, Evans said, "but whether that will be generally useful to the community, time will tell."
Diehn said his team tried the DNA repair strategy on ctDNA, which is unsheared, as part of the study they published last year and did not find any benefit from it beyond the filtering and other error suppression strategies they already used. While he might have used different DNA repair enzymes than the NEB researchers did, he thinks the effect might depend on how the DNA was prepared and whether shearing is involved.
The NEB team is currently investigating other ways to minimize DNA damage-derived sequencing errors, Ettwiller said, including various steps in the library prep process and additional enzymes that repair different types of damage, not only in DNA from fresh frozen samples but also in DNA from FFPE samples.
These studies serve NEB's overall goal of increasing the quality of library prep and, as a result, the quality of sequencing data, Evans said, especially for heterogeneous samples such as cancer samples. Another part of that effort, he said, is improving the accuracy of the polymerases used in library prep.