Through a combination of sequence sleuthing and laboratory experiments, a Broad Institute team has unearthed a set of low allelic fraction mutations in some tumor samples that are caused by oxidation during DNA preparation itself rather than by authentic biological processes.
These mutation artifacts — first detected in deep sequence data on tumors tested by targeted capture-based sequencing — typically convert cytosine bases to adenine (or the complementary bases, guanine to thymine), Broad researchers explained in a study appearing online recently in Nucleic Acids Research.
While the artifacts were only found in a low percentage of reads, their presence could "certainly derail the accuracy and limit of detection" in cancer studies and other projects looking for rare mutations, the authors note.
Furthermore, they argue that the oxidation-mediated event they uncovered "is one of the myriad of possible low frequency errors that could be induced during NGS sample preparation" and call for a "systematic review" of data obtained using different protocols from various laboratories "to identify whether there are any types of other artifacts that may be induced during extraction and/or library preparation that could be wrongly attributed to the biology of a given disease."
In the paper, senior author Gad Getz and colleagues described methods used to detect the artifacts and determine their source. But they also presented DNA preparation steps designed to curb the production of these oxidation-induced genetic changes, along with informatics-based approaches for weeding the artifacts out of existing tumor sequence data.
"We really had fun being part of this detective story," senior author Gad Getz, director of the Broad Institute's cancer genome analysis group, told In Sequence.
Within a few weeks of finding the artifact problem, he explained, researchers working in the sequencing and analysis arms of the Broad had tracked down its cause, found ways to directly measure it, and come up with laboratory and informatics-based strategies for dealing with it.
At first, though, the source of the genetic glitches was far murkier. While doing cancer-specific analyses on tumor sequence data coming out of the Broad's Genomics Platform last spring, Getz recalled, members of his team noticed an unusual mutation signature among a set of mutations represented by relatively few reads.
These alterations were unearthed when the team assessed its data using a newly developed mutation caller known as MuTect, which is especially sensitive to these low allelic fraction mutations.
"Once we applied this new tool, which was still in development at that phase, we started detecting many more mutations in some samples," Getz said. "And what was interesting about those mutations was that they had a very different spectrum compared to other mutations that we typically see in cancer."
Typically, for instance, many of the best-supported mutations found in tumor samples involve base changes that swap in thymine in place of cytosine. That mutation signature tends to be especially common in a sequence context where the original cytosine falls between a thymine and a guanine base.
"The most common pattern that we see is elevated [cytosine to thymine] whenever [cytosine] precedes [guanine]," Getz noted. "This is kind of a standard mutation caused by spontaneous deamination of methylated CpGs."
"Once we started looking at the lower allelic fraction mutations, we started to see, popping up, a different signature," he added.
In contrast to the more common cytosine to thymine mutation signature, this signature in the low allelic fraction of the read data was instead characterized by cytosine to adenine exchanges.
These often appeared on cytosines nestled between cytosine and guanine bases, though there was a jump in cytosine-to-adenine mutations within practically any sequence context in which cytosine was preceded by another cytosine.
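To make the idea of a sequence-context signature concrete, here is a minimal sketch of how such a tally might be built. It is an illustration rather than the Broad team's code: the function names and the four example calls are hypothetical. It assumes each call comes with the three reference bases centered on the mutated position, and it folds guanine-to-thymine changes onto their cytosine-to-adenine complements, as the study's description of the artifact suggests.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def normalize(context, alt):
    """Express a substitution with a pyrimidine (C or T) reference base.

    `context` is the 3-base reference sequence centered on the mutated
    base. A G>T change in context CGG, for example, is reported as the
    equivalent C>A change in context CCG on the opposite strand.
    """
    if context[1] in "CT":
        return context, alt
    return reverse_complement(context), COMPLEMENT[alt]


def tally(calls):
    """Count substitutions by (trinucleotide context, substitution class)."""
    counts = Counter()
    for context, alt in calls:
        ctx, folded_alt = normalize(context.upper(), alt.upper())
        counts[(ctx, f"{ctx[1]}>{folded_alt}")] += 1
    return counts


# Hypothetical calls: (reference trinucleotide, alternate base).
calls = [("CCG", "A"), ("CGG", "T"), ("ACG", "T"), ("CCA", "A")]
spectrum = tally(calls)
for (context, change), n in sorted(spectrum.items()):
    print(context, change, n)
```

In a tally like this, an excess of C>A counts in contexts that begin with CC, concentrated among low allelic fraction calls, would correspond to the unusual pattern described above.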
"That was very exciting, because we thought maybe we found a new type of carcinogen or a mechanism that generates mutations in cancer that was not known before and affects only low allelic fraction mutations," Getz said.
Had they been authentic, he explained, such mutations might have provided a peek at late events during cancer evolution, revealing alterations that affected just a small fraction of tumor cells and had not yet become fixed in the population, for instance.
As it happened, though, a fortuitous set of validation experiments stopped the scientists before they got too far down that path.
By calling mutations in matched normal samples relative to tumors — a reverse of the usual tumor analysis strategy — the researchers demonstrated that the low frequency cytosine-to-adenine changes were every bit as common in the normal samples as they were in tumor sequences.
Though present to varying degrees in reads from both the tumor and normal samples, these changes seemed slightly more common in samples prepared using low DNA input volumes as well as samples prepared using targeted sequence capture protocols.
Based on such patterns, the team determined that these mutation artifacts were likely being introduced at some stage of the DNA extraction, library preparation, or sequencing process.
To figure out just when and why this was happening, the data analysis group began putting their heads together with researchers from the Broad's Genomics Platform, where the samples had been sequenced.
"It was an extremely good collaboration between our groups," Getz noted, "because we immediately started iterating to figure out how [the artifact mutation] was being generated."
To begin characterizing the artifact and trying to find its source, for instance, co-author Timothy Fennell, with the Broad's Genomics Platform, came up with a metric known as Artifact-Q, or ArtQ, for measuring the mutation signature across sequence sets generated at the Broad in the past.
That metric was in place even before the researchers knew exactly what kind of glitch they were dealing with, explained the Broad's Maura Costello, first author on the new study.
Because the artifact had a distinct signature, as well as some strand and read specificity, Costello told In Sequence, it was possible to track its presence or absence across a massive dataset that included all of the exomes sequenced at the Broad since late 2008 or early 2009, as well as many of the genome sequences.
"We plugged that in as sort of a pipeline metric," she said. "So it was something we could run without having to do mutation calling."
When they ran the ArtQ script as a pipeline metric for the hybrid capture samples, for example, the researchers found that the mutation artifact had been turning up with increasing frequency in samples sequenced since about 2010.
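The article does not spell out how ArtQ is calculated, so the following is only a sketch of what a quality-style pipeline metric of this kind could look like. It assumes a Phred-like scale, and it assumes the artifact's strand and read specificity is exploited by comparing mismatches seen in the artifact-consistent read configuration against mismatches of the same type seen in the opposite configuration. The function name and its three inputs are hypothetical.

```python
import math


def phred_scaled_artifact_score(artifact_like_bases, control_bases, total_bases):
    """Phred-scale the excess of artifact-like mismatches (assumed formula).

    artifact_like_bases: mismatches in the artifact-consistent configuration
    control_bases:       same mismatch type in the opposite configuration
    total_bases:         all bases examined at the relevant reference sites

    Higher scores mean fewer excess artifact-like mismatches.
    """
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    # Floor the excess at 1 so the logarithm is always defined.
    excess = max(artifact_like_bases - control_bases, 1)
    return -10.0 * math.log10(excess / total_bases)


# Hypothetical counts for a clean library and an affected one.
clean = phred_scaled_artifact_score(120, 100, 2_000_000)
affected = phred_scaled_artifact_score(2_100, 100, 2_000_000)
print(round(clean, 1), round(affected, 1))
```

Because a score like this needs only per-base mismatch counts, it can be computed for every library without running a mutation caller, which is the property Costello describes.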
"It did seem to get worse or more pronounced since about the end of 2010," Costello explained, corresponding with a time when researchers at the Broad and other sequencing centers started lowering the amount of input DNA used to prepare their Illumina libraries.
Even so, lower DNA input alone wasn't the only factor contributing to artifact production, it seemed, since the team still saw variable levels of these changes in samples produced with similar amounts of starting material.
And the sequencing step itself was not the problem, the researchers determined, because they saw similar artifact signatures regardless of whether samples were run on Illumina HiSeq or MiSeq instruments, using different Illumina chemistries, or with the Ion Torrent platform.
Likewise, researchers ruled out the targeted capture step, by sequencing libraries from the same samples before and after exome enrichment.
As they continued to work backwards from sequencing in a stepwise manner in the lab and through literature searches, the researchers ultimately realized that the artifact that they were seeing was a consequence of oxidation related to contaminants present at the DNA shearing stage of DNA preparation — a pattern they verified using an oxidation-specific ELISA kit.
When samples in contaminant-containing buffers were sheared using the 150 base pair protocol, Costello explained, oxidation levels were "through the roof."
"It wasn't every sample, but we were able to show that the 150 base pair shearing protocol that we run in the Covaris shearing instrument … had a higher rate [of artifact mutations] than if you sheared to a larger, 500 base pair size," Costello said.
That effect may be more pronounced for samples produced from lower levels of input DNA, she added, because the acoustic energy used for shearing the genetic material acts on a smaller number of bases overall, increasing the chance that any given base will be affected.
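The dilution argument can be put as back-of-the-envelope arithmetic. The sketch below assumes that the shearing energy delivered is the same regardless of input and uses the common approximation of about 330 grams per mole for a nucleotide within a DNA strand. The input amounts are hypothetical and are not taken from the study.

```python
# Approximate average molar mass of one nucleotide in a DNA strand (g/mol).
NUCLEOTIDE_G_PER_MOL = 330.0


def moles_of_bases(dna_nanograms):
    """Convert a DNA mass in nanograms to moles of bases."""
    if dna_nanograms <= 0:
        raise ValueError("dna_nanograms must be positive")
    return dna_nanograms * 1e-9 / NUCLEOTIDE_G_PER_MOL


def relative_energy_per_base(input_ng, reference_ng):
    """Energy per base at input_ng relative to reference_ng.

    Assumes the same total energy is delivered in both cases, so the
    ratio is simply the inverse ratio of the moles of bases present.
    """
    return moles_of_bases(reference_ng) / moles_of_bases(input_ng)


# Hypothetical inputs: 100 ng compared with a 3,000 ng reference.
ratio = relative_energy_per_base(input_ng=100, reference_ng=3000)
print(f"Each base sees about {ratio:.0f}x the energy")
```

Under these assumptions the per-base exposure scales inversely with the amount of input DNA, which is the intuition behind the explanation, though not a demonstration that it is correct.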
"The way we sort of conceptualized this is that basically every base now has an increased potential for having this sort of oxidation reaction happening to it, because you have the same amount of energy but actually fewer moles of DNA bases in there," Costello said, though she cautioned that the researchers still "don't have 100 percent direct evidence for that."
Questions also remain about the chemical context that could be upping the chances of DNA oxidation during shearing. Going forward, the researchers are ultimately interested in figuring out exactly what the offending contaminant actually is and where in the DNA extraction protocol it gets introduced.
"We eventually would like to figure out what in these extraction protocols is leaving these oxidative radicals that are then getting excited during the shearing process and leaving oxidation so we can give best practices to labs that are submitting samples for sequencing," Costello said.
In the meantime, though, the team has come up with solutions on the DNA preparation and analysis sides for preventing and dealing with the resulting artifacts.
On the laboratory side, they've started doing routine buffer exchanges on DNA samples coming into the center, replacing all of the solutions bathing the samples with a buffer that contains the chelating chemical EDTA.
That buffer exchange step seems to be working well as a fix for samples coming from a wide range of collaborating labs, Costello noted.
Consequently, this buffer exchange step is now being applied for a wide range of samples being sequenced by the center, though the artifact problem is generally more of an issue when dealing with tumor samples, which are far more heterogeneous than other samples being sequenced routinely.
The team has also developed an informatics-based approach for filtering the spurious mutations — nicknamed oxoG lesions, based on a characteristic 8-oxoguanine nucleotide formed during the accidental oxidation process — out of existing sequence data.
The filter, which researchers applied to mutations called by MuTect, takes into account so-called "fraction of alternate allele reads in the oxoG artifact configuration," or FoxoG, ratios as well as the overall mutation patterns present in a given tumor sample.
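As a rough illustration of the FoxoG idea, the sketch below computes the fraction of alternate allele reads that fall in the artifact configuration and applies a simple cutoff. It is not the Broad filter: the `Call` record, the way reads are assigned to the artifact configuration, and the 0.9 threshold are all assumptions made for the example, and the sample-wide mutation patterns that the actual filter also considers are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Call:
    ref: str
    alt: str
    alt_reads_artifact_config: int  # alt reads in the oxoG-consistent orientation
    alt_reads_other_config: int     # alt reads in the opposite orientation


def foxog(call: Call) -> Optional[float]:
    """Fraction of alt reads in the artifact configuration, or None if no alt reads."""
    total = call.alt_reads_artifact_config + call.alt_reads_other_config
    if total == 0:
        return None
    return call.alt_reads_artifact_config / total


def is_oxog_candidate(call: Call) -> bool:
    """Only C>A and G>T changes match the artifact described in the study."""
    return (call.ref, call.alt) in {("C", "A"), ("G", "T")}


def passes_filter(call: Call, max_foxog: float = 0.9) -> bool:
    """Keep a call unless it is a candidate with a high FoxoG (hypothetical cutoff)."""
    if not is_oxog_candidate(call):
        return True
    fraction = foxog(call)
    return fraction is None or fraction < max_foxog


# Hypothetical calls.
calls = [Call("C", "A", 9, 1), Call("C", "A", 5, 5), Call("C", "T", 10, 0)]
kept = [c for c in calls if passes_filter(c)]
print(len(kept), "of", len(calls), "calls kept")
```

A genuine mutation should be supported by reads in both configurations, so its FoxoG should sit near 0.5, whereas an oxidation artifact should be supported almost entirely by reads in one configuration.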
"We basically built kind of a classifier," Getz said, explaining that the current version of the filter is more advanced than that described in the Nucleic Acids Research study. A more detailed description of the MuTect mutation caller will appear in another upcoming publication, he noted.
The Broad group has also been in touch with other large sequencing centers — particularly those collaborating on the Cancer Genome Atlas project — to make sure they are aware of the potential preparation-related artifact problem and are equipped to deal with it.
"It's crucial that they know that this exists so once they change their protocol they could test whether they generate these kinds of mutations or not," Getz said.