UW Team Develops Single-Molecule Tagging Method to Reduce Sequencing Errors, Detect Very Rare Variants


Building on a targeted sequencing method that uses molecular inversion probes to capture and sequence genes, researchers at the University of Washington have added a single-molecule tagging component to reduce the sequencing error rate and enable the detection of low-frequency variants.

The method, described in a recent paper in Genome Research, "has relevance for tackling different kinds of heterogeneity," both in research and eventually the clinic, by enabling the detection of rare mutations in cancer, somatic mosaic mutations in Mendelian or other disorders, and mutations from a blood sample that may be present at a lower frequency than in the diseased tissue, senior author Jay Shendure told In Sequence.

The method is "clever" and the researchers demonstrated "error rates that are clinically useful," said Isaac Kinde, a graduate student at Johns Hopkins University who has worked on Safe-SeqS, a similar tagging-based method for identifying rare mutations.

The UW team tested the protocol on cancer cell lines and clinical cancer samples, demonstrating that they could detect mutations down to a 1 percent frequency with 83 percent sensitivity and no false positives, and that they could detect mutations present with a frequency as low as 0.2 percent. They estimated the per-base error rate was 8.4 x 10^-6 in cell lines and 2.6 x 10^-5 in the clinical samples.

The single-molecule tagging component enables the detection of low-frequency variants because "instead of trying to cancel out errors by sampling a diploid population over and over again, now you're taking it on a molecule-by-molecule basis," Joe Hiatt, lead author and graduate student in Shendure's lab, explained. "For every single molecule, you basically take the consensus, or average, of all the sequence base calls, where before, you were effectively trying to do that assuming there were at most two copies of a particular locus in your sample."

With the new method, "you're not really concerned about that; you're just taking every single molecule that you're processing and producing a unique consensus sequence for that molecule," he added.

According to Shendure, the method adds negligible cost, a couple of extra percentage points on top of the capture reagent costs, and it is built into the standard molecular inversion probe workflow, much of which can be automated.

He said his lab is now considering incorporating the method into all of its resequencing studies because "even in germline sequencing, we're finding it can add a lot of value in terms of quality control and base calling."

Last year, Shendure's lab reported in Science an approach for multiplexed targeted sequencing using MIPs (IS 12/18/2012). In that study, the team used MIPs to target and sequence 44 candidate genes from 2,446 autism samples.

Now, the team has combined that strategy with single-molecule tagging to increase the accuracy. In their recent Genome Research paper, the researchers designed a pool of 1,312 single-molecule MIP 80-mer oligonucleotides targeting the coding regions of 33 cancer-related genes, encompassing around 125 kilobases of sequence.

The technique involves two layers of indexing. The index sequence resolves capture products from the source DNA, while the molecular tag resolves "reads derived from distinct genomic equivalents within individual source DNAs," the authors wrote. Then, before aligning to the reference genome, overlapping regions of read-pairs are combined to produce forward-reverse reads, or fr-reads. These reads are then aligned to the genome, and the molecular tag is used to group fr-reads to form single-molecule consensus reads, or smc-reads.

Essentially, each MIP is embedded with a tag, which provides a "unique identifier for each capture event," Shendure explained.
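The grouping step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name, the tuple layout, and the `min_reads` and `min_agreement` thresholds are hypothetical, and it assumes that fr-reads from the same probe are already aligned and start at the same position, so bases can be compared column by column.

```python
from collections import Counter, defaultdict


def build_smc_reads(fr_reads, min_reads=2, min_agreement=0.6):
    """Collapse fr-reads into one consensus sequence per (probe, tag) pair.

    fr_reads: iterable of (probe_id, molecular_tag, sequence) tuples.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    # Reads sharing a probe and a molecular tag are treated as copies
    # of the same original captured molecule.
    groups = defaultdict(list)
    for probe_id, tag, seq in fr_reads:
        groups[(probe_id, tag)].append(seq)

    smc_reads = {}
    for key, seqs in groups.items():
        if len(seqs) < min_reads:
            continue  # too few copies to form a consensus
        length = min(len(s) for s in seqs)
        consensus = []
        for i in range(length):
            # Majority vote at each position; emit N when agreement is weak.
            base, count = Counter(s[i] for s in seqs).most_common(1)[0]
            consensus.append(base if count / len(seqs) >= min_agreement else "N")
        smc_reads[key] = "".join(consensus)
    return smc_reads


example = [
    ("probe1", "AAGT", "ACGTAC"),
    ("probe1", "AAGT", "ACGTAC"),
    ("probe1", "AAGT", "ACTTAC"),  # one copy carries a sequencing error
    ("probe1", "CCGA", "ACGTAC"),  # singleton tag, skipped by min_reads
]
result = build_smc_reads(example)
print(result)
```

In this toy input the erroneous base in the third read is outvoted by the other two copies of the same molecule, which is the intuition behind the lower error rate of smc-reads.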

The team tested its technique in two sets of experiments simultaneously. First, in order to determine sensitivity and positive predictive value and to see how precisely the method could call low-frequency variants, they applied it to two HapMap cell lines and six mixtures of the two lines, targeting 33 cancer-related genes. Additionally, to determine its practical utility, the team tested it on 45 samples encompassing a range of cancers, including 40 formalin-fixed paraffin-embedded samples representing non-hematological cancers.

The team performed 55 capture reactions in parallel, using around 500 nanograms of DNA, and sequenced the samples on the Illumina HiSeq 2000. They obtained an average smc-read coverage of 3,538x across the targeted bases in the HapMap samples and an smc-read coverage of 1,051x across the targeted bases from the clinical samples.

The smc-reads were then used to call clonal genotypes from the samples. For the HapMap samples, the team was able to compare the calls to data from the 1000 Genomes Project. In the first sample, they detected 24 out of the 25 variant sites. The one undetected site was not called due to inadequate coverage.

The team also detected two additional variants that had not been previously detected, but were subsequently validated and also found through manual inspection of more recent data from the 1000 Genomes Project. From the second HapMap sample, the team detected 41 out of 44 known variants, with the three missed variants being due to inadequate coverage.

Based on these data, they estimated the sensitivity of the assay for clonal homozygous or heterozygous variation to be 93 percent to 96 percent, with a positive predictive value near 100 percent.
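The article does not say how the authors derived that range, but the per-sample detection counts reported above give fractions that line up with it. A short check, with variable names chosen here for illustration:

```python
# Known variant sites detected / known variant sites, per HapMap sample,
# using the counts reported in the article.
samples = {
    "HapMap sample 1": (24, 25),
    "HapMap sample 2": (41, 44),
}


def sensitivity(detected, known):
    """Fraction of known variant sites that the assay called."""
    return detected / known


rates = {name: sensitivity(d, k) for name, (d, k) in samples.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```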

They then created six different mixtures of DNA from the HapMap samples to assess the method's ability to detect sub-clonal variation. The dilutions represented allele frequencies ranging from 11 percent to 0.2 percent.

They found that when smc-read coverage was at least 100x, the observed and expected variant frequencies were concordant.
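One way to see why coverage matters for rare variants is a simple binomial sampling model, sketched below. This is an illustrative simplification and not the authors' analysis: it treats each smc-read as an independent draw from the sample, ignores sequencing errors, and the function name is hypothetical.

```python
from math import comb


def prob_at_least(k, depth, freq):
    """Probability of seeing at least k variant-bearing smc-reads.

    Assumes each of `depth` smc-reads independently carries the variant
    with probability `freq` (a binomial model, used here as an assumption).
    """
    p_fewer = sum(
        comb(depth, i) * freq**i * (1 - freq) ** (depth - i) for i in range(k)
    )
    return 1.0 - p_fewer


# Depths taken from figures quoted in the article; frequencies span the
# range of the mixtures. Requiring 3 supporting reads is an arbitrary choice.
for depth in (100, 1051, 3538):
    for freq in (0.11, 0.01, 0.002):
        p = prob_at_least(3, depth, freq)
        print(f"depth {depth}x, frequency {freq:.1%}: P(>=3 reads) = {p:.3f}")
```

Under this model, a variant at a fraction of a percent is unlikely to be sampled several times at 100x coverage but is sampled repeatedly at several thousand-fold coverage.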

Error rates of the smc-reads were also greatly reduced compared to the fr-reads. For the HapMap samples, error rates were reduced 13-fold to 8.4 x 10^-6 per base from 1.1 x 10^-4 per base. For the hematologic clinical samples, error rates were reduced 12-fold to 9.5 x 10^-6 per base from 1.1 x 10^-4 per base. The improvement for FFPE samples was smaller, but still 5-fold, to 2.7 x 10^-5 per base from 1.3 x 10^-4 per base.
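The fold reductions follow directly from the quoted per-base error rates. A quick arithmetic check, with labels and names chosen here for illustration:

```python
# (sample type, fr-read error rate, smc-read error rate), per base,
# as quoted in the article.
error_rates = [
    ("HapMap", 1.1e-4, 8.4e-6),
    ("hematologic clinical", 1.1e-4, 9.5e-6),
    ("FFPE", 1.3e-4, 2.7e-5),
]


def fold_reduction(before, after):
    """How many times lower the error rate is after consensus calling."""
    return before / after


folds = {label: fold_reduction(b, a) for label, b, a in error_rates}
for label, fold in folds.items():
    print(f"{label}: {fold:.1f}-fold reduction")
```

Rounded to whole numbers, these ratios match the 13-fold, 12-fold, and 5-fold figures reported.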

Examining the source of errors and the differences between the HapMap and clinical samples, the researchers found that the majority of the errors were due to DNA damage, and that, unsurprisingly, these errors were especially prevalent in the FFPE samples.

To assess the clinical applicability of the method, the researchers compared the smMIP method to single-gene testing for clinically informative mutations. Their 45 clinical samples had been previously tested for clinically relevant mutations in the cancer genes BRAF, EGFR, FLT3, JAK2, KIT, KRAS, NRAS, and PDGFRA. The smMIP assay detected 25 out of 27 previously identified mutations and detected an additional two mutations in KRAS in two lung cancer samples. The two mutations that the smMIP assay missed were large insertions in the FLT3 gene, "although these could in principle be detected using a more sensitive analysis strategy and/or more densely tiled probes in this region," the authors wrote.

Next, they looked at the assay's ability to detect very low-frequency substitutions at clinically informative sites in BRAF, EGFR, JAK2, KRAS, NRAS, and PDGFRA. Before filtering for variant frequency or confidence, the researchers detected 17 candidate variants, but they restricted the remaining analysis to the seven that were called with high confidence. These candidate mutations were present at frequencies between 0.24 percent and 7.5 percent.

Based on experimental and biological evidence, the candidate variants appeared to be real. However, the authors note that for variants at frequencies near 0.1 percent, the false discovery rate could be as high as 40 percent, so "it remains possible that one or more of these variants is artifactual."

Hiatt added that the low-frequency mutations were consistent with what is known about the cancers and the types of mutations that they typically harbor, so "are likely to be valid."

"Whether they're biologically significant and clinically significant is still an open question," he added, "but this technology enables that study."

Shendure added that aside from studying low-frequency mutations in cancer, the technology will enable the study of variants in heterogeneous samples. For instance, diseases that manifest themselves in specific tissues will have disease-related variants at higher frequency in those tissues, but such samples are not always accessible, as in the case of brain-related diseases like Parkinson's or Alzheimer's. While those same mutations may be present in blood samples, they will often be present at a lower frequency, so the method could be applied to these cases as well.

Finally, the team modified the method slightly, streamlining it and running it on the Illumina MiSeq, in order to create a protocol more amenable to a clinically useful time frame. They estimated that the MiSeq protocol could be done in 72 to 96 hours, compared to around two weeks on the HiSeq due to its 10-day sequencing run.

Shendure said that while some work still needs to be done to optimize the protocol and refine the informatics, the group is now focused on applying the method. "What could be better is the MIP design as well as the informatics of dealing with both the grouping tags as well as the more standard challenges of informatics like base calling, particularly when you're looking for these low-frequency events."

"We're in the process of incorporating this technology into MIPs more generally in the lab and applying it to a number of different research questions," Hiatt added.

The researchers also believe the protocol is amenable to being applied clinically because it is "compatible with very fast turnarounds," said Shendure. Additionally, there are "relatively few processing steps, which makes it attractive," Hiatt added.

Shendure declined to comment on whether he plans to commercialize the method. Currently, there are a number of other protocols for assessing low-frequency mutations.

Aside from Johns Hopkins' Safe-SeqS technique, which researchers recently used to identify rare mutations related to endometrial and ovarian cancer from a Pap smear (CSN 2/6/2013), startup Population Genetics Technology uses a tagging approach to improve on error rates, which it described in a Nucleic Acids Research study.

The Safe-SeqS method and the smMIP technique are comparable, said Hopkins' Kinde, because they are both based on tags. However, one potential difference could be coverage. For instance, the UW team used more than 1,300 probes in the smMIP technique, and inevitably some will perform better than others, Kinde said, which could result in inadequate coverage of some areas of the genome.

Additionally, Shendure said that while one advantage of Safe-SeqS is its simplicity, it may be less multiplexable than the MIP-based resequencing approach. For instance, in the Genome Research study, the UW team targeted 125 kilobases of sequence over 33 genes, while in the Hopkins researchers' most recent study in Science Translational Medicine, Safe-SeqS was used to target around 10 kilobases of sequence over 12 genes.

Kinde added that the two methods were both able to significantly reduce error rates and improve rare variant detection, and each may have advantages in different applications. Given the potential clinical significance of low-frequency variants, "we need lots of people thinking about how to best detect rare mutations," he said.