A team of Broad Institute researchers has developed a panel of qPCR assays designed to identify the primary causes of PCR amplification bias in Illumina sequencing libraries, and built an optimized protocol that can help reduce such bias, according to a recent paper.
The scientists said they expect the new protocol to amplify sequencing libraries more evenly than the standard Illumina protocol, and to minimize bias introduced by factors such as choice of thermocycler and PCR amplification enzyme, and temperature ramp rate.
And although their optimized protocol does not minimize bias in all scenarios, it is expected to improve sequencing of GC-rich regions of the human genome, which contain important information for cancer and medical genetics studies, the researchers said.
The group, led by Broad researchers Daniel Aird and Andreas Gnirke, first disclosed their work in a presentation at BioMed Central's Beyond the Genome conference in Boston in October. This week they published additional research details in an advance online paper in Genome Biology.
The team explained that massively parallel sequencing platforms such as the Illumina HiSeq 2000 instruments employed in Broad's Genome Sequencing and Analysis Program frequently suffer from under-representation and reduced quality at loci with extreme base compositions, in particular GC-rich loci.
In order to better understand the sources of this bias and possibly diminish their effects, they systematically dissected the process, using qPCR instead of Illumina sequencing as a way to quickly read base-composition bias.
To that end, the group developed a panel of qPCR assays for loci ranging from 6 percent GC to 90 percent GC, using microbial DNA samples of different base compositions as a test substrate: Plasmodium falciparum, with 19 percent GC content; Escherichia coli, with 51 percent GC content; and Rhodobacter sphaeroides, with 69 percent GC content.
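For readers unfamiliar with the metric, GC content is simply the share of G and C bases in a sequence. The following is a minimal sketch of that calculation; the function name and the toy sequences are hypothetical illustrations, not loci or code from the study.

```python
def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C among the unambiguous A/C/G/T bases.

    Ambiguous bases such as N are ignored. This helper is illustrative
    only and is not taken from the Broad paper.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc / total


# Hypothetical toy amplicons chosen to span low, middle, and high GC.
toy_amplicons = {
    "at_rich": "ATATTTAAATTAATCA",
    "balanced": "ATGCATGCATGCATGC",
    "gc_rich": "GCGGCCGCGGGCCCGA",
}

for name, seq in toy_amplicons.items():
    print(f"{name}: {gc_fraction(seq):.1%} GC")
```

Applied to whole amplicons, a calculation like this is how a locus would be placed on the 6 percent to 90 percent GC scale that the assay panel covers.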
The group also developed qPCR assays for loci in the human genome that represent four categories of underrepresented sequence motifs, as well as GC-rich promoters known to be underrepresented or missing in whole-genome sequencing data sets.
As detailed in the Genome Biology paper, the researchers then tracked the relative abundance of these loci throughout the standard Illumina library protocol, and saw no significant introduction of bias in early steps such as shearing, end repair, adaptor ligation, and size selection.
However, they found that during the subsequent PCR-enrichment step, both GC-rich and GC-poor sequences were significantly depleted. Specifically, they noted that as few as ten PCR cycles using the standard enzyme formulation (Phusion HF DNA polymerase from Thermo Fisher's Finnzymes business) under the standard thermocycling conditions depleted loci with a GC content of more than 65 percent to about one-hundredth of the level of reference loci, and depleted loci with a GC content of less than 12 percent to about one-tenth of pre-amplification levels.
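The article does not say how the qPCR readouts were converted into relative abundance. One common approach is the comparative Ct method, sketched below under the assumption of perfect doubling per cycle. The function name and all Ct values are hypothetical and are not data from the paper.

```python
def relative_abundance(ct_test_pre, ct_ref_pre, ct_test_post, ct_ref_post,
                       efficiency=2.0):
    """Fold change of a test locus relative to a reference locus.

    Uses the comparative Ct (delta-delta Ct) approach. `efficiency` is the
    per-cycle amplification factor; 2.0 assumes perfect doubling, which is
    an assumption of this sketch rather than a figure from the study.
    """
    delta_pre = ct_test_pre - ct_ref_pre      # test vs. reference before PCR enrichment
    delta_post = ct_test_post - ct_ref_post   # test vs. reference after PCR enrichment
    return efficiency ** -(delta_post - delta_pre)


# Hypothetical Ct values: the test locus crosses threshold 6.64 cycles
# later than the reference after enrichment, but not before it.
fold = relative_abundance(ct_test_pre=20.0, ct_ref_pre=20.0,
                          ct_test_post=26.64, ct_ref_post=20.0)
print(f"relative abundance after enrichment: {fold:.4f}")
```

Under these assumptions, a shift of roughly 6.6 cycles corresponds to a drop to about one-hundredth, and a shift of roughly 3.3 cycles to about one-tenth, which gives a sense of the size of the qPCR signal behind the depletion figures reported above.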
They also identified amplification-based bias influenced by their choice of thermocycler — two different machines from Eppendorf and an Applied Biosystems platform — as well as the instruments' default ramp speeds.
They concluded that "an overly steep thermoprofile does not leave sufficient time above a critical threshold temperature, causing incomplete denaturation and poor amplification of the GC-rich fraction."
Using the worst-performing thermocycler — in this case the Eppendorf Mastercycler ep gradient S — the researchers then set out to optimize the protocol to reduce amplification bias. They reasoned that a protocol that worked well on that instrument would also work well on the better-performing machines.
Using their qPCR assays, the team assessed amplification bias after toggling a number of variables, including the use of different PCR enzymes (Phusion HF versus Life Technologies' AccuPrime Taq HiFi); the addition of PCR-enhancers betaine or DMSO; and the thermocycling profiles.
In the end, for the microbial DNA samples, "no single PCR protocol was ideal," the researchers wrote, noting that the best protocol for high-GC regions — Phusion with betaine — led to poor representation of high-AT loci.
Meanwhile, the protocol that worked best for high-AT regions, AccuPrime Taq HiFi with primer extension at 60°C, compromised the high-GC fraction.
However, for the human genome loci, the researchers were able to develop an optimized protocol that produced PCR-amplified libraries showing little systematic bias arising during sample preparation across the range from 15 percent to 80 percent GC content. They also reported significantly improved representation of challenging human-sequence motifs both in the PCR-amplified library and in their final Illumina sequencing reads.
In their paper, the researchers conceded that their solution was still a compromise.
"None of our conditions work equally well at rescuing, at the same time, under-representation of regions that are either extremely GC-rich or GC-poor," they wrote. "At the time of this writing, by our assay, PCR with AccuPrime Taq HiFi at a low primer-extension temperature is the best compromise."
However, they noted they "did not test an exhaustive list of PCR enzymes and reaction conditions," and added that it is possible that "other enzymes would perform as well or even better."
The study also identified other, albeit lesser, sources of bias, such as the downstream cluster amplification and sequencing-by-synthesis steps of the Illumina protocol.
Nevertheless, "improved PCR conditions like the one described here will likely satisfy the vast majority of projects," they wrote in their paper. "Enhancing the coverage of high-GC loci is critical for human genome and exome sequencing in cancer and medical genetics, the major sequencing applications in terms of bases generated.
"Solving the loss of AT-rich loci remains a challenge, but has less of an impact on human genome sequencing and on the sequencing field as a whole," they concluded.
The researchers noted they are continuing to investigate potential ways to ameliorate the bias effects of downstream cluster amplification and sequencing-by-synthesis, as well as factors that may improve sequencing of AT-rich loci.