By testing nearly two dozen polymerase enzymes under different PCR conditions, researchers from the Wellcome Trust Sanger Institute and elsewhere have come up with a library preparation method tailored to sequencing genomes that are especially rich in adenine and thymine bases.
"We have developed and optimized library-preparation procedures suitable for low quantity starting material and tolerant to extremely high AT content," Sanger researcher Michael Quail, the study's senior author, and co-authors wrote.
As they reported online in BMC Genomics last week, the investigators amplified DNA from an organism with a notoriously AT-rich genome — the malaria-causing parasite Plasmodium falciparum — and then compared sequences obtained when they sequenced the libraries on the Illumina platform.
Along with PCR-amplification variables, the team also tested some alternative amplification methods for producing DNA libraries, including isothermal recombinase polymerase amplification and a linear amplification for deep sequencing method involving an RNA amplification step.
In the end, they were most successful in getting coverage across the AT-rich P. falciparum sequences in both lab and clinical strains of the malaria parasite using a PCR-based approach and the Kapa HiFi polymerase from Kapa Biosystems. The enzyme performed best in the presence of the thermostabilizing chemical tetramethylammonium chloride, or TMAC, though results from a related study by members of the same group hint that the additive may be dispensable depending on the buffer mix used with the enzyme.
The researchers have also optimized library preparation methods to amplify areas with neutral nucleotide representation or with an over-abundance of guanine and cytosine residues — findings that they published in Nature Methods late last year.
"We were looking firstly for a set of conditions that worked well for Plasmodium, and we found those," said Quail, who is corresponding author on the Nature Methods study. "But also, at the same time, if we're going to put that into production, we were keen to find a condition or formulation that would work well for a whole range of genomes."
By sequencing the genomes of microbes with a range of base compositions — from AT-rich to neutral to GC-rich — the group found that they could get fairly even coverage across all of these regions by doing sequence library amplification with Kapa HiFi.
Indeed, the researchers are confident that the Kapa HiFi enzyme, used under optimal reaction conditions, is suitable for a production pipeline that handles a whole range of genomes. Quail said his lab has now switched over to the Kapa HiFi polymerase.
Moreover, they are set to use their Illumina library preparation methods in a large-scale population genetic study of the malaria parasite being done in collaboration with other Sanger researchers.
The group plans to resequence P. falciparum genomes from thousands of clinical samples collected at field sites in Africa and Asia. They will then combine the genetic data with corresponding information on patient histories, disease outcomes, and disease severity to find SNPs related to outcome and severity.
Quail called the Plasmodium genome "the greatest challenge" to virtually any sequencing strategy, owing to its extreme AT content.
'A Great Help to the Field'
Although the proportion of adenine and thymine bases varies in the genomes of different Plasmodium species, the parasite that most often infects humans has around 80 percent AT content in its genome, Quail said. Intergenic regions and introns are even more AT-rich, and in some cases the two bases make up almost all of the nucleotides in those sequences.
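For readers who want to check base composition themselves, the short Python sketch below computes the AT fraction of a sequence and of sliding windows along it. It is illustrative only: the function names, window size, and example sequence are assumptions and do not come from the study.

```python
def at_fraction(seq: str) -> float:
    """Return the fraction of A and T among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    total = at + gc
    # Ambiguous bases such as N are ignored; an empty sequence gives 0.0.
    return at / total if total else 0.0


def windowed_at(seq: str, window: int, step: int) -> list:
    """Return the AT fraction of each full-length window along the sequence."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        at_fraction(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


# Hypothetical toy sequence, not real P. falciparum data.
example = "ATATATTTAAGCATTTATAA"
print(at_fraction(example))
print(windowed_at(example, window=10, step=5))
```

A window-based profile of this kind is one simple way to locate stretches, such as introns and intergenic regions, where AT content climbs well above the genome-wide average.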
Indeed, efforts to sequence the P. falciparum genome more than a decade ago were dogged by the same complexities, according to Malcolm Gardner, who helped sequence the draft genome of P. falciparum while working at the Institute for Genomic Research.
Gardner, now a global health researcher at the Seattle Biomedical Research Institute, said the P. falciparum genome was one of the most challenging to sequence.
"Even with the Sanger technology, we had many, many issues with it," he said. "We'd either get many reads that would truncate early or we would get inaccurate base calls."
As a result, researchers involved in the initial P. falciparum sequencing effort spent a great deal of time manually filling in sequence gaps and trying to come up with trial-and-error strategies for sequencing across AT-rich regions, Gardner recalled.
"It would certainly be a great help to the field if we had some tools that were specifically developed to deal with AT-rich templates, especially for such an important pathogen as Plasmodium," he said.
Sequencing technology has changed dramatically since the first P. falciparum genome was reported in 2002 and the availability of second-generation sequencing platforms has made it possible to generate a great deal of sequence quickly and at relatively low cost.
But because most library-preparation methods still rely on PCR-based DNA amplification prior to sequencing, Quail explained, any step in the process that produces unequal amplification of bases can lead to biases in the resulting genome sequence.
For example, he noted that polymerase enzymes often amplify some fragments preferentially when faced with a mixture of DNA pieces of different lengths and base compositions — a problem that gets compounded with each PCR cycle.
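The compounding effect can be illustrated with a toy model in which each fragment's copy number is multiplied by (1 + efficiency) in every cycle. The sketch below is a back-of-the-envelope calculation under that assumption; the efficiency values and cycle count are hypothetical and are not measurements from the study.

```python
def copies_after(cycles: int, efficiency: float, start: float = 1.0) -> float:
    """Copies after a number of PCR cycles, assuming a constant per-cycle
    efficiency between 0 (no amplification) and 1 (perfect doubling)."""
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return start * (1.0 + efficiency) ** cycles


def relative_representation(cycles: int, eff_a: float, eff_b: float) -> float:
    """How many times more abundant fragment A is than fragment B after
    amplification, if both started at equal copy number."""
    return copies_after(cycles, eff_a) / copies_after(cycles, eff_b)


# Hypothetical efficiencies: a neutral fragment at 0.95, an AT-rich one at 0.80.
for n in (1, 6, 12, 18):
    print(n, round(relative_representation(n, 0.95, 0.80), 2))
```

Under these assumed values, a per-cycle gap of about 8 percent grows to a roughly 2.6-fold difference in representation after 12 cycles, which is why small differences in how well a polymerase copies AT-rich fragments can translate into large coverage gaps.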
"The standard library preparation procedures that employ PCR amplification have been shown to cause uneven read coverage particularly across AT and GC rich regions, leading to problems in genome assembly and variation analyses," the researchers wrote.
In the case of Plasmodium, Quail explained that the Illumina platform produces very GC-biased sequence when DNA is prepared with standard library prep methods. Consequently, much of the Plasmodium genome is often missing or covered by very few reads when conventional protocols are used.
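One simple way to quantify how much of a genome is covered by too few reads is to count the positions whose read depth meets a minimum threshold. The Python sketch below assumes per-base depths are already available as a list of integers; the function name, the threshold, and the example depths are hypothetical and are not taken from the study.

```python
def callable_fraction(depths, min_depth):
    """Fraction of positions with read depth at or above min_depth."""
    if min_depth < 0:
        raise ValueError("min_depth must be non-negative")
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


# Hypothetical per-base depths for a short region with an AT-rich dropout.
region = [30, 28, 25, 4, 2, 0, 0, 3, 27, 31]
print(callable_fraction(region, min_depth=10))
```

The appropriate threshold depends on the variant caller and study design, so the value used here is only a placeholder.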
"We find the Illumina platform to be the most cost-effective high-throughput next-generation sequencing platform, and our production is all based around it," he said. "However, with the raw protocols we find that we get heavily GC-biased results for the Plasmodium genome."
"When we do sequencing, we find that really and truly only about 50 percent of the genome is accessible or present at sufficient depth to call variants," he added. "While we may have a few reads covering those regions, they are under-represented."
To try to solve such problems, the team has been working to come up with approaches to overcome that bias, including a PCR-free library prep method that they published in Nature Methods in 2009.
But while the PCR-free approach works well under some circumstances, Quail explained, it is not well suited to samples containing relatively little DNA, since it does not involve an amplification step.
The limitation is especially important for malaria clinical samples, which sometimes contain just a few hundred nanograms of parasite DNA, rather than the micrograms of DNA needed for PCR-free library preparation.
To begin searching for other alternatives, the team decided to do a systematic analysis of polymerase enzymes and other amplification variables, looking for methods that could produce libraries allowing for uniform sequence coverage of genomes heavy in adenine and thymine.
The comparison involved almost two dozen enzymes, including Kapa HiFi, Kapa2G Robust, Platinum pfx, AccuPrime Taq HiFi, and the Phusion enzymes typically used to amplify DNA for Illumina sequencing.
"We set about to do a screen of commercially available enzymes and other techniques we could think of," Quail explained.
For each of the enzymes tested, the team targeted AT-rich or relatively neutral sequences in a P. falciparum lab strain, using PCR conditions outlined by enzyme manufacturers in the presence and absence of the AT-thermostabilizing compound TMAC.
They also looked at whether they could improve coverage of AT-rich regions by preparing libraries with recombinase polymerase amplification, an isothermal method developed by the UK company TwistDx. In addition, they explored linear amplification for deep sequencing, a strategy in which DNA is transcribed into RNA with the T7 polymerase before being converted to complementary DNA.
While many polymerases were inhibited by TMAC, the Kapa HiFi, Kapa2G Robust, and Platinum pfx enzymes continued to function in the presence of the compound. The Kapa enzymes, but not Platinum pfx, appeared to efficiently amplify AT-rich sequences when TMAC was added, the researchers reported.
Along with this target analysis, the team looked at the coverage that they got doing Illumina paired-end sequencing of genomic libraries created using several different polymerase enzymes and DNA from either the P. falciparum lab strain 3D7 or from a clinical P. falciparum sample contaminated with host DNA.
The sequencing steps for the studies were done in triplicate, Quail explained, with researchers sequencing their initial libraries on the Illumina GAII and then generating technical replicate libraries that were sequenced on both the GAII and HiSeq platforms.
Again, the Kapa enzymes performed well, they found. Sequences produced using Kapa HiFi-amplified libraries showed the most even read depth across chromosome 11 of P. falciparum, where coverage of AT-rich regions was comparable to that seen with PCR-free methods when adequate starting material is available.
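Evenness of read depth can be summarized in several ways. The sketch below uses the coefficient of variation (standard deviation divided by mean) as one common choice; this is an assumption for illustration, not necessarily the measure the researchers used, and the depth values and library labels are invented.

```python
import statistics


def coverage_cv(depths):
    """Coefficient of variation of read depth; lower values mean more even coverage."""
    if not depths:
        raise ValueError("depths must not be empty")
    mean_depth = statistics.mean(depths)
    if mean_depth == 0:
        raise ValueError("mean depth is zero, so the ratio is undefined")
    # Population standard deviation, since every position in the region is included.
    return statistics.pstdev(depths) / mean_depth


# Hypothetical depths for the same region from two library preparations.
library_a = [22, 20, 19, 21, 18, 20]
library_b = [40, 35, 2, 1, 38, 4]
print(coverage_cv(library_a))
print(coverage_cv(library_b))
```

Comparing such a summary statistic across libraries made from the same DNA gives a rough, single-number view of which preparation spreads reads more uniformly.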
On the other hand, the isothermal amplification and linear amplification for deep sequencing methods did not seem to improve AT coverage and, in some cases, led to their own sorts of sequencing errors, such as duplicate or chimeric reads.
Overall, the researchers concluded that the "problems associated with PCR can be ameliorated through optimization, thereby allowing amplification to be used in generating sequencing libraries even from extremely AT-biased genomes."
"These alternative PCR conditions generate library fragments with increased coverage of extreme AT-rich regions," they wrote.
For his part, Gardner called the library preparation method described in the BMC Genomics study "promising" and said it "does seem to go some way toward solving some of the problems that are inherent in trying to sequence Plasmodium falciparum wild isolates."
Still, he noted that it is not the only strategy currently being used to tackle the problem of sequencing P. falciparum from clinical samples. In particular, Gardner pointed to a 2011 study led by Broad Institute researchers that used a solution hybrid selection strategy to separate Plasmodium from human DNA in clinical samples prior to amplification.
"What the Sanger group is doing is trying to improve the amplification step to get better amplification of the Plasmodium component," Gardner noted. "The Broad group is taking a slightly different approach. They're trying to extract the AT-rich DNA before making the library."
While Plasmodium species have some of the most AT-rich genomes, they are not the only organisms with a bounty of these bases in their genomes. Some bacteria, such as Staphylococcus aureus, have higher-than-usual levels of these two nucleotides as well. And the zebrafish, a model organism often used in developmental studies, has large AT-rich swaths in its genome that pose a challenge for those sequencing it.
"There are other genomes that overall … do have large islands of extreme AT-rich content," Quail said.
He and his colleagues are continuing to use a sequencing pipeline that is centered around the Illumina platform.
Another Sanger team is currently exploring the use of the PacBio RS for P. falciparum sequencing, though they believe the system would need to offer higher yield before that becomes a practical approach (see related story, this issue).
Quail speculated, however, that there will continue to be a need for strategies to efficiently amplify complex collections of DNA fragments in a range of molecular biology studies, even if researchers begin to rely more heavily on sequencing instruments that don't require DNA amplification, such as the PacBio single-molecule platform.