A systematic analysis of genome re-sequencing data generated with the Ion Torrent PGM, published in PLoS Computational Biology earlier this month, is providing details about the insertion and deletion errors that can appear in these reads and potential bioinformatics strategies for dealing with them.
Australian researchers re-sequenced the genomes of two microbes, Sulfolobus tokodaii and Bacillus amyloliquefaciens, using comparisons to the corresponding reference sequences to look at error profiles in PGM reads.
As part of its effort to characterize both base and flow level errors in PGM reads, the team considered two PGM instruments, three associated sample preparation kits, and two PGM chip densities.
The data, generated last year, pointed to two predominant error types: over- or under-calls of bases at homopolymer sites, where the same nucleotide occurs multiple times in a row, and harder-to-predict high-frequency indels that tend to turn up at certain sites in a given sequence.
The researchers also saw some differences in error profiles depending on the flow cycle considered — an insight that may be important not only for designing error correction models but also for developing accurate variant detection algorithms, according to the study's first author, Lauren Bragg.
The published analysis hinges on PGM data generated last year, she noted, though the group has since done some preliminary work with newer 300-base pair and 400-base pair kits and plans to continue assessing updated PGM technology as it becomes available.
"They are still under development, and there has been marked improvement in the platform through the iterations that we studied," Bragg told In Sequence.
"We're going to continue benchmarking the new datasets and the software tools," added Bragg, who was a PhD student affiliated with the University of Queensland and Australia's Commonwealth Scientific and Industrial Research Organization, or CSIRO, when the research was performed. She is currently based full-time at CSIRO.
For his part, Monsanto Genome Analysis Center researcher Todd Michael said the broad conclusions of the analysis resemble previously published platform comparisons (see IS 4/9/2013, IS 4/24/2012) as well as findings from his own group's efforts to characterize PGM sequence data for genome sequencing and assembly applications (IS 1/31/2012).
"When we originally worked on the PGM we thought it was really going to get into our space of doing very quick de novo assembly, which is a big need that we have," Michael noted. As it turned out, though, "that was difficult because of the systematic indels and errors that we found," he said.
His group did not dig into the nature of these errors — or potential informatics-based solutions — in as much detail as that described in the new study, Michael said, noting that he was impressed with the Australian team's proposed models for dealing with PGM errors.
"This was an excellent dig into the error profile of our technology and a wonderful, detailed analysis of our technology," Mike Lelivelt, Ion Torrent's director of bioinformatics and software products, told IS.
But while Lelivelt praised the thorough nature of the analysis and noted that the findings more or less jibe with analyses done in house in the past, he also called the versions of the technology used in the study dated.
In particular, Lelivelt noted that since the data for the study were generated, Ion Torrent has made multiple improvements to its wet chemistry methods, chip technology, and software tools, in an effort not only to diminish the types of errors described in the study but also to stretch out PGM read lengths.
He added that some of the same bioinformatics strategies suggested in the new paper have already been incorporated into newer versions of the PGM software, based on the company's own independent analyses.
"We would love for them to re-examine the study with some of the current chips, kits, and software that we have," Lelivelt said.
In general, the Ion Torrent platform measures changes in ion concentration to determine DNA sequence. After an immobilized DNA sequence of interest is amplified by emulsion PCR, nucleotides are passed over the template DNA in a pre-determined order.
When a nucleotide meets its match on the template DNA strand, it interacts with that base and gets incorporated into a complementary DNA strand by a polymerase enzyme — a process that releases protons that can be detected by the sequencing system.
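The flow-based scheme described above can be sketched in a few lines of code. This is an illustrative simulation only, not Ion Torrent's actual software: it shows how a cyclic, pre-determined flow order reads a template strand, with each flow incorporating the entire run of matching bases at once, so that a homopolymer is read in a single flow.

```python
def simulate_flows(template, flow_order="TACG"):
    """Return a list of (flowed_base, incorporation_count) pairs.

    Illustrative sketch: the flow order here is assumed, not the
    instrument's actual sequence of reagent flows.
    """
    # The polymerase synthesizes the complement of the template, so a
    # flowed base is incorporated where it complements the template.
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    synth_target = "".join(complement[b] for b in template)

    flows = []
    pos = 0
    flow_idx = 0
    while pos < len(synth_target):
        base = flow_order[flow_idx % len(flow_order)]
        count = 0
        # A homopolymer run is consumed entirely within one flow.
        while pos < len(synth_target) and synth_target[pos] == base:
            count += 1
            pos += 1
        flows.append((base, count))
        flow_idx += 1
    return flows

# Example: template ATTTG is read via its synthesized complement TAAAC.
print(simulate_flows("ATTTG"))  # [('T', 1), ('A', 3), ('C', 1)]
```

Because each flow reports an incorporation count rather than a single base, the instrument's main burden is calling that count correctly — which is why homopolymer lengths are the natural weak point of the chemistry.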
The current analysis of PGM read data stemmed from the Australian team's interest in hammering out protocols for targeted amplicon sequencing-based microbial ecology studies on the PGM, Bragg noted, an approach expected to be faster, more affordable, and more scalable than the Roche 454 platforms traditionally used for such studies.
For their first steps in that direction, the researchers set out to get a better sense of the error profile associated with PGM reads with an eye to coming up with appropriate strategies for dealing with the data in the context of their microbial ecology studies.
"For microbial ecology, we amplify specific genes and use this to infer phylogeny," Bragg said.
"These amplicons are really sensitive to error within the reads," she explained, "so we commonly use bioinformatic algorithms to correct the errors in the reads so that we can get an accurate representation of the community's diversity."
Because such error correction tools rely on an intimate knowledge of the error models present in a sequencing platform's read data, Bragg and her colleagues set out to look at this in more detail by re-sequencing the genomes of microbial species for which reference genome sequences were available.
Three bugs were originally selected for the analysis based on their genome sizes, the availability of reference genome sequences, and their varying representation of guanine and cytosine, or GC, nucleotides.
The analyses described in the PLoS Computational Biology paper relied on reads for two of the microbes, S. tokodaii and B. amyloliquefaciens, since researchers ran into problems putting together a viable library for Deinococcus maricopensis, which has a very high GC representation in its genome.
The team generated sequence data on two PGM machines using three different sample preparation kits: a 100-base kit called the Ion OneTouch Template Kit and two 200-base kits, the Ion Xpress Template 200 Kit and the Ion OneTouch 200 Template kit.
Similarly, the data comparison considered reads that had been obtained from the PGM instrument running either the 200,000 read density chip, known as the 314 chip, or the higher density (one million read) 316 chip.
"We comprehensively examine the types of errors and biases in PGM-sequenced data across several experimental variables," the study authors explained, "including chip density, template kit, template DNA, and across two machines."
The group also considered both base level errors, meaning errors at each nucleotide within a given stretch of sequence, and flow level errors, which are related to the number of nucleotides that appropriately bind as bases pass over the template DNA during the sequencing process.
"People intuitively think of error rate at the base level because that's what they see coming out when the data's converted from the signal to nucleotide level," Bragg said.
But given the nature of the sequencing approach, she explained, flow level errors can also occur when an incorrect number of nucleotides is incorporated onto a given DNA template as each of the four nucleotides flows over the DNA.
Indeed, the group did see variation in error rate between flow cycles that "wouldn't really have been detected at the base level," Bragg noted.
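The distinction between the two error levels can be made concrete with a small sketch (the data structures here are hypothetical, for illustration only): a single flow-level miscall at a homopolymer, such as a three-base signal called as two, surfaces at the base level as a one-base deletion.

```python
def flows_to_bases(flow_calls):
    """Expand (base, count) flow calls into a base sequence."""
    return "".join(base * count for base, count in flow_calls)

# Correct incorporation counts vs. a single undercall at the A homopolymer.
true_flows   = [("T", 1), ("A", 3), ("C", 1)]
called_flows = [("T", 1), ("A", 2), ("C", 1)]

print(flows_to_bases(true_flows))    # TAAAC
print(flows_to_bases(called_flows))  # TAAC  (one A deleted at the base level)
```

One miscalled flow count therefore maps to an indel in the base-space read, which is why analyses confined to base space can miss patterns, such as flow-cycle-dependent error rates, that are visible only in flow space.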
In general, researchers found that more than 93 percent of re-sequencing reads generated on the PGM platform mapped to the appropriate genome, with mean read quality approaching a Q-score of 33.
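For context, a Phred quality score Q corresponds to an error probability of 10 raised to the power of -Q/10, so a mean quality near Q33 implies roughly one expected error per 2,000 bases. A quick back-of-envelope check:

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to its implied error probability."""
    return 10 ** (-q / 10)

p = phred_to_error_prob(33)
print(round(p, 6))       # ~0.000501, i.e. about 0.05 percent
print(round(1 / p))      # about one expected error per ~1,995 bases
```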
The investigators saw a slight sequence bias against parts of the genome with low GC content and a stronger bias against sequences with particularly high GC representation, Bragg noted.
For the most part, though, the errors they encountered were indels — particularly homopolymer errors related to flow call glitches and indels that appeared at higher-than-usual levels at particular sequence sites.
The latter errors, dubbed high-frequency indels, can occur often enough at a particular site that they look like genuine variants, Bragg noted, which may make them difficult to deal with informatically.
On the other hand, the group found that multiple datasets generated from the same DNA did not always include the insertions or deletions, hinting that it might be possible to get past these errors with sufficient replication.
"Some datasets had the same indel errors present but not all of them," Bragg said. "If you do replication, you might be able to resolve some of these high-frequency indels, but they are quite common."
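The replication idea Bragg describes could be implemented, in sketch form, as a simple recurrence filter: an indel site is kept only if it appears in a majority of replicate datasets, which discards high-frequency indels that show up in some runs but not others. The data structures and threshold below are assumptions for illustration, not the study's actual method.

```python
from collections import Counter

def consensus_indels(replicate_calls, min_fraction=0.5):
    """Keep indel sites seen in more than min_fraction of replicates.

    replicate_calls: list of per-replicate collections of indel positions
    (hypothetical representation of each dataset's calls).
    """
    counts = Counter(site for calls in replicate_calls for site in set(calls))
    n = len(replicate_calls)
    return {site for site, c in counts.items() if c / n > min_fraction}

# Three hypothetical replicate runs: site 1204 recurs in all three,
# while 880 and 3071 each appear in only one run.
reps = [{1204, 880}, {1204}, {1204, 3071}]
print(sorted(consensus_indels(reps)))  # [1204]
```

A filter like this trades sequencing cost (replicate libraries) for specificity; it would still pass a systematic error that recurs in every replicate, which is consistent with Bragg's caveat that high-frequency indels "are quite common."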
Overall, the team found high-frequency indels at roughly one of every 1,000 bases relative to the re-sequenced organisms' reference genomes, corresponding to 0.06 percent of bases in each read.
With the OneTouch 200 base kit, the homopolymer indel errors turned up at an average rate of 2.84 percent in the raw data. That dipped to 1.38 percent for quality corrected, clipped reads.
Based on preliminary analyses of reads from PGM kits producing 300 base or 400 base reads, Bragg said that there do seem to be improvements in the PGM reads, particularly in terms of homopolymer calls. In the most recent PGM data she's analyzed, though, high-frequency indels have continued turning up nearly as often as in the reads from older versions of the technology.
Still, the progressively longer read-length kits available for the PGM could be useful for amplicon sequencing applications if informatics tools can be found to effectively deal with PGM errors, and high-frequency indels in particular, she noted.
Indeed, Monsanto's Michael noted that while they are not currently using the PGM for genome sequencing and assembly studies, he and his colleagues are scaling up the PGM capacity that they have devoted to amplicon sequencing.
"When you're looking for something very specific, as in amplicon sequencing, the PGM performs quite well," he said, noting that the platform's turnaround time and ease-of-use are particularly appealing.
He noted that the instrument may prove useful for more routine genome re-sequencing in the future as well, since it seems feasible to work through PGM errors in situations where a sequence is somewhat known from the get-go and/or when the appropriate level of coverage is possible.
"I think it's still brilliant technology," he said, arguing that the current analyses represents the type of research that's needed to "push [the PGM technology] to the next level."