Researchers from Tel Aviv and Pennsylvania State Universities have developed a new tool called FlowgramFixer that they claim calls bases from flowgrams generated by the Ion Torrent Personal Genome Machine more accurately than the current algorithm used for this purpose.
David Golan, a doctoral student in Tel Aviv University's statistics department and one of FlowgramFixer's developers, presented the tool during one of the sessions at the recent Intelligent Systems for Molecular Biology conference in Berlin, Germany.
He and his co-developer, Paul Medvedev, an assistant professor in PSU's departments of computer science and engineering, and biochemistry and molecular biology, also published a paper in a recent issue of Bioinformatics that described the freely available software in detail. They claim that FlowgramFixer calls bases from PGM's flowgram files with fewer errors and generates more uniquely aligned and higher-quality reads than the default base calling algorithm that is implemented in the Torrent suite software. This means fewer errors further downstream during the variant calling step where miscalled bases "pose challenges" particularly for resequencing projects, "where they can be confused with SNPs."
To highlight the improvements FlowgramFixer offers and to contrast its approach with the one used by Ion Torrent, Golan and Medvedev begin their paper by exploring the underlying rationale for the base-calling algorithm in PGM's software suite.
At each sequencing step during the PGM's run, a chip carrying the sequence is washed with a specific nucleotide (A, C, G, or T), and any complementary nucleotides are incorporated, releasing electric signals. These signals make up the flowgram, whose values, ideally either 0 or 1, indicate whether or not a base was incorporated. In cases where values deviate, the Ion Torrent algorithm rounds them to the nearest integer. "In essence, it is a memory-less algorithm that makes a call for each flow independent of information from previous or following flows," the researchers wrote.
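The flow-by-flow rounding described above can be illustrated with a minimal Python sketch. It is not Ion Torrent's code: the cyclic flow order `TACG`, the function name `call_bases_by_rounding`, and the sample signal values are all assumptions made for illustration.

```python
FLOW_ORDER = "TACG"  # assumed cyclic flow order, for illustration only


def call_bases_by_rounding(signals, flow_order=FLOW_ORDER):
    """Call bases one flow at a time by rounding each signal.

    Each flow is handled independently of its neighbors, which is
    what makes this style of base calling "memory-less".
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Round half up to the nearest non-negative integer.
        count = max(0, int(signal + 0.5))
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical noisy signals for the flows T, A, C, G.
noisy = [0.9, 0.1, 1.2, 0.4]
print(call_bases_by_rounding(noisy))  # prints "TC"
```

In this sketch the 0.4 signal on the G flow is simply discarded, regardless of what the surrounding flows suggest.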
The problem, however, is that "rounding the signal flow-by-flow might result in an ‘impossible’ sequence of signals," they argue. The Ion Torrent algorithm also doesn't take into account that "the probability of observing an incorporation event depends on the incorporation signals of previous and next flows," they wrote.
For example, a flowgram from the PGM might contain noisy signal values such as 0.2 or 0.8 instead of the 0 or 1 that would indicate whether or not a nucleotide was incorporated, Golan explained to BioInform. In these cases, the Ion Torrent algorithm basically rounds up or down to the nearest integer. However, this means that a lot of information is lost in the process, he said. If, for instance, the A, C, and G nucleotides are not incorporated at a particular position during a wash cycle, then that nucleotide has to be a T, so "even if the signal is 0.4 instead of rounding it to 0 you should round it to 1 because there has to be a T."
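Golan's point can be sketched in Python under a simplifying assumption: with a simple cyclic flow order (here `TACG`, an assumption), the three flows that follow an incorporation cover the three other nucleotides, so they cannot all be empty if the read continues. The greedy repair below is only an illustration of that constraint, not FlowgramFixer's method, and the function names and signal values are hypothetical.

```python
FLOW_ORDER = "TACG"  # assumed simple cyclic flow order (illustrative)


def round_calls(signals):
    """Memory-less rounding to the nearest non-negative integer."""
    return [max(0, int(s + 0.5)) for s in signals]


def repair_impossible_gaps(signals, n_nucleotides=len(FLOW_ORDER)):
    """Promote the strongest signal in any impossible run of empty flows.

    Only gaps between two incorporations are checked; the start and
    end of the read are left alone in this toy version.
    """
    calls = round_calls(signals)
    positives = [i for i, c in enumerate(calls) if c > 0]
    if not positives:
        return calls
    last = positives[-1]
    prev = positives[0]
    i = prev + 1
    while i <= last:
        if calls[i] > 0:
            prev = i
        elif i - prev == n_nucleotides - 1:
            # Flows prev+1..i are all empty yet cover every other
            # nucleotide, so one of them must hold a base.
            best = max(range(prev + 1, i + 1), key=lambda j: signals[j])
            calls[best] = 1
            prev = best
            i = best
        i += 1
    return calls


def to_bases(calls, flow_order=FLOW_ORDER):
    return "".join(flow_order[i % len(flow_order)] * c
                   for i, c in enumerate(calls))


signals = [1.1, 0.1, 0.4, 0.2, 0.9]  # hypothetical flows T, A, C, G, T
print(to_bases(round_calls(signals)))             # prints "TT"
print(to_bases(repair_impossible_gaps(signals)))  # prints "TCT"
```

Plain rounding yields two T calls from separate flows with nothing in between, which cannot happen under the assumed flow order because adjacent Ts would have been incorporated together. The repair promotes the 0.4 signal on the C flow to a call.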
A third issue with the approach is that "rounding signals ignores other previous information regarding the genome, such as GC-content and the lower frequencies of longer homopolymers," the researchers wrote.
FlowgramFixer, on the other hand, works on the assumption that "the signals of neighboring flows carry considerable mutual information and are important in making the correct base-call." As a result, it calls bases "at a read-wide level, rather than one flow at a time," the Bioinformatics paper states. Specifically, it uses a state machine and two dynamic programming algorithms, a Viterbi algorithm and a forward algorithm, to find the nucleotide sequence that most likely explains the observed flowgram.
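To make the idea of read-wide calling concrete, here is a toy Viterbi decoder in Python. It is a sketch under stated assumptions, not the model from the paper: the three-state machine (number of empty flows since the last incorporation), the Gaussian noise with `sigma=0.25`, the incorporation probability `p_inc=0.4`, the uniform start, and the restriction to at most one base per flow are all invented for illustration.

```python
import math

# Toy state = number of consecutive empty flows since the last
# incorporation (0 means "a base was incorporated in this flow").
STATES = (0, 1, 2)


def log_emission(state, signal, sigma=0.25):
    """Log-likelihood (up to a constant) of a signal under Gaussian noise."""
    mean = 1.0 if state == 0 else 0.0
    return -((signal - mean) ** 2) / (2 * sigma ** 2)


def log_transition(prev, cur, p_inc=0.4):
    """Assumed transitions: after two empty flows, the next must incorporate."""
    if cur == 0:
        return 0.0 if prev == 2 else math.log(p_inc)
    if cur == prev + 1:
        return math.log(1 - p_inc)
    return -math.inf


def viterbi_calls(signals):
    """Return the most likely 0/1 incorporation call for every flow."""
    if not signals:
        return []
    # Uniform start over states (a simplification of this sketch).
    score = {s: log_emission(s, signals[0]) for s in STATES}
    backpointers = []
    for signal in signals[1:]:
        new_score, pointer = {}, {}
        for cur in STATES:
            best_prev = max(
                STATES, key=lambda p: score[p] + log_transition(p, cur))
            new_score[cur] = (score[best_prev]
                              + log_transition(best_prev, cur)
                              + log_emission(cur, signal))
            pointer[cur] = best_prev
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return [1 if s == 0 else 0 for s in path]


print(viterbi_calls([1.1, 0.1, 0.4, 0.2, 0.9]))  # prints [1, 0, 1, 0, 1]
```

Because the decoder scores the whole read at once, the ambiguous 0.4 signal is called as an incorporation: the toy state machine rules out three empty flows in a row, and 0.4 is the least costly of the three to promote.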
The result is fewer errors and more accurately mapped reads, according to the researchers. In some tests with Escherichia coli data that compared FlowgramFixer to the algorithm used by the Torrent suite, FlowgramFixer showed an improvement of between 2.8 and 4.8 percent in the number of high-quality mapped reads and a 7.1 percent improvement in the number of uniquely mapped reads, the paper states.
Other benefits of FlowgramFixer besides improved accuracy include low memory requirements and the ability to run tasks in parallel, enabling it to complete its analyses in "a matter of minutes." The tool can also be integrated into Ion Torrent's software, the researchers wrote, where it could help improve the quality of the phase correction step.
Golan told BioInform that he and Medvedev have reached out to Ion Torrent regarding possibly assimilating FlowgramFixer into the Torrent suite but "we [haven't had] any progress with that" so far.
That may be because Ion Torrent believes it has come up with a way to address the loss of information that happens during the base calling step that it has implemented in the Torrent suite software. Mike Lelivelt, the director of bioinformatics and software products for Life Technologies' Ion Torrent business unit, told BioInform that the company has tackled the problem by improving its mapping and variant calling procedures using a similar process to FlowgramFixer's.
He explained that the company changed the primary file format used in its software from Standard Flowgram Format, which is used to encode the results of pyrosequencing data, to the BAM file format. This makes it possible to provide both the base calls and "a representation" of the raw "unrounded" flowgram data that provides "richer" read level information for calling variants, he said. The company has also implemented a new algorithm, called the Torrent Variant Caller, which is used to analyze the flowgram data during the mapping and variant calling steps.
These changes, he said, are implemented in versions of the Torrent suite that postdate the one used by Golan and Medvedev for their study. In his comments on the paper, Lelivelt said the company was aware of the study as it developed and he applauded the researchers' efforts to improve the base calling accuracy. However, he believes that Ion Torrent's approach addresses the problem Golan and Medvedev sought to solve. Since the Torrent suite now has access to both base calls and read level information when it calls variants, accurate base calling early on in the process "is less critical," he said.
FlowgramFixer's developers are working on improvements to the software. For example, they are tailoring FlowgramFixer to work with the new file format, Golan said. They are also creating a more user-friendly version of FlowgramFixer, which will be released at some point in the future. To use the tool as it is right now requires some technical skill "so anyone interested in running it is more than welcome to contact us and we'll be happy to help," he said.
One user, Carlos Peña, a postdoctoral researcher in the University of Turku's biology department, told BioInform in an email that experienced Linux users will find FlowgramFixer "quite easy to install and run if you follow the instructions on the software's website." However, "if you use Windows only, you will suffer."
He also highlighted an area for improvement. "I could use FlowgramFixer in my data if the software was able to keep track of quality value data for each Ion Torrent read. Then I could do the assembly of reads into amplicons," he said. "So far, FlowgramFixer gives you a list of sequences without sequence ID and without quality data." Peña said he has requested that this feature be added and that the developers plan to include it in a future release of the tool.