Researchers at Cold Spring Harbor Laboratory have developed a new base caller for the Illumina Genome Analyzer and have used it to increase the number of accurate long reads per run.
The Alta-Cyclic base caller enabled the researchers to increase the number of accurate 78-base reads in a run from 5 percent to 22 percent. These reads, they say, may allow them to assemble complex eukaryotic genomes de novo.
The researchers set out to improve the base caller that comes with the Illumina GA because the accuracy and length of the reads the instrument produced were not sufficient for their needs. “In addition … it was just very interesting to open the black box that’s called ‘Illumina’ and see what we have there,” said Yaniv Erlich, a graduate student in Greg Hannon’s lab and the first author of a paper describing the tool that appears online in Nature Methods this month.
“Once we did that and identified an error model, then we realized that we can actually improve it using supervised learning,” Erlich said.
Both Erlich and co-author Partha Mitra have a background in communication engineering, which they applied to the project. “In wireless [communication], you often have a channel through which a signal passes and gets corrupted, but you can send pilot signals, and you can learn the channel, and then you can undo the corruption that happened to your signal,” explained Mitra, a professor of biomathematics at Cold Spring Harbor.
In a similar manner, the scientists use a known viral reference genome as a “pilot signal” in each run in order to train the system to read the “signal” from the actual samples accurately.
The main difference between Alta-Cyclic and other base callers is that it generates a new profile of signal distortions for every run, rather than relying on historical data, according to Erlich. “I think it’s fair to say that so far, our method is the most adaptive of the ones that are out there,” Mitra said.
Initially, the researchers identified three main noise factors of the Illumina system. One factor, phasing, results from some strands within a DNA cluster growing faster or more slowly than the others due to errors in the chemistry cycles.
The second factor is a change in cross-talk between fluorophores, which leads to “a substantial bias toward certain base calls in later cycles,” according to the researchers.
The third factor is a loss of material in each cycle that leads to a decay in fluorescent signal intensity, or fading, over time.
Based on these three factors, the researchers built a model that describes the signal distortion as a function of the sequencing cycle, and found that they can compensate both for phasing and changes in cross-talk but not for the loss of material.
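To make the idea of distortion as a function of the sequencing cycle concrete, here is a minimal toy sketch of two of the three factors, phasing and fading. It is not the model from the paper: the function names and the `lag`, `lead`, and `decay` parameters are hypothetical, and the numbers are chosen only for illustration.

```python
BASES = "ACGT"

def simulate_cluster(template, lag=0.02, lead=0.01, decay=0.01):
    """Toy per-cycle intensities for one cluster (hypothetical model).

    lag   -- assumed fraction of strands that fail to extend in a cycle
    lead  -- assumed fraction of strands that extend by two bases in a cycle
    decay -- assumed fraction of signal lost per cycle (fading)
    """
    n = len(template)
    # dist[k] = fraction of strands that have incorporated k bases so far.
    dist = [0.0] * (2 * n + 1)
    dist[0] = 1.0
    cycles = []
    for t in range(n):
        nxt = [0.0] * len(dist)
        for k, w in enumerate(dist):
            if w == 0.0:
                continue
            nxt[k] += w * lag                      # strand falls behind
            nxt[k + 1] += w * (1.0 - lag - lead)   # strand stays in phase
            nxt[k + 2] += w * lead                 # strand runs ahead
        dist = nxt
        fade = (1.0 - decay) ** t
        signal = {b: 0.0 for b in BASES}
        # A strand that has incorporated k bases shows template[k - 1];
        # strands that ran past the end of the template show nothing.
        for k in range(1, n + 1):
            signal[template[k - 1]] += dist[k] * fade
        cycles.append(signal)
    return cycles

def purity(signal):
    """Share of the total intensity carried by the brightest channel."""
    return max(signal.values()) / sum(signal.values())

if __name__ == "__main__":
    run = simulate_cluster("ACGT" * 5)
    print("purity, first cycle:", round(purity(run[0]), 3))
    print("purity, last cycle: ", round(purity(run[-1]), 3))
```

In this sketch the brightest channel carries a shrinking share of the signal as the cycles advance, which is the qualitative effect that makes late-cycle base calls harder.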
Using these results, they designed the Alta-Cyclic base caller. During a training stage, the software analyzes data from a known viral genome and compares it with the sequence. It “looks at the signal and tries to see which parameters to choose in order to get the correct answer from the intensity files,” Erlich explained. “Then it takes these parameters and uses them to call the bases from the other lanes” containing unknown samples.
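The train-then-apply workflow can be sketched with a deliberately simplified example. The code below is an illustration under stated assumptions, not Alta-Cyclic's algorithm: it invents a single hypothetical distortion (signal from T bleeding into the G channel, growing with the cycle number), fits one correction parameter by grid search on a lane with known sequence, and then reuses that parameter to call a second lane. All names, including `bleed` and `train`, are made up for the example.

```python
import random

BASES = "ACGT"

def random_reads(n, length, seed):
    rng = random.Random(seed)
    return ["".join(rng.choice(BASES) for _ in range(length)) for _ in range(n)]

def simulate_lane(reads, bleed, noise=0.05, seed=0):
    """Toy intensities: the G channel gains bleed * cycle whenever the base is T."""
    rng = random.Random(seed)
    lane = []
    for read in reads:
        cycles = []
        for t, base in enumerate(read):
            obs = {b: (1.0 if b == base else 0.0) + rng.gauss(0.0, noise)
                   for b in BASES}
            if base == "T":
                obs["G"] += bleed * t  # assumed cycle-dependent cross-talk
            cycles.append(obs)
        lane.append(cycles)
    return lane

def call_read(cycles, bleed_hat):
    """Subtract the estimated bleed from G, then pick the brightest channel."""
    called = []
    for t, obs in enumerate(cycles):
        fixed = dict(obs)
        fixed["G"] = obs["G"] - bleed_hat * t * obs["T"]
        called.append(max(BASES, key=lambda b: fixed[b]))
    return "".join(called)

def accuracy(lane, truth, bleed_hat):
    """Fraction of bases called correctly across all reads in a lane."""
    hits = sum(c == b
               for cycles, read in zip(lane, truth)
               for c, b in zip(call_read(cycles, bleed_hat), read))
    return hits / sum(len(read) for read in truth)

def train(reference_lane, reference_reads, grid):
    """Choose the grid value that best recovers the known reference reads."""
    return max(grid, key=lambda g: accuracy(reference_lane, reference_reads, g))

TRUE_BLEED = 0.04  # property of this simulated run; never shown to the caller
reference_reads = random_reads(100, 50, seed=1)  # known "pilot" sequence
sample_reads = random_reads(100, 50, seed=2)     # used here only for scoring
reference_lane = simulate_lane(reference_reads, TRUE_BLEED, seed=3)
sample_lane = simulate_lane(sample_reads, TRUE_BLEED, seed=4)

grid = [i / 100 for i in range(11)]
bleed_hat = train(reference_lane, reference_reads, grid)
naive_acc = accuracy(sample_lane, sample_reads, 0.0)
trained_acc = accuracy(sample_lane, sample_reads, bleed_hat)

if __name__ == "__main__":
    print("learned parameter:", bleed_hat)
    print("accuracy without correction:", naive_acc)
    print("accuracy with learned correction:", trained_acc)
```

Because the parameter is fit again from the reference lane of each run, a run with a different amount of bleed would yield a different correction, which mirrors the per-run profiling the researchers describe.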
In a 78-cycle run, the researchers were able to increase the number of error-free reads more than fourfold, from 600,000 using the Illumina base caller to 2.6 million using Alta-Cyclic.
But that read length is not the limit, according to Erlich. “The reason that we have 78 cycles is that we ran out of disk space,” he said, adding that he and his colleagues are now writing a program that can take image files and store them in different locations “so we can get more cycles.”
Alta-Cyclic is most effective at improving the accuracy of base calls in later cycles, or at the ends of long reads. “For very short read lengths, our method and the existing method both do equally well,” Mitra said.
Having more accurate long reads improves SNP calling and may enable the researchers to assemble complex genomes de novo. Erlich said he and his colleagues are planning to de novo sequence a plant, an animal, and a protist.
Other research groups, including scientists at the Broad Institute, have worked on methods to improve the quality of base calls for the 454 Genome Sequencer and the Illumina GA (see In Sequence 3/25/2008), but “this is a full-fledged base caller,” said Gabor Marth, an assistant professor of biology at Boston College. His group recently published an improved base caller for the 454 platform.
“This is a machine-learning approach, which is one very valid approach to base calling,” Marth said. “The results are quite encouraging,” he added, noting that “what I would have liked to see is a discussion of the accuracy of the base quality values.”
“We have to do more work on precisely quantifying the accuracy of base calls as a function of position, but we do have an estimate” based on an experiment on calling artificial SNPs that is described in the publication, according to Mitra.
The drawback of Alta-Cyclic, according to Marth, is the need to include a reference lane for training in each run, which takes time to analyze, reduces the output of “real” data per run, and consumes reagents.
However, the researchers are already working on ways to reduce the amount of training and speed up the algorithm by collecting more data.
“The training that we are doing now is very robust, but it’s very intensive,” Erlich said, adding that it runs overnight on a large computer cluster. “Once we have more data, we can try to reduce the training operations that we are doing right now.”
Though Alta-Cyclic is optimized for the Illumina GA, “the general method of taking a known signal — a reference genome — and using it in order to subtract noise can be applied to all other sequencers,” he said.
Alta-Cyclic is freely available for academic researchers, and the developers said they are negotiating with Illumina about its commercial use.
Illumina did not respond before deadline to requests for comment on whether it will adopt Alta-Cyclic for the GA or support the software for use with the system.