Scientists at Washington University School of Medicine and Boston College have shown that Illumina’s Genome Analyzer can be used to accurately discover polymorphisms in eukaryotic genomes, a development that could embolden researchers to use the technology in whole-genome resequencing projects.
The researchers, who performed their study on C. elegans, showed that despite the GA’s short reads, the platform can be used to resequence complex genomes and identify genetic polymorphisms including SNPs, small insertions, and small deletions.
Their work, which appeared online on Sunday in Nature Methods, is also believed to be the first peer-reviewed study that includes paired-end reads generated with Illumina’s sequencing platform.
The goal of the project was to use Illumina’s Genome Analyzer to resequence the genome of a model organism whose genome is larger than that of yeast but smaller than the human genome, according to study author Gabor Marth, an assistant professor in the department of biology at Boston College.
The project had two parts. First, the researchers resequenced the C. elegans reference strain N2 Bristol, whose genome had already been sequenced using Sanger technology. “There, the goal was to evaluate how accurate the reference genome is, and how accurate the [Genome Analyzer’s] reads are,” Marth said.
The researchers aligned data from 3.5 single-end runs performed at Wash U’s Genome Sequencing Center, along with “much lower coverage” of paired-end reads generated at Illumina, to the reference genome and found several differences between their strain isolate and the original strain.
The team traced some of these differences to mutations that evolved in the strain during its time in the lab. Others were the result of sequencing errors in the original Sanger-generated genome sequence, which the researchers reported to WormBase for correction.
Marth said that the error rate of the Illumina technology was low, but cautioned that error rates depend on factors such as how alignments are performed. “There are different ways, there is not really a fair way to count sequencing errors,” he said. “But overall, the sequencing accuracy [of the Illumina technology] is very high; [the error rate] is definitely less than 1 percent.”
Marth also pointed out that since the researchers submitted their results for publication in September 2007, the technology has improved “significantly” thanks to better engineering, chemistry, and base calling, and “much improved” paired-end read protocols.
In the second part of their study, the scientists used Illumina’s platform to resequence another C. elegans strain. This step of the study was designed to assess the ability of Illumina’s technology to discover polymorphisms between this strain and the reference strain.
In order to avoid ambiguous read alignments, the researchers masked repetitive regions of the genome, which comprised 23 percent of the DNA. They determined these regions by using the RepeatMasker program and by searching the genome for so-called microrepeats, or repetitive 32mers, with up to two mismatches.
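The microrepeat search can be illustrated with a minimal sketch. It is not the researchers' implementation: it flags only exact repeated k-mers on one strand, whereas the study allowed up to two mismatches, and the function name and the choice to mask with 'N' are assumptions made for illustration.

```python
from collections import defaultdict


def mask_exact_repeats(seq: str, k: int = 32) -> str:
    """Replace with 'N' every base covered by a k-mer that occurs more than once.

    Simplification (assumption): exact matches only, forward strand only.
    The study described in the article tolerated up to two mismatches.
    """
    # Record every start position of every k-mer in the sequence.
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    # Mask all occurrences of any k-mer seen at two or more positions.
    masked = list(seq)
    for starts in positions.values():
        if len(starts) > 1:
            for s in starts:
                masked[s:s + k] = "N" * k
    return "".join(masked)


# Toy example with k=5 instead of 32 so the repeat is easy to see.
print(mask_exact_repeats("ACGTTTGCAGGTTGCAC", k=5))
```

Reads are then aligned only to the unmasked portion, so a read cannot be placed ambiguously in two copies of the same repeat.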
Using 1.5 single-read runs on Illumina’s platform to sequence the C. elegans strain, the researchers obtained nine-fold coverage that was “not enough to completely cover every base to the quality that you can necessarily call SNPs, but it was enough to cover a large fraction of the genome, probably over 85 percent of the sequenceable part of the genome,” Marth said.
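A rough calculation shows why average coverage alone does not guarantee that every base can be called. The sketch below assumes read depth at each position follows a Poisson distribution, an idealization that the article does not attribute to the study, and the minimum-depth thresholds are hypothetical rather than the ones the researchers used.

```python
import math


def fraction_with_min_depth(mean_coverage: float, min_depth: int) -> float:
    """Fraction of positions covered by at least `min_depth` reads.

    Assumes (idealization) that per-base depth is Poisson-distributed
    with mean `mean_coverage`.
    """
    # Probability of seeing fewer than min_depth reads at a position.
    below = sum(
        math.exp(-mean_coverage) * mean_coverage ** d / math.factorial(d)
        for d in range(min_depth)
    )
    return 1.0 - below


# Hypothetical thresholds at the nine-fold mean coverage quoted above.
for depth in (1, 4, 6, 8):
    print(depth, round(fraction_with_min_depth(9.0, depth), 3))
```

Under this model nearly every base is touched by at least one read, but the fraction falls as the required depth rises, which is consistent with the point that some positions lack enough reads for confident SNP calls.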
The researchers aligned the Illumina reads to the masked genome using a new alignment and assembly program called Mosaik that was developed by a graduate student in Marth’s group.
To call polymorphisms, they used a modified version of PolyBayes, an existing program they improved so it runs faster and copes with millions of sequence reads.
The researchers found approximately 45,000 SNPs and 7,000 short insertions or deletions. Their findings corresponded to the SNP rate suggested by an earlier study of the C. elegans strain, and about 96 percent of the SNPs chosen for validation were confirmed.
Interestingly, the researchers found that the rate of insertion and deletion errors was “incredibly low,” with such errors making up less than 2 percent of all errors, according to Marth.
“It’s quite different from, say, 454, where most of the errors are indels,” he said. Because of that, Illumina’s technology is very good at discerning short indels, maybe even better than Sanger technology, he said. “For finding short, 1- to 2-base pair indels, this is really ideal.”
Based on these results, Marth said he believes that Illumina’s technology is “very, very suitable” for whole-genome resequencing projects and “is going to be the choice for the new human resequencing projects” (see feature article in this issue).
Other Genome Analyzer users agree with the findings of Marth and his colleagues. Stephen Kingsmore, president of the National Center for Genome Resources of Santa Fe, NM, said that their results “are in broad agreement with ours on the same instrument.”
“I agree totally with the paper’s conclusion that the short-read technologies have utility ‘for the accurate discovery of both single-nucleotide and small insertion-deletion polymorphisms,’” Kingsmore told In Sequence by e-mail. He added that his group hopes to publish a study showing similar SNP discovery data.
Since the C. elegans project was completed, Marth and his colleagues have used the Genome Analyzer in a different whole-genome resequencing project, aimed at complete mutational profiling.
For example, researchers who have mutagenized a model organism and found a phenotypically interesting mutant might want to know where the mutation that caused the phenotype occurred. Whole-genome resequencing can pinpoint the mutation, but “what this requires is that you don’t miss mutations, and you don’t have too high a false-positive rate,” Marth said.
In one such study of a “yeast-like” organism, he and his colleagues have compared all three available next-gen sequencing technologies, from 454, Illumina, and ABI.
“At about 15x coverage, you can basically [find all the point mutations] with any of the technologies,” Marth said. “Some [require] a little less, some a little more [coverage], but overall there is not a big difference between the technologies.”
Marth said he and his colleagues also plan to evaluate other next-generation sequencing technologies that are still in development, such as those from Helicos BioSciences and Pacific Biosciences.