SAN DIEGO (GenomeWeb News) – During an American Society for Microbiology meeting session on sequencing in infectious disease research here yesterday, J. Craig Venter Institute infectious disease researcher William Nierman warned of the potential perils of relying on a single sequencing platform for forensics research that requires extremely high-quality microbial genomes.
To get the best possible microbial genomes, Nierman said, "you can't use a single platform … to get every call, you need a combination of [Roche] 454 and Illumina."
To illustrate the errors that can arise in assemblies based on sequence data from a single high-throughput platform, Nierman described his team's experience using Roche 454 technology to sequence the genome of a Yersinia pestis strain implicated in a University of Chicago geneticist's accidental death last year.
Y. pestis, the bacterial species behind the plague, is subject to the US Centers for Disease Control and Prevention "Select Agent" rules, Nierman explained, a designation reserved for organisms and toxins that are considered severe biological threats.
Even so, strains of Y. pestis that lack virulence factors are used in laboratories and are generally considered harmless to humans.
So when researcher Malcolm Casadaban died in mid-September 2009 — apparently from exposure to a seemingly harmless Y. pestis laboratory strain called KIM D27 — Nierman and his colleagues at JCVI's Genome Sequencing Center were among those called in to try to pin down telltale genetic features in the lethal strain.
The team initially used Roche 454 paired-end sequencing alone to tackle the Y. pestis KIM D27 genome — a sequencing task that, at first, appeared to be fairly straightforward. "We really saw this as an exercise in our response capability," Nierman said.
Using this approach, the team generated a draft genome sequence in less than three weeks of receiving bacterial DNA. After closing the genome and completing their preliminary genome assembly, the team shared their sequence data with Paul Keim, a pathogen genomics researcher with Northern Arizona University and the Translational Genomics Research Institute, who was sequencing a related, clinical Y. pestis strain.
Unexpectedly, though, comparisons showed that the JCVI-generated Y. pestis KIM D27 genome contained hundreds of SNPs and small insertions and deletions not found in sequences that Keim and his colleagues had generated using Illumina sequencing or in the Y. pestis reference genome.
The team's subsequent cross-examination of the sequence and assembly data suggested the problem lay in sequence errors in parts of the genome containing homopolymer repeats of adenine or thymine bases, Nierman noted. Of 465 SNP and indel errors examined, he said, 381 involved these poly(A) or poly(T) sequences.
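As a rough illustration of the kind of check described above, the following Python sketch flags candidate variant positions that fall inside runs of A or T bases. It is not the JCVI team's pipeline: the function names, the five-base run threshold, and the toy sequence are all assumptions made for the example. Only the 381-of-465 tally comes from Nierman's talk.

```python
import re

def homopolymer_runs(seq, bases="AT", min_len=5):
    """Return half-open (start, end) spans of single-base runs of at least min_len.

    The min_len of 5 is an illustrative threshold, not a value from the talk.
    """
    pattern = re.compile("|".join(f"{b}{{{min_len},}}" for b in bases))
    return [m.span() for m in pattern.finditer(seq.upper())]

def in_homopolymer(pos, runs):
    """True if the 0-based position lies inside any of the given runs."""
    return any(start <= pos < end for start, end in runs)

# Toy sequence (hypothetical): one poly(A) run and one poly(T) run.
seq = "GCGTAAAAAAACGTCGTTTTTTGCAGC"
runs = homopolymer_runs(seq)

# Hypothetical candidate variant positions (0-based).
candidates = [1, 7, 12, 18, 25]
flagged = [p for p in candidates if in_homopolymer(p, runs)]

# Share of the examined errors that Nierman attributed to poly(A)/poly(T) tracts.
reported_fraction = round(381 / 465, 3)

print(runs)
print(flagged)
print(reported_fraction)
```

On the figures Nierman cited, roughly 82 percent of the examined errors sat in such tracts, which is why a filter of this sort would remove most of the false calls while leaving variants elsewhere in the genome untouched.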
On the other hand, when they added data from one lane of Illumina sequencing, used Sanger sequencing to fill in sequence gaps, and re-assembled the Y. pestis genome, the researchers found fewer than 60 SNPs and indels in the newly sequenced KIM D27 strain.
After further refining their data and looking only at regions with at least 40-fold coverage, the team verified that the strain was missing a 102,000-base-pair pgm locus containing known Y. pestis virulence genes. But the KIM D27 strain also contains changes not found in typical, harmless lab or reference strains, including a repeat expansion and two SNPs, Nierman explained.
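Restricting the analysis to well-covered regions can be pictured with a short Python sketch. The 40-fold cutoff is the one Nierman mentioned; the data layout (a position-to-depth dictionary and variants stored as position, reference base, alternate base) and all of the values are hypothetical.

```python
def filter_by_depth(variants, depth, min_depth=40):
    """Keep variants whose position has read depth of at least min_depth.

    variants: list of (position, ref_base, alt_base) tuples (assumed layout).
    depth: dict mapping position -> read depth; missing positions count as 0.
    """
    return [v for v in variants if depth.get(v[0], 0) >= min_depth]

# Hypothetical per-position read depths and candidate variants.
depth = {1200: 55, 3400: 12, 9100: 40}
variants = [
    (1200, "A", "G"),
    (3400, "T", "C"),
    (9100, "G", "A"),
    (7000, "C", "T"),  # no depth recorded, so it is treated as uncovered
]

kept = filter_by_depth(variants, depth)
print(kept)
```

In this toy case only the calls at positions 1200 and 9100 survive, since the others sit in regions with too few reads to support a confident call.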
Based on their experience with the Y. pestis genome, the researchers also decided to go back and look at genome data for other sequenced bacterial species, including Escherichia coli, Porphyromonas gingivalis, and Mycobacterium tuberculosis, Nierman added.
In general, they found that high-coverage genomes had much lower homopolymer-related error rates, Nierman said. Still, he noted, the Y. pestis findings suggest researchers may have to choose between getting a good genome at a low cost and achieving the best possible genome.
For microbial forensics, where both speed and quality are extremely important, he explained, that may mean pairing 454 and Illumina sequencing with multiple assemblers and SNP detection tools, particularly since even a handful of false SNPs or indels can distract researchers from authentic variants.
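One simple way to combine the output of several SNP detection tools is to keep only the calls that more than one tool reports. The Python sketch below shows that idea under stated assumptions: the caller names, the two-tool support threshold, and the variant tuples are invented for illustration and do not describe the tools or settings the JCVI team used.

```python
from collections import Counter

def consensus_calls(call_sets, min_support=2):
    """Return variants reported by at least min_support of the call sets.

    call_sets: iterable of collections of (position, ref_base, alt_base) tuples,
    one collection per SNP detection tool (assumed layout).
    """
    counts = Counter(v for calls in call_sets for v in set(calls))
    return sorted(v for v, n in counts.items() if n >= min_support)

# Hypothetical output from three SNP callers run on the same assembly.
caller_a = [(1200, "A", "G"), (5150, "T", "-"), (9100, "G", "A")]
caller_b = [(1200, "A", "G"), (9100, "G", "A")]
caller_c = [(1200, "A", "G"), (7777, "C", "T")]

agreed = consensus_calls([caller_a, caller_b, caller_c])
print(agreed)
```

Here the calls at positions 5150 and 7777 each come from a single tool and are dropped, leaving the two variants that at least two callers agree on. A stricter analysis could raise `min_support` to require agreement from every tool.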