By Monica Heger
This article has been updated from a version posted July 8 to clarify that the HPA 454 assembly was de novo.
Pacific Biosciences, which remained quiet over the last month as every other next-gen sequencing vendor released sequencing data from the Escherichia coli O104 outbreak, has now sequenced the strain along with 11 other isolates.
The company said it performed a completely de novo assembly of the O104 strain using improved chemistry that it plans to incorporate into an upgrade for its PacBio RS system, which it plans to release in the fourth quarter.
Specifically, PacBio said it has extended its average read length to 2,900 bases and incorporated circular consensus sequencing to obtain a de novo assembly with 99.998-percent consensus accuracy. Sample prep through sequencing took on average fewer than eight hours for each sample, the company said.
The data, available through the company's website, is an improvement over what the system was obtaining at its April launch, as well as over the performance reported by early customers in May, who obtained average read lengths of between 1,400 and 1,500 bases (IS 5/17/2011).
PacBio, which was the first to sequence the Haitian cholera strain last year (IS 12/14/2010), is the last of the major sequencing companies to sequence the E. coli outbreak strain, which has served as a testing ground for new instruments like the Ion Torrent PGM and the Illumina MiSeq.
PacBio chief scientific officer Eric Schadt told In Sequence that the company has been busy with its commercial launch, but eventually decided to sequence the strain because he thought the company could contribute to the assembly.
The company launched the RS system this spring, shipping its first two instruments in May (IS 5/3/2011) following an early-access program at 11 customer sites.
With the release of the E. coli sequencing data, the company demonstrated improvements it has made to its chemistry since the launch — upgrades it plans to incorporate into the RS in the fourth quarter.
The PacBio team sequenced the outbreak strain as well as 11 other isolates, six of which were from the same serotype but had not been sequenced previously. For the outbreak strain, the researchers constructed two types of sequencing libraries: a long, 9-kilobase insert library and a circular consensus library.
For the 9-kilobase library, the team achieved average read lengths of 2,900 base pairs, and sequenced to 200-fold coverage. Because of the longer read lengths, throughput increased two-fold to 90 megabases per SMRT cell. In addition, 5 percent of the reads were longer than 5,100 bases, with the longest read around 22,000 bases.
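As a rough back-of-the-envelope check on those figures, the sketch below estimates how many SMRT cells 200-fold coverage would require at 90 megabases per cell. The genome size of about 5.3 megabases is an assumption that the company did not state here, and the function name is purely illustrative.

```python
import math


def smrt_cells_needed(genome_size_bases, target_fold_coverage, yield_per_cell_bases):
    """Estimate how many SMRT cells are needed to reach a target fold coverage."""
    total_bases = genome_size_bases * target_fold_coverage
    return math.ceil(total_bases / yield_per_cell_bases)


GENOME_SIZE = 5_300_000       # ASSUMPTION: roughly 5.3 Mb; not given in the article
TARGET_COVERAGE = 200         # 200-fold coverage, as reported
YIELD_PER_CELL = 90_000_000   # 90 megabases per SMRT cell, as reported

cells = smrt_cells_needed(GENOME_SIZE, TARGET_COVERAGE, YIELD_PER_CELL)
print(f"Estimated SMRT cells for {TARGET_COVERAGE}-fold coverage: {cells}")
```

Under that assumed genome size, the estimate works out to about a dozen SMRT cells. The actual number the team used was not disclosed.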
The long reads enabled a complete de novo assembly using only PacBio data, said Schadt.
"The hope was, in generating coverage, we would increase the number of super long reads to span the gaps and get to as complete an assembly as possible," he told In Sequence.
The long-read sequencing was done to generate scaffolds, but because it represents single-pass sequencing of a single molecule, there is a higher error rate than for short-read sequencing. Average raw-read accuracy was around 85 percent.
The team then employed circular consensus sequencing for error correction. Circular consensus sequencing involves creating shorter fragments, around 500 base pairs, and circularizing them.
For this library, average read lengths were 430 base pairs, and each circular fragment was sequenced on average six times, which yielded around 30-fold coverage of the entire genome.
After using the long-read data to generate scaffolds, the team layered on the data generated from the circular consensus sequencing, which averaged 97.8 percent accuracy.
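To put the accuracy figures on a common scale, the sketch below converts each percentage into a Phred-style quality score, using the standard definition Q = -10 * log10(error rate). The conversion is a general convention rather than something PacBio reported, and the function name is illustrative.

```python
import math


def phred_quality(accuracy_percent):
    """Convert a percent accuracy into a Phred-scaled quality score."""
    error_rate = 1.0 - accuracy_percent / 100.0
    return -10.0 * math.log10(error_rate)


# Accuracy figures as reported by the company
reported = [
    ("raw long reads", 85.0),
    ("circular consensus reads", 97.8),
    ("final consensus assembly", 99.998),
]

for label, accuracy in reported:
    print(f"{label}: {accuracy}% accuracy is about Q{phred_quality(accuracy):.1f}")
```

On this scale, the raw reads, the circular consensus reads, and the final assembly come out at roughly Q8, Q17, and Q47, respectively.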
The final assembly has a consensus accuracy of 99.998 percent and comprises 33 contigs that cover the bacterial chromosome, and four additional contigs covering the two plasmids. The N50 contig size is 402 kilobases, with the largest contig at 654 kilobases.
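For readers unfamiliar with the N50 statistic, the sketch below shows one common way to compute it: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembled bases. The contig lengths are made-up toy values, not the PacBio assembly, and the function is an illustration rather than the tool the team used.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    The N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical contig lengths in kilobases, for illustration only
toy_contigs = [100, 80, 60, 40, 20]
print(f"N50 of toy assembly: {n50(toy_contigs)} kb")
```

Because the statistic depends on how the bases are distributed across contigs, a higher N50 together with a lower contig count generally indicates a more contiguous assembly.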
By comparison, Justin Johnson, bioinformatics director at EdgeBio, recently assembled O104 data from the Ion Torrent PGM and achieved an N50 of 50,000 base pairs and 173 contigs. In addition, he assembled the strain genome on Illumina's MiSeq and generated an N50 of around 95,000 base pairs and 117 contigs (7/5/2011).
While other groups were able to do assemblies of the outbreak strain in fewer contigs, some of those groups used either a reference-guided approach, a fosmid library, or PCR sequencing of the ends to achieve those assemblies, Schadt said. The exception was a de novo assembly from the UK's Health Protection Agency using Roche's 454 GS Junior, which had 13 contigs.
For the other 11 E. coli strains, Schadt said the team used only the long insert libraries, and then aligned those reads to the outbreak assembly to perform comparative genomics.
Of the 11 other strains, six had the same serotype as the outbreak strain but did not contain the Shiga toxin. While the team is still analyzing the data, Schadt said the goal is to look for structural changes between the strains.
Sequencing the other strains, in particular the six of the same serotype, will help "understand to what extent there are systematic differences" between the outbreak strain and the other strains of the same serotype "that may elucidate what are the key players and relatedness from an evolutionary standpoint of the outbreak strain" that make it so much more toxic, he added.
While all the sequencing was done by PacBio, the company collaborated with several research groups, including a team from the University of Maryland, which helped with the assembly and comparative genomics of other strains.
Dave Rasko, an assistant professor of microbiology at the University of Maryland's Institute for Genome Sciences who collaborated with the PacBio team, said that PacBio's longer reads "helped solve the genome structure" of the strains.
For instance, he said, the de novo assembly enabled the team to figure out where the phage that encodes the Shiga toxin had inserted itself into the genome. "We can be more confident in terms of our assembly" because it is de novo and not reference based, he said.
Additionally, the "quoted error rate is somewhat disingenuous," he said. Even though the raw read error rate is high, once the circular consensus sequencing is added, the accuracy is greater than 99.99 percent, which "puts it on par with any other sequencing technology."
One aspect of PacBio's technology that was not used in the E. coli sequencing was its strobe sequencing capability, which sequences in bursts across one molecule of DNA to help in assembly by joining together contigs. Schadt said that in this case, the read lengths were already long enough that the addition of strobe sequencing would not have helped.
Already, with the 9-kilobase insert library, "we were bumping up against the state of the art" in DNA fragmentation technology, he said. "We couldn't get fragment lengths reliably over 10 kilobases."
Strobe sequencing was beneficial when the company's read lengths were shorter, but now, "in order for the strobe to be beneficial, we would have wanted to span 20-kilobase chunks," he said.
While the strobe sequencing would not have improved the assembly, it still has other applications, he said. For instance, it could be employed to decrease turnaround time.
Following the release of the data, William Blair analyst Amanda Murphy issued a research note that cited the improvements the company has made, yet maintained the investment bank's 'Market Perform' rating.
"The longer-term adoption curve (i.e., post-2012) will rely on the company's ability to continue to improve specifications over time," she wrote.
Murphy added that "there seems to be a lot of excitement around the RS's potential in the research community — particularly around the long read lengths." However, due to its low throughput — around 70 times lower than Illumina's HiSeq 2000 — it currently occupies a "niche in the market" for researchers looking for long read lengths for applications such as bacterial sequencing and targeted resequencing.
Additionally, due to its rapid turnaround time, the system has "interesting implications for infectious disease identification and monitoring," Murphy wrote.
Schadt said that infectious disease was an area in particular that the company would focus on, along with de novo assembly and targeted resequencing of medically relevant genes. The long read lengths will resolve both "small DNA variations, but also larger structural variations that are very important to the biology of infectious systems."
Additionally, he said he expects the machine's speed and flexibility to have "utility in the targeted gene sequencing space," particularly for cancer diagnostics and therapy.