By Monica Heger
This article has been updated from a version posted June 21 to include additional information about the initial assemblies of Ion Torrent reads.
Following the initial sequencing of the Escherichia coli O104 strain behind an outbreak that has killed 40 and sickened thousands in Europe, nine isolates have now been sequenced by five different teams on four different sequencing platforms: the Ion Torrent PGM, Illumina HiSeq, Roche's 454 GS Junior, and most recently the Illumina MiSeq.
A BGI team and a collaboration between Life Technologies and the University of Münster were the first to sequence the strain on the Ion Torrent PGM (IS 6/7/2011). BGI then sequenced the strain on the Illumina HiSeq, followed by the UK's Health Protection Agency, which sequenced the genome on Roche's 454 GS Junior (IS 6/14/2011).
Since then, a team from the Göttingen Genomics Laboratory has sequenced the E. coli strain with 454 technology, and the UK HPA provided five outbreak samples to Illumina, which then sequenced them on its MiSeq instrument. Data from the MiSeq and 454 runs are both available here.
The different groups have been making their data publicly available and researchers have started a crowdsourcing project to analyze and annotate the different assemblies.
The different sequence data and assemblies highlight the differences between the technologies and are also helping to trace the evolution of the outbreak.
Kathryn Holt, a postdoctoral research fellow in the department of microbiology and immunology at the University of Melbourne who has been following the outbreak and analyzing the different sequence and assembly data on her blog, said that as the assemblies have progressed, so has their quality.
Following the sequencing of the first genomes by the BGI and University of Münster/Life Technologies teams on the Ion Torrent PGM, Nick Loman, a bioinformatician at the University of Birmingham, did the first de novo assembly using only BGI's PGM data from five runs. That initial de novo assembly produced "thousands of contigs with homopolymeric errors," said Holt. Life Technologies, meanwhile, took a different approach to assembling its PGM reads. Instead of doing a completely de novo assembly, the researchers used a hybrid approach, first mapping the reads to a reference genome and then performing a de novo assembly of the reads that did not map, which allowed them to reduce the number of contigs to 364.
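The hybrid strategy described above can be illustrated with a toy sketch. This is not Life Technologies' pipeline: the function name `partition_reads` and all sequences are hypothetical, and exact substring matching stands in for a real read aligner. It shows only the split itself, with reads that match the reference set against it and the remainder left over for de novo assembly.

```python
def partition_reads(reference, reads):
    """Split reads into those found verbatim in the reference and those not.

    Assumption: exact substring matching is a stand-in for a real aligner,
    which would tolerate mismatches and indels.
    """
    mapped, unmapped = [], []
    for read in reads:
        if read in reference:
            mapped.append(read)
        else:
            unmapped.append(read)
    return mapped, unmapped


# Hypothetical toy data, not outbreak sequence.
reference = "ACGTACGTTTGACCA"
reads = ["ACGTAC", "TTGACC", "GGGCCC", "CGTTTG"]

mapped, unmapped = partition_reads(reference, reads)
print("placed against the reference:", mapped)
print("left for de novo assembly:", unmapped)
```

Only the unmapped reads would need to be assembled from scratch, which is one plausible reason such an approach yields fewer contigs than assembling every read de novo.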
The first two genomes were sequenced quickly, which Holt said was critical for patient care. The initial sequencing also determined that the bacteria contained a Shiga toxin-producing gene, suggesting that it was enterohaemorrhagic E. coli (EHEC), but also contained some features of another type of E. coli known as enteroaggregative E. coli (EAEC).
BGI's next assembly, which was completely de novo but combined data from the HiSeq and Ion Torrent, produced 452 contigs. While this approach "allowed us to correct some of the homopolymeric errors" from the Ion Torrent reads, Holt said, the assembly still didn't allow researchers to distinguish chromosomal sequence from plasmids, or to assign genes to specific locations.
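A homopolymeric error is a miscount of the length of a run of identical bases. As a minimal sketch of the idea, and not a description of BGI's correction method, the snippet below collapses each run to a single base so that two reads can be checked for whether they disagree only in run length. The function names and the reads are hypothetical.

```python
from itertools import groupby


def compress_homopolymers(seq):
    """Collapse each run of identical bases to a single base."""
    return "".join(base for base, _run in groupby(seq))


def differs_only_in_homopolymers(a, b):
    """True if the reads differ, but only in the lengths of base runs."""
    return a != b and compress_homopolymers(a) == compress_homopolymers(b)


# Hypothetical reads covering the same locus on two platforms.
read_one = "ACGGGTTAC"   # run of three Gs
read_two = "ACGGGGTTAC"  # run of four Gs

print(compress_homopolymers(read_one))
print(differs_only_in_homopolymers(read_one, read_two))
```

A disagreement flagged this way points at run-length miscounting rather than a genuine substitution, which is the kind of discrepancy a second platform's reads can help resolve.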
The HPA assembly, meanwhile, clarified the location of resistance genes within the genome and helped researchers determine that the bacteria's Shiga toxin-producing ability came from the integration of a phage genome into an EAEC strain.
HPA's assembly allowed for "the exact location and relationship between coding sequences" to be identified, said Saheer Gharbia, head of HPA's bioanalysis department. "Also, genetic exchange events and location of insertion elements [are] now apparent."
The latest BGI assembly, available here, essentially confirms the HPA's findings, clarifying that the hybrid strain is from the EAEC lineage with an acquired phage genome that produces the Shiga toxin.
BGI's final draft assembly contains no gaps. According to the institute, the E. coli O104 genome is composed of one circular chromosome 5,278 kilobase pairs in length, and three additional plasmids 88 kilobase pairs, 75 kilobase pairs and 1.5 kilobase pairs in size, respectively.
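Taken together, the sizes BGI reported add up to a genome of 5,442.5 kilobase pairs, or roughly 5.44 megabase pairs. The short calculation below reproduces that sum and the share contributed by the plasmids. The variable names are illustrative, and the figures are the ones quoted above.

```python
# Replicon sizes in kilobase pairs, as reported by BGI.
chromosome_kb = 5278.0
plasmids_kb = [88.0, 75.0, 1.5]

total_kb = chromosome_kb + sum(plasmids_kb)
plasmid_fraction = sum(plasmids_kb) / total_kb

print(f"Total genome size: {total_kb:,.1f} kb ({total_kb / 1000:.2f} Mb)")
print(f"Plasmid share of the genome: {plasmid_fraction:.1%}")
```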
Whereas the initial sequences contained all the necessary information for the identification of genes associated with pathogenicity and resistance, the more complete assemblies will be important for figuring out where the particular strain "picked up those genes," Holt said.
As different groups release the reads from additional sequencing, Holt said that researchers could use the data to generate a meta-assembly. Having data from multiple platforms could help in producing the most accurate assembly, she said, because it should compensate for any errors that might arise from a single platform.
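The reasoning behind combining platforms can be sketched as a per-position majority vote: if each platform makes its errors in different places, the call supported by most platforms is likely to be the correct one. The example below is a simplified illustration under strong assumptions, namely that the sequences are already aligned and of equal length. It is not a meta-assembly method described by Holt, and the platform labels and sequences are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned):
    """Return the most common base at each position of aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned)
    )


# Hypothetical calls for the same region; each platform errs at a different site.
platform_calls = {
    "platform_a": "ACGTTAGC",
    "platform_b": "ACGATAGC",  # disagrees at position 3
    "platform_c": "ACGTTAGG",  # disagrees at position 7
}

consensus = majority_consensus(list(platform_calls.values()))
print(consensus)
```

With three inputs, an error confined to a single platform is outvoted by the other two. A real meta-assembly would also have to handle insertions, deletions, and differing coverage.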
Additionally, because different isolates have been sequenced, the different genomes could be compared "to see if there are genuine differences and to see if there are mutations that have occurred during the outbreak," she said.
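Comparing isolates in this way comes down to listing the positions at which their sequences differ. The following is a minimal sketch assuming two already-aligned sequences of equal length. The function name `find_differences` and the isolate sequences are hypothetical, and a real comparison would first have to separate sequencing errors from true mutations.

```python
def find_differences(isolate_a, isolate_b):
    """List (position, base_a, base_b) wherever two aligned sequences differ.

    Positions are 0-based.
    """
    if len(isolate_a) != len(isolate_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(isolate_a, isolate_b))
        if a != b
    ]


# Hypothetical aligned fragments from an early and a later isolate.
early_isolate = "ATGGCGTACCTGA"
later_isolate = "ATGGCGTATCTGA"

differences = find_differences(early_isolate, later_isolate)
print(differences)
```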