The 100K Genome Project has added 20 newly completed genome sequences of foodborne pathogens generated using Pacific Biosciences' single-molecule sequencing technology to the National Center for Biotechnology Information's public database.
The project, which is led by the University of California, Davis; Agilent Technologies; and the US Food and Drug Administration, aims to sequence 100,000 bacterial and viral genomes overall. It adopted the PacBio RS at the start of this year, with plans to use the technology to assemble at least 1,000 complete genomes from five pathogenic bacteria species: Salmonella, Campylobacter, Escherichia coli, Vibrio, and Listeria.
Bart Weimer, the project's director and a professor at UC Davis, told In Sequence this week that these first 20 genomes represent a gearing-up phase, in which his team worked through the logistics of automating its PacBio sequencing protocol so that it can accelerate the project over the next year.
He said PacBio sequencing has been "fantastic" for the team so far, yielding "nice solid data" that has been easy for the group to turn around.
"We've also worked through some of the logistics of getting it loaded into the public domain, which are quite extensive because the amount of information the PacBio platform is providing is more extensive than what NCBI is used to handling," Weimer added.
"The first 20 were a trial run," he said. "And now we are at a place where we have been able to automate and streamline the process. The workflow is completely put in place to where we just finished doing short reads of 1,500 isolates, and that was done in three months. So now it's about shifting into third gear and [processing] more isolates."
Accordingly, Weimer said, the team is planning to release a second, much larger set of completed genomes this fall.
Initially, the project is using the PacBio system in combination with short-read sequencing data from other platforms to create hybrid assemblies, but the team said this past January that as PacBio's newer, longer-read chemistry came online, it would switch to using the technology for de novo assembly (IS 1/15/2013).
Weimer said PacBio sequencing has been easy for the group to adopt, although library construction was an initial hurdle. That work included choosing and optimizing the available kits, then standardizing and automating the approach to get consistent quality.
Jonas Korlach, PacBio's chief scientific officer, told In Sequence that PacBio and the 100K genomes team have been working with several vendors to adapt fluidic robot technology to automate library preparation steps.
He also said the team has benefitted, since it began working with PacBio in January, from recent upgrades in platform hardware and chemistry.
"On the hardware side, the PacBio RS II [now] doubles the throughput per SMRT cell, so you only need half the cells to close a bacterial genome," Korlach said. The RS II, which was launched in March, increases throughput over the initial PacBio system to around 500 megabases per SMRT cell.
"Then, with our chemistry upgrade, it has a higher single-pass accuracy, which ensures that you can get a high-quality genome at lower coverage," Korlach added.
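The arithmetic behind Korlach's two points can be sketched roughly as follows. This is a back-of-the-envelope illustration, not a PacBio formula: the function name `smrt_cells_needed`, the 5-megabase genome size, and the 100-fold coverage target are all assumptions chosen for the example. Only the figure of roughly 500 megabases per SMRT cell comes from the article, and the 250-megabase value is simply half of that.

```python
import math

def smrt_cells_needed(genome_size_bp, target_coverage, yield_per_cell_bp):
    """Estimate SMRT cells needed to reach a target fold-coverage.

    Hypothetical helper: total bases required is genome size times
    coverage, divided by the yield of one cell and rounded up.
    """
    total_bases = genome_size_bp * target_coverage
    return math.ceil(total_bases / yield_per_cell_bp)

# Assumed example values: a 5 Mb bacterial genome at 100x coverage.
genome = 5_000_000
coverage = 100

cells_before = smrt_cells_needed(genome, coverage, 250_000_000)  # half the RS II yield
cells_after = smrt_cells_needed(genome, coverage, 500_000_000)   # ~500 Mb per cell
print(cells_before, cells_after)
```

Under these assumed numbers, doubling the per-cell yield halves the cell count, and lowering the coverage target (as higher single-pass accuracy would allow) reduces it further.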
A third advance, which was not in play for the 20 genomes the 100K project just released — but which Korlach said is being incorporated into the workflow moving forward — is upfront DNA size selection using Sage Science's BluePippin platform, which PacBio agreed to co-market earlier this year.
Weimer said that the chemistry upgrade has been particularly helpful, as has the release of PacBio's HGap de novo assembly tool, which selects the longest available reads as seed reads used to pre-assemble the genome, and to which shorter reads are aligned to build consensus.
"We are one of the first groups to use HGap, and we started using it before they published it," Weimer said. "It has really been a huge improvement as well."
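The seed-selection step described for HGap can be illustrated with a minimal sketch. This is not PacBio's implementation: the function `select_seed_reads`, its parameters, and the idea of stopping once the seeds reach a chosen fold-coverage are assumptions made for the example, and the read lengths are invented.

```python
def select_seed_reads(read_lengths, genome_size, seed_coverage):
    """Pick the longest reads until they cover the genome seed_coverage times.

    Hypothetical illustration of "longest reads become seeds"; the
    remaining, shorter reads would then be aligned to these seeds.
    """
    target_bases = genome_size * seed_coverage
    seeds = []
    total = 0
    # Walk reads from longest to shortest, stopping once the target is met.
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bases:
            break
        seeds.append(length)
        total += length
    return seeds

# Invented toy data: five read lengths and a 1,000-base "genome".
reads = [1000, 8000, 3000, 6000, 500]
seeds = select_seed_reads(reads, genome_size=1000, seed_coverage=10)
print(seeds)  # [8000, 6000]
```

In this toy case the two longest reads already exceed the 10,000-base target, so the three shorter reads are left over to be aligned against the seeds.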
More than the sequencing itself, Weimer said, the main issue for the group in this first phase was integrating the PacBio data into NCBI's database.
"These finished genomes and the mass of information is not something they are typically seeing in this quality," Korlach explained. "Also the data structure is a little different from Sanger and second-generation sequencing because we are providing DNA-base modification information — epigenetic information — along with sequence information."
"So, we have been working with NCBI to incorporate that and make it available in their genome browsers, but there is still some work we have to do to really optimize that — to make it efficient," he said.
Moving forward, Weimer said the team is now looking to expand its sequencing efforts dramatically — seeking partners that have capacity to do sequencing in a "robust enough fashion" to keep up with the project's goals.
He didn't mention who these potential partners might be, but said the project is considering several.
"Also, what we are going to be doing in the next year," he said, "is clarifying the strategy of what isolates to sequence when [and] making a ranking system."
"So far we've been taking all comers and sequencing whatever came in because we needed to get the pipeline in place. But year two and three will be about a strategy to fill in some of the genomic gaps we have and then fill in some of the parts of the world we haven't sequenced yet," Weimer said.