Pacific Biosciences' single-molecule sequencing technology will be used to assemble at least 1,000 pathogen genomes as part of the 100K Genome Project led by the University of California, Davis; Agilent Technologies; and the US Food and Drug Administration.
Bart Weimer, director of the 100K Genome Project and a professor at UC Davis, told In Sequence that the PacBio RS would be used to "finish completely 1,000 genomes" from five pathogenic bacteria: Salmonella, Campylobacter, Escherichia coli, Vibrio, and Listeria.
Initially, the technology will be used in combination with short-read sequencing data to create hybrid assemblies, but as PacBio's newer, longer-read chemistry comes online, the researchers will switch to using the technology on its own for de novo assembly.
Around 80 percent of the 1,000 genomes will be de novo assembled with PacBio, Weimer said. All of the data will be submitted to the National Center for Biotechnology Information.
Since it kicked off last July, the 100K Genome Project has been sequencing pathogen genomes primarily with Illumina technology, including on the MiSeq as part of a larger pilot project to equip state laboratories with MiSeq instruments (IS 10/9/2012). The state laboratories plan to contribute sequence data from strains in their laboratories to the project.
However, "in complement to the short reads being produced, we want to create a database encyclopedia of really high-quality finished genomes of a broad scale from around the world that represent the pan-genomes of the top five to six major pathogens," said Weimer.
Marc Allard, an FDA research microbiologist and the research area coordinator of comparative genomics at FDA’s Center for Food Safety and Applied Nutrition, added via email that the agency would be using PacBio in the project to "rapidly close bacterial genomes."
To date, he said, the agency has closed half a dozen Salmonella genomes and their mobile elements.
PacBio has been working on new sequencing technology, dubbed XL, that increases average read lengths of its system to more than 4,300 bases and throughput to between 200 and 250 megabases per SMRT cell (IS 11/13/2012).
Additionally, it has developed a new assembly tool, HGap, for de novo assembly. HGap selects the longest reads to create "seed reads," Jonas Korlach, PacBio's chief scientific officer, explained to IS. Those seed reads are used to pre-assemble the genome, after which the shorter reads from the PacBio RS are aligned to the seed reads to build consensus.
The user first generates at least 60x to 120x coverage of the genome, and then sets a read-length threshold such that reads longer than the specified threshold generate around 20x coverage of the genome.
Only those reads are used to create the long, pre-assembled seed reads.
Then, the remainder of the sequence data is aligned to the seed reads to build consensus and correct errors, Korlach said. This consensus-building step helps increase accuracy. In one example of a 9-kilobase-long read, accuracy increased from around 85.7 percent to 99.3 percent, Korlach said.
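The threshold-selection step Korlach describes can be illustrated with a short Python sketch. This is not PacBio's HGap code; the function names, the 20x default, and the toy numbers are assumptions made for illustration. The idea is simply to walk down the reads from longest to shortest until the accumulated bases reach the target coverage, and to use the length reached at that point as the cutoff.

```python
def seed_length_threshold(read_lengths, genome_size, target_coverage=20.0):
    """Return a read-length cutoff such that reads at least this long
    supply roughly `target_coverage`-fold coverage of the genome.

    Hypothetical helper for illustration; not part of any PacBio tool.
    """
    needed_bases = target_coverage * genome_size
    total = 0
    # Accumulate bases from the longest read downward until the target is met.
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= needed_bases:
            return length
    # Too little data to reach the target: every read ends up as a seed.
    return min(read_lengths) if read_lengths else 0


def split_reads(read_lengths, threshold):
    """Partition reads into seeds (>= threshold) and the shorter remainder."""
    seeds = [n for n in read_lengths if n >= threshold]
    rest = [n for n in read_lengths if n < threshold]
    return seeds, rest


if __name__ == "__main__":
    # Toy data: five reads against a 1,000-base "genome".
    reads = [9000, 7000, 5000, 3000, 1000]
    cutoff = seed_length_threshold(reads, genome_size=1000)
    seeds, rest = split_reads(reads, cutoff)
    print(cutoff, seeds, rest)
```

In this toy case the three longest reads together exceed 20x coverage, so they become seeds and the two shorter reads are left for the alignment and consensus step.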
Additionally, in its own sequencing of an E. coli genome, the company used eight SMRT cells and assembled the genome into two contigs with an N50 of 4.6 megabases and an accuracy of 99.9995 percent.
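For readers unfamiliar with the two figures quoted for the E. coli assembly, the Python sketch below shows how they are conventionally computed. N50 is the contig length at which the contigs of that size or longer account for at least half of the assembled bases, and an accuracy figure can be restated as a Phred-scaled quality value. The contig lengths used in the example are invented, since the article gives only the contig count and the N50.

```python
import math


def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def accuracy_to_phred(accuracy):
    """Convert a per-base accuracy (0..1) to a Phred-scaled quality value."""
    error = 1.0 - accuracy
    if error <= 0:
        return math.inf  # a perfect score has no finite Phred value
    return -10.0 * math.log10(error)


if __name__ == "__main__":
    # Hypothetical two-contig assembly: one near-complete chromosome plus a small contig.
    print(n50([4_600_000, 50_000]))
    # 99.9995 percent accuracy corresponds to a quality value of roughly 53.
    print(round(accuracy_to_phred(0.999995), 1))
```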
The process is also simpler than previous versions of assembly using only PacBio data, because unlike with circular consensus sequencing, only one library has to be prepared using this method, reducing the amount of DNA required by around half, Korlach said.
Weimer told In Sequence that UC Davis has been testing the new technologies from PacBio and so far the data "are looking quite good."
Building these five or six pan-genomes will be important because it will help in better "identifying organisms that are inside of an outbreak," he said. "As an outbreak occurs, we'll know at a detailed level, [if] this new isolate is part of the existing outbreak or outside of it."
Allard added that having complete, closed genomes is important for "providing a single complete picture of the genetic makeup of a pathogen."
Currently, this type of analysis is being done by looking at SNPs, but that approach does not always provide high enough resolution for use in an outbreak scenario. For instance, Weimer said, for organisms like E. coli, the "genomes of individual isolates are so different and variable." So in strains from the US, a specific gene is often used as a biomarker, but that marker is often missing in strains from Europe.
Instead, "as these genomes become available, people will use that information to create new analytical methodologies," he said.
Aside from its long reads and new assembly applications, Weimer said, the consortium was also interested in PacBio technology for its ability to directly detect epigenetic modifications, such as methylation.
Detecting epigenetic modifications on the PacBio does not require a separate sequencing experiment. Instead, base modifications can be detected by analyzing the kinetics of the system. As the polymerase incorporates nucleotides, there is a detectable pause if a base is modified. The company recently released software that helps users analyze these events (IS 7/3/2012).
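The kinetic signal described above can be pictured with a deliberately simplified Python sketch. It is not the company's analysis software: the function name, the 2.0 ratio cutoff, and the sample values are all assumptions. It compares the observed time between nucleotide incorporations at each position (the interpulse duration) with a baseline expected for unmodified DNA, and flags positions where the polymerase paused markedly longer.

```python
def flag_slow_positions(observed_ipd, baseline_ipd, min_ratio=2.0):
    """Return (position, ratio) pairs where the observed interpulse duration
    is at least `min_ratio` times the unmodified baseline.

    Hypothetical illustration only; real tools apply statistical tests
    across many reads rather than a single fixed cutoff.
    """
    if len(observed_ipd) != len(baseline_ipd):
        raise ValueError("observed and baseline must cover the same positions")
    flagged = []
    for pos, (obs, base) in enumerate(zip(observed_ipd, baseline_ipd)):
        # Skip positions with no usable baseline to avoid dividing by zero.
        if base > 0 and obs / base >= min_ratio:
            flagged.append((pos, obs / base))
    return flagged


if __name__ == "__main__":
    # Toy per-position mean durations in seconds; position 1 shows a long pause.
    observed = [0.25, 1.0, 0.5]
    baseline = [0.25, 0.25, 0.5]
    print(flag_slow_positions(observed, baseline))
```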
"Ultimately, we'll have the methylome of these genomes to give us further insight — an additional axis of information to figure out how to be analytically correct," Weimer said.
Allard agreed that the epigenetic information the PacBio provides is critical.
"The system provides new information about methylated DNA sites," he said, which allows for "a number of restriction endonuclease and methyltransferase discoveries. So far each new serotype examined appears to have a unique methylation pattern when checked against the RefBase."
Weimer said he expects to generate the 1,000 de novo assembled genomes over the next three to four years. The bulk of the PacBio sequencing will likely be done at UC Davis, he said, but other member organizations, such as the FDA, will also be generating some of the data, as will PacBio itself.