By Monica Heger
This story was originally published on May 13.
COLD SPRING HARBOR, NY — Early users of the Pacific Biosciences RS machine last week reported on their initial experiences with the system, including whole-genome sequencing of the C. elegans worm, strobe sequencing to improve assembly, and the hybrid assembly of the Rhodobacter genome with Illumina reads.
In a presentation at the Biology of Genomes meeting at Cold Spring Harbor Laboratory, Vincent Magrini, senior group leader of technology development at Washington University's Genome Institute, said that his group used two sequencing methods on the RS: long-insert library prep, and strobe sequencing in combination with Illumina sequencing to improve genome assembly. In another presentation, David Jaffe at the Broad Institute discussed how his team used the PacBio to fill gaps between Illumina-generated scaffolds of the Rhodobacter genome.
While the Wash U team found that PacBio's long-insert sequencing method produced the expected results, further work is needed to optimize the strobe sequencing method, which is still under development. Jaffe, meantime, said the Broad's work so far indicates that "it makes sense to marry the Illumina and PacBio."
'The Data Looks Good'
The Wash U team used an upgraded version of the RS, with 150,000 wells, or zero-mode waveguides, per SMRT cell, and chemistry that enabled read lengths of around 1,400 to 1,500 base pairs.
For whole-genome sequencing of C. elegans, they used a 2-kilobase insert library and sequenced to an average 30-fold coverage, obtaining between 28-fold and 33-fold coverage for the various chromosomes. Subread accuracy was around 86 percent, and the average consensus accuracy was 99.96 percent, depending on the coverage. Chromosome 1, for instance, achieved the highest coverage, and therefore the highest consensus accuracy. Magrini said that the higher coverage suggested that there were repetitive regions on the chromosome.
For each 150,000-ZMW SMRT cell, two sets of 75,000 ZMWs are run sequentially, producing two movies, or data outputs. While average read lengths increased from the first to the second movie, from around 1,400 base pairs to 1,500 base pairs, the yield decreased from around 35 megabases to 25 megabases. Magrini said that these differences had no overall impact on coverage, however.
In order to sequence the genome to 30-fold coverage, Magrini said that the team sequenced 330 SMRT cells and generated 419 movies. Based on current improvements to the machine, Magrini said they could use the same sample prep and sequence 120 SMRT cells, generating 240 movies. Additionally, "we are in the middle of another upgrade and expect additional improvements," he added.
Looking at the error rate, he said that it was consistent across the genome, suggesting that errors are random. Insertions were the most common type of error, followed by deletions and then mismatches.
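The link between random errors and high consensus accuracy can be illustrated with a toy calculation. The sketch below is not PacBio's consensus algorithm; it assumes that every read covering a base is independently correct with the same probability and that the consensus is a simple majority vote. The function name is hypothetical, and the 0.86 figure is the subread accuracy reported in the talk.

```python
from math import comb


def majority_vote_accuracy(per_read_accuracy: float, coverage: int) -> float:
    """Probability that a strict majority of reads is correct at one base.

    Toy model (assumption): reads are independent and each is correct
    with the same probability. Ties at even coverage count as failures.
    """
    p = per_read_accuracy
    needed = coverage // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(coverage, k) * p**k * (1 - p) ** (coverage - k)
        for k in range(needed, coverage + 1)
    )


# Consensus accuracy under this toy model at a few coverage depths.
for depth in (1, 5, 15, 30):
    print(depth, majority_vote_accuracy(0.86, depth))
```

Under these assumptions accuracy climbs quickly with depth, which is why randomly distributed errors are far less damaging than systematic ones. Real consensus calling has to handle insertions and deletions, so the numbers from this model should not be compared directly with the reported 99.96 percent.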
He said that the machine demonstrated good coverage across the genome, but that there were still some biases in areas of high GC or high AT content.
"Overall, the data looks good," he concluded.
The team next tested the strobe sequencing method in combination with Illumina sequence data to improve assembly. Using the SOAPdenovo algorithm to assemble the C. elegans genome resulted in many contigs, said Magrini, so they combined that data with PacBio's strobe data to improve the scaffolding.
In theory, the strobe feature works by sequencing in bursts across one molecule of DNA: the strobe is turned on, begins sequencing, and is then turned off, but it continues to move across the DNA molecule even while it is off. This process occurs three times along one molecule, in theory generating three sets of reads that span a single, long molecule of DNA. That information can then be used to join contigs together.
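How strobe reads could join contigs can be sketched in a few lines. This is a minimal illustration, not the Wash U or PacBio pipeline: it assumes each subread has already been aligned to a contig, and the `SubreadHit` record, its fields, and the contig names are all hypothetical. Orientation and gap-size estimates, which a real scaffolder would need, are ignored.

```python
from collections import Counter
from typing import NamedTuple


class SubreadHit(NamedTuple):
    strobe_id: str  # hypothetical ID of the parent DNA molecule
    order: int      # position of the subread along the molecule (0, 1, 2)
    contig: str     # contig the subread aligned to


def contig_links(hits: list[SubreadHit]) -> Counter:
    """Count molecules whose consecutive subreads land on different contigs."""
    by_molecule: dict[str, list[SubreadHit]] = {}
    for hit in hits:
        by_molecule.setdefault(hit.strobe_id, []).append(hit)

    links: Counter = Counter()
    for molecule_hits in by_molecule.values():
        ordered = sorted(molecule_hits, key=lambda h: h.order)
        # Each adjacent pair of subreads on different contigs is one link.
        for first, second in zip(ordered, ordered[1:]):
            if first.contig != second.contig:
                links[(first.contig, second.contig)] += 1
    return links


hits = [
    SubreadHit("m1", 0, "contigA"),
    SubreadHit("m1", 1, "contigA"),
    SubreadHit("m1", 2, "contigB"),  # triplet: links contigA to contigB
    SubreadHit("m2", 0, "contigA"),
    SubreadHit("m2", 1, "contigB"),  # doublet: second supporting link
    SubreadHit("m3", 0, "contigC"),  # singleton: no linking information
]
links = contig_links(hits)
print(links)
```

In this sketch a molecule that yields only one subread contributes no links at all, so the method depends on recovering two or three subreads per molecule.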
However, when the team tried the method, around two-thirds of their strobe reads were singleton reads, so instead of three sets of reads spanning one molecule, there was only one read. Only 7 percent of the reads were triplets, while 26 percent were doublets.
Magrini said it is unclear why the majority of the strobe reads were singletons. He said that it could be that the DNA fragments were too short, such that the strobe may have reached the end of the fragment after the first sequencing step. Or, there could be a problem with the enzyme kinetics or quality filtering, both of which could impact strobe reads and result in single reads per molecule.
He added that the Wash U researchers have not yet evaluated the doublets and triplets that they did achieve, which could also provide clues as to why the method did not achieve the desired results.
Going forward, he said his team would continue to work with PacBio to "improve strobe library efficiency and to [better] understand the strobe mechanism."
He was unable to comment on the types of applications for which his team planned to use the machine, due to a non-disclosure agreement it has signed with PacBio.
'One Giant Contig'
Meanwhile, David Jaffe at the Broad Institute presented data on how his team used Illumina in combination with the RS to assemble bacterial genomes.
"Given the pluses and minuses of the technology, it makes sense to marry the Illumina and PacBio," he said. While the Illumina has a lower error rate than PacBio, it also has shorter read lengths, making assembly more difficult.
His team generated scaffolds for the Rhodobacter bacteria with an Illumina system, and then aligned those scaffolds to a PacBio read. Where there was a gap, they sequenced to high coverage with the PacBio in order to fill the gap and create longer scaffolds.
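The gap-filling idea can be sketched as follows. This is a deliberately simplified illustration and not the Broad's method: it assumes the sequence on either side of a gap can be found in a long read by exact string matching, whereas real PacBio reads, with subread accuracy around 86 percent, would require error-tolerant alignment. The function name and the example sequences are invented.

```python
from typing import Optional


def fill_gap(left_flank: str, right_flank: str, long_read: str) -> Optional[str]:
    """Return the bases a long read places between two scaffold flanks.

    Assumption: both flanks occur exactly in the read. Returns None when
    the read does not span the gap.
    """
    start = long_read.find(left_flank)
    if start == -1:
        return None
    fill_start = start + len(left_flank)
    end = long_read.find(right_flank, fill_start)
    if end == -1:
        return None
    return long_read[fill_start:end]


# Invented example: the read spans both flanks and the unknown bases between.
long_read = "TTACGGATCCGATTACAGGCTTAACGT"
left = "GGATCC"   # end of the contig before the gap
right = "GCTTAA"  # start of the contig after the gap
print(fill_gap(left, right, long_read))
```

A read that spans both flanks turns two contigs and a gap into one continuous sequence, which is the step that would let many contigs collapse into a single one.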
After sequencing the Rhodobacter genome with the Illumina platform, there were 22 contigs in one scaffold. However, after adding the PacBio reads, they were able to reduce that to "one giant contig," Jaffe said.
Adding the PacBio data does introduce errors at a rate of about 1 percent across the gaps, most of which are insertions, he said. In principle, it's possible to get rid of the errors by aligning Illumina reads back to the gap.
Going forward, he said the Broad Institute researchers want to work on maximizing the effective read lengths of the PacBio.
He added that the team would be interested in using the machine for bacterial genome assemblies and also to validate findings, for instance, in a clinical setting where a physician would need rapid feedback on a specific mutation.
Have topics you'd like to see covered by In Sequence? Contact the editor at mheger [at] genomeweb [.] com.