By Julia Karow
Several early-access users of the PacBio RS reported at this year's Advances in Genome Biology and Technology meeting in Marco Island, Fla., how the single-molecule real-time sequencing technology has been performing in their hands.
In addition, at a conference workshop, Pacific Biosciences reported on the performance of its system at the seven of its 11 early-access customer labs that had given permission to share their results, and described how it expects the instrument to perform at commercial launch in the second quarter.
PacBio introduced its sequencer at last year's AGBT. The company shipped its first beta instrument in July and has since upgraded the chemistry twice, chief technology officer Steve Turner reported during the workshop. As a result, the average read length at customer sites has increased from just over 500 bases to between 1,000 and 1,200 bases, depending on the length of the run, and PacBio has internally achieved an average read length of 1,500 bases. At launch, the average read length is expected to range between 850 and 1,500 bases, he said.
A fraction of the reads in each run is considerably longer: the top five percent of mapped reads averaged between 3,000 and 4,000 bases at the seven customer sites, and reached more than 6,000 bases in one customer's lab during a few runs. At launch, the top five percent of reads are expected to be between 2,000 and 3,000 bases long.
The system's single-molecule raw accuracy has also improved, from 82 percent to 85 percent internally at PacBio, while the seven beta customers currently average around 84 percent. At full commercial release, the accuracy is expected to reach 85 to 86 percent, and possibly up to 90 percent.
Single-molecule sequencing technologies have inherently higher raw read error rates than amplification-based sequencing technologies, Turner explained. But because the error is more uniform across reads than it is for short-read technologies, consensus accuracies will be "higher than ever been able before," he said.
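As a rough illustration of Turner's point, the sketch below treats consensus calling as a simple majority vote over independent reads, each of which is correct at a given base with the same probability. That model is an assumption made only for illustration: it ignores insertions and deletions, counts ties as errors, and is not PacBio's consensus algorithm. The function name `majority_consensus_accuracy` is hypothetical.

```python
from math import comb


def majority_consensus_accuracy(raw_accuracy: float, coverage: int) -> float:
    """Probability that a strict majority of `coverage` reads is correct at one base.

    Toy model (assumption): every read is independently correct with
    probability `raw_accuracy`, and the consensus is a plain majority vote.
    """
    if not 0.0 <= raw_accuracy <= 1.0:
        raise ValueError("raw_accuracy must be between 0 and 1")
    if coverage < 1:
        raise ValueError("coverage must be at least 1")

    needed = coverage // 2 + 1  # smallest number of correct reads that forms a strict majority
    error_rate = 1.0 - raw_accuracy
    return sum(
        comb(coverage, k) * raw_accuracy**k * error_rate ** (coverage - k)
        for k in range(needed, coverage + 1)
    )


# Consensus accuracy under the toy model at the 85 percent raw accuracy quoted above.
for depth in (1, 3, 5, 15):
    print(depth, round(majority_consensus_accuracy(0.85, depth), 5))
```

Under these assumptions the consensus accuracy climbs quickly with coverage as long as errors are random rather than systematic, which is the property Turner attributes to single-molecule reads.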
The system's yield per SMRT cell has increased as well, from an initial 5 megabases to about 11 megabases internally, using cells with 45,000 zero-mode waveguides, of which about 10,000 produce sequence data. Customers currently average about 10 megabases per SMRT cell, but the output has varied significantly between runs at the same customer site, though this variability has "improved," Turner said.
Internally, PacBio has now switched over to 75,000-ZMW cells, which have doubled the output to 22 megabases, the expected output at launch.
However, the new SMRT cells actually contain 150,000 ZMWs that can be read sequentially in two sets of 75,000 ZMWs during the same run. This is expected to increase the output per cell to between 35 and 45 megabases at launch, and the number of reads per run to more than 35,000.
At launch, customers will be able to run between 12 and 24 SMRT cells per day, depending on the run time they choose, and the time from template preparation to base calls will be less than a day.
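The launch figures imply a range for daily output per instrument, worked out in the short sketch below. It only multiplies the quoted yield per SMRT cell by the quoted number of cells per day; the article does not say how yield per cell varies with run time, so the result should be read as the widest bounds those numbers allow, not as a PacBio specification. The function name `daily_output_range_mb` is hypothetical.

```python
def daily_output_range_mb(yield_mb=(35, 45), cells_per_day=(12, 24)):
    """Return (low, high) megabases per day from per-cell yield and cells per day.

    Defaults are the launch expectations quoted in the article.
    """
    low = yield_mb[0] * cells_per_day[0]    # lowest yield with fewest cells
    high = yield_mb[1] * cells_per_day[1]   # highest yield with most cells
    return low, high


low, high = daily_output_range_mb()
print(f"{low} to {high} megabases per day")  # prints "420 to 1080 megabases per day"
```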
Starting in the fourth quarter and about every six months or so after that, PacBio plans to release new consumables kits that will increase read length, accuracy, and throughput.
To analyze PacBio's data, the company has developed a long-read aligner called BLASR, which supports read lengths between 20 bases and 200 kilobases. According to Jon Sorenson, the company's director of bioinformatics and computational biology, BLASR is currently able to align more than 200 megabases per hour to the human genome.
Most recently, the company has sequenced mitochondrial genomes, HLA regions, human fosmids, HIV, E. coli, and Vibrio cholerae, he reported, and "longer read lengths are allowing us to access larger and more complex genomes."
Customers Taking Control
Several early-access customers and collaborators presented their own experience with the PacBio RS during the meeting.
The Department of Energy's Joint Genome Institute, for example, received its instrument in mid-September and performed its first run in mid-October.
One application JGI has in mind for the instrument is finishing microbial genomes, according to Len Pennacchio, a senior staff scientist at JGI, who spoke during PacBio's workshop. He said the institute currently sequences "hundreds" of microbial genomes per year on the Illumina platform, using a combination of short-insert and long-insert paired-end reads, at a cost of about $5,000 per genome, which yields about 100 to 200 contigs per genome.
Genomes are currently finished by Sanger amplicon sequencing, at a cost of $30,000 per genome, or $4.5 million for 150 genomes per year.
A "major challenge" has been the non-uniform genome coverage by the short-read technologies, and the fact that repeat sequences are "hard to get to" with these, he said.
To test the PacBio platform, JGI sequenced four previously finished microbial genomes with a GC content ranging from 30 percent to 70 percent.
They generated about 7,500 reads per SMRT cell, with average read lengths improving from an initial 500 bases to 1,500 bases after a hardware and chemistry upgrade.
The top five percent of reads averaged 3.8 kilobases, and the longest read to date has been 6,637 bases.
The read accuracy was unaffected by the GC content of the genome, according to Pennacchio, which he said is "good news for assembly," and the accuracy appears to be fixed at about 85 percent throughout the length of the read.
For Brachyspira, a genome with only 28 percent GC content, the PacBio reads provided "near ideal coverage," he said, unlike reads from short-read technologies.
For a genome with 74-percent GC content, some regions were "a little underrepresented," he added, but covered much better than with short-read technology.
Overall, he said, the reads were "mostly uniformly distributed over the genome."
By combining short-read data with PacBio data for a hybrid assembly, he and his colleagues were able to reduce the number of scaffolds significantly for each of the four genomes, but this project is not yet finished.
He listed fast run times, long reads, a lack of amplification bias, a low cost per SMRT cell run of hundreds of dollars, and the possibility to sequence in multiple modes — standard sequencing, strobe sequencing, and circular consensus sequencing — as advantages of the PacBio platform.
There is still room for improvement, though. For example, the instrument currently requires micrograms of starting DNA, and the informatics tools for assembly and analysis are still works in progress, he said.
The current output also limits applications for mammalian genomes, and the cost per read or per base "cannot touch the massively parallel technologies." Finally, the cost of the instrument — almost $700,000 — is high, and a lower raw error rate would make the data more usable.
Gen-Probe, which partnered with PacBio last June on developing a clinical diagnostic sequencing system and invested $50 million in the company (IS 6/22/2010), also reported early results obtained from two collaborative pilot projects.
The company expects "sequencing to play a growing role in the molecular diagnostics lab," according to Gen-Probe's Matt Friedenberg, who presented results from the two projects during PacBio's workshop.
In particular, Gen-Probe sees applications for sequencing in transplant diagnostics, viral diagnostics, bacterial identification and characterization, inherited diseases, and cancer, he said.
The two companies have jointly evaluated the PacBio RS for deep sequencing of hepatitis C virus, both for genotyping and for detecting minority species, and for sequencing the HLA class I region to call alleles based on the full genomic sequence.
Both projects started in late September 2010, with Gen-Probe preparing the samples and libraries and PacBio sequencing them. To capture HCV RNA from serum and plasma samples, Gen-Probe used the same method it employs in its blood screening assay, followed by RT-PCR of selected viral regions.
One of the HCV sequencing runs yielded about 12,000 reads and 16.5 megabases of data, with an average read length of about 1,400 bases and a raw read accuracy of 85 percent, which increased to a circular consensus accuracy of 99 percent based on three reads. The coverage was high — about 12,000-fold — which is essential to detect minority species, Friedenberg said.
The scientists were actually able to detect transcripts from two HCV subtypes mixed at ratios of 1:10 and 1:100 "at about the predicted ratios," he said, and could also detect a minority species in a complex clinical sample.
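To illustrate why depth matters for minority species, the sketch below estimates the chance of sampling at least a given number of reads from a minority subtype at a given coverage, using a plain binomial model. Several things here are assumptions for illustration only: reads are treated as independent draws, sequencing error and amplification bias are ignored, a 1:100 mix is taken to mean a minority fraction of 1/101, and both the threshold of 10 supporting reads and the 500-fold comparison depth are arbitrary. The function name `prob_at_least` is hypothetical.

```python
def prob_at_least(depth: int, minority_fraction: float, min_reads: int) -> float:
    """P(X >= min_reads) for X ~ Binomial(depth, minority_fraction)."""
    if depth < 0 or min_reads < 0:
        raise ValueError("depth and min_reads must be non-negative")
    if not 0.0 < minority_fraction < 1.0:
        raise ValueError("minority_fraction must be strictly between 0 and 1")

    odds = minority_fraction / (1.0 - minority_fraction)
    pmf = (1.0 - minority_fraction) ** depth  # P(X = 0)
    p_fewer = 0.0
    # Sum P(X = k) for k < min_reads, updating the pmf term by term.
    for k in range(min(min_reads, depth + 1)):
        p_fewer += pmf
        pmf *= (depth - k) / (k + 1) * odds  # P(X = k + 1) from P(X = k)
    return max(0.0, 1.0 - p_fewer)


minority = 1 / 101  # assumed reading of a 1:100 mixing ratio
for depth in (500, 12_000):  # 500 is a hypothetical lower depth for comparison
    print(depth, prob_at_least(depth, minority, min_reads=10))
```

At roughly 12,000-fold coverage a 1:100 minority is expected to contribute on the order of a hundred reads, so under this model it is almost certain to be sampled many times, whereas at much lower depth the same threshold is frequently missed.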
Though it took less than two days to get from a clinical sample to DNA sequence, this is still not fast enough for routine clinical testing, he said, and bioinformatics tools to analyze the data are still being refined.
For the second pilot project, Gen-Probe PCR-amplified the HLA class I loci, yielding PCR products ranging in size from 2.2 kilobases to 3.1 kilobases.
In order to be able to phase the results, the researchers excluded reads shorter than 1,500 bases, allowing them to obtain 68-fold coverage across each HLA locus. Some of the reads extended across the entire PCR amplicon, Friedenberg said, and those long reads were "extremely powerful to call the right allele."
Both alleles were always detected in each sequencing reaction, and the most prevalent allele was always correct, while the top two matches were correct in 12 out of 18 reactions.
Although these results are promising, he said, higher coverage with long reads will be necessary "to fully enable this application," and methods for calling alleles need to be further developed.
In addition, the current raw read error rate is "challenging" for HLA genotyping due to the large number of related alleles.
Besides microbial and clinical sequencing applications, researchers are also looking to use the PacBio platform for plant genome sequencing.
For example, Dick McCombie, a professor at Cold Spring Harbor Laboratory, said that he and his colleagues plan to sequence the 16-gigabase wheat genome using PacBio's strobe sequencing and standard sequencing data combined with high-coverage Illumina sequencing. While the PacBio data will help them scaffold the genome at the 10-kilobase and 1-kilobase level, the Illumina data will provide highly accurate sequence.
Other PacBio beta-users are still at an early stage of testing their instruments. The Wellcome Trust Sanger Institute, for example, received its instrument at the end of November and "signed off on it" in mid-January, according to Harold Swerdlow, the institute's head of sequencing technology. Sanger researchers are currently testing the instrument with bacterial genomes of different GC content.
At the moment, the scientists obtain an average yield of 22 megabases in 40-minute runs, with an average read length of 1,500 bases and 85-percent raw read accuracy. Their best run has produced 28 megabases, with reads averaging 1,620 bases and 86-percent accuracy.
Have topics you'd like to see covered in In Sequence? E-mail the editor at jkarow [at] genomeweb [.] com.