By Julia Karow
A year after receiving the PacBio RS single-molecule sequencer in their labs, early-access customers from two research institutions have put their instruments through their paces and are starting to analyze "real" samples.
As users at the Wellcome Trust Sanger Institute and the National Cancer Institute become more familiar with the type of data the instrument produces, they expect to apply their machines more routinely this year for projects that can benefit from long reads.
Customers at both institutes told In Sequence recently that improvements in yield, accuracy, DNA input, and software top their wish lists, and that they look forward to using the PacBio platform to detect base modifications.
Both institutes received a pre-release instrument in November 2010 that was upgraded to the commercial version in mid-2011. They were among the last early-access users — a total of 11 — to obtain an instrument.
At the Sanger Institute, the first goal was "trying to see what it would do, absolutely everything that it could do," according to Paul Coupland, a postdoctoral researcher who has been in charge of running the instrument. This included testing the different modes of sequencing — standard, circular consensus, and strobe sequencing, which is now being replaced by long-read sequencing. It also included amplicon sequencing and "a lot of R&D work," he said. The majority of the Sanger's work on the PacBio RS is conducted in standard sequencing mode.
Only in the past few months has the Sanger team started to take "real" samples and generate data for investigators. "This is so new, people are almost scared" to hand over precious DNA samples for analysis on the PacBio, Coupland said. The data analysis also represents novel territory. "Even though we've had this for a year, we are still very much at the beginning of the workflow," he said. "It's easier to produce the data than it is to have people fully understand how to utilize the data."
Projects so far have included the improvement of genome assemblies, de novo assembly of microbial genomes less than 10 megabases in size, and amplicon sequencing of complex regions in various genomes.
Sanger has analyzed PacBio data both on its own and in combination with short-read data. Though no preferred mode of operation has emerged yet, "I think there will be an awful lot of power in hybrid assembly," Coupland said, depending on the nature of the genome in terms of size, repeats, and GC or AT content. One feature Sanger scientists have been "very pleased about" is the low bias in coverage of a genome, he added.
The institute has just started using PacBio's new C2 chemistry, due to be commercialized in the first quarter of this year, under early access. In a recent run, Sanger scientists achieved a mean read length of 2.5 to 2.9 kilobases with an accuracy of 85 percent to 86 percent. The longest mapped read had a length of 15,000 bases. Even though the number of very long reads is small, "the value of those reads is so great that people are really happy to have just a few of them," Coupland said.
The yield per SMRT cell has increased about 10 times over the last year, he said. Using phage lambda DNA, they have been able to get about 100 megabases of data and more than 50,000 mapped reads from a single SMRT cell.
The bottleneck for running the instrument constantly is the library preparation, he said, and Sanger is currently looking into setting up an automated library prep pipeline.
In terms of robustness of the hardware, the instrument has been performing "probably better than you would expect for a machine that is so incredibly complex," Coupland said.
The institute has not decided yet whether it will move the PacBio machine from the current R&D environment into high-throughput production, which will depend on factors such as demand from its faculty.
Meantime, at the sequencing facility of the Genetics and Genomics Group of the Advanced Technology Program at NCI, the PacBio RS has so far been used for amplicon sequencing of complex regions in mammalian genomes, as well as for some viral and bacterial sequencing. This involves both standard and circular consensus sequencing.
"We are in a trying-to-get-the-technology-to-work mode," said Michael Smith, the group's director. The aim is to offer core facility services on the platform within the NCI, and "the early steps are to figure out what works well," he said.
There is "reasonable interest" among NCI investigators in PacBio amplicon sequencing as well as virus sequencing, he said, and he and his team are now working on completing projects with NCI researchers.
The jury is still out on whether PacBio data will be most useful on its own or in conjunction with short-read data, he said, but combining the two "makes a lot of sense."
Challenges so far have included "the normal hiccups" expected from an early instrument, Smith said, though NCI's experience has been no different than with other early platforms, for example the Illumina Genome Analyzer or "various [Applied Biosystems] machines in the past."
"The way to make the machine work well is still being worked out by the company and by us," he said. In addition, the nature of the data — in particular the fact that the accuracy is only around 85 percent — requires new bioinformatics tools that are only now coming online.
Smith said that the performance of the instrument at NCI, which does not yet have access to the C2 chemistry, has been "certainly in line" with PacBio's specifications, with an average read length of about 2 kilobases and about 50,000 reads per SMRT cell.
Smith's group has put "some work" into automating the PacBio library prep, and plans to put more effort into this once projects get bigger. For now, he said, manual library prep is easier.
For the Sanger researchers, an increase in yield tops their wish list for improvements of the PacBio machine because this would open the door for additional projects.
For example, a four-fold increase in yield per SMRT cell, to 500 megabases, would enable them to sequence the genome of the malaria parasite Plasmodium falciparum to about 20x coverage instead of only 5x. The P. falciparum genome is very AT-rich, which has been an issue for other sequencing technologies but not the PacBio, and Sanger's malaria team would benefit from a higher yield of the platform, according to Harold Swerdlow, head of sequencing technology at the Sanger Institute.
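The coverage figures quoted above follow from a simple ratio of sequenced bases to genome size. As a rough sketch only: the genome size of about 23 megabases used below is an assumption that does not appear in this article, the function name is hypothetical, and the calculation treats every sequenced base as mappable, which real runs will not achieve.

```python
# Rough fold-coverage arithmetic for one SMRT cell.
# ASSUMPTION: a P. falciparum genome size of ~23 Mb; this figure is not
# given in the article and is used here for illustration only.
P_FALCIPARUM_GENOME_MB = 23.0


def fold_coverage(yield_mb: float, genome_mb: float) -> float:
    """Return average depth of coverage: sequenced bases / genome size."""
    if genome_mb <= 0:
        raise ValueError("genome size must be positive")
    return yield_mb / genome_mb


if __name__ == "__main__":
    # ~100 Mb per SMRT cell is the yield Sanger reported with phage lambda DNA;
    # 500 Mb is the target yield on the wish list.
    for yield_mb in (100.0, 500.0):
        depth = fold_coverage(yield_mb, P_FALCIPARUM_GENOME_MB)
        print(f"{yield_mb:.0f} Mb per cell -> about {depth:.1f}x coverage")
```

Under these assumptions, the two yields work out to roughly 4x and 22x, in line with the "only 5x" and "about 20x" figures cited.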
A separate group from the Sanger Institute has been working on a protocol on the Illumina platform to minimize bias when sequencing AT-rich genomes such as that of P. falciparum (see story, this issue).
Higher accuracy would also enable additional applications. "We can definitely live with the accuracy that we have now, but higher accuracy would open up other opportunities," Swerdlow said. For example, he noted, some assemblies currently use both PacBio long reads and more accurate short reads, "and if you had highly accurate PacBio data, you would not have to do that."
Coupland added that PacBio data mainly contains indel errors, and very few substitution errors, noting that it is "pretty easy" to remove the indel errors "and get a conceptually higher accuracy read."
He also looks forward to using the platform to detect base modifications, such as methylation, and Sanger is working with PacBio on this application under early access. "If we could really begin to get single-base-resolution, real-time epigenetic information, that does open up a whole kind of new field of biology," Coupland said.
Analyzing base modifications is currently "very hard" to do with other technologies, Swerdlow said. However, human whole-genome methylation analysis — which is "what everybody wants to do" — is not yet possible with the platform's current throughput, he cautioned.
According to NCI's Smith, base modification analysis is "certainly a very attractive feature of the machine," and for some modifications, such as hydroxymethylcytosine, PacBio has "the only assay out there right now."
For NCI, wishes for improvements include access to the C2 chemistry, along with a reduction of the amount of starting DNA, which is currently on the order of micrograms. In addition, Smith would like to have better software. Data with 85 percent accuracy "is a challenge," he said, "but it is solvable."