This article has been updated to include corrections to the previously reported sequencing modes of the PacBio platform as well as changes to the long form of the acronym BLASR.
As Pacific Biosciences prepares to ship its PacBio RS single-molecule real-time sequencing system later this quarter, the company has released SMRT Analysis, an open-source, secondary-analysis software suite designed to handle the system's long read data.
Kevin Corcoran, PacBio's senior vice president for systems research and development, told BioInform this week that by making the software available under an open-source license, the company hopes to "accelerate the development of software and informatics tools for third-generation sequence data."
The software suite is based on open-source and proprietary algorithms and includes web-based software to facilitate analysis, an analysis-pipeline framework, algorithms for alignment and de novo assembly, as well as a set of visualization tools.
Corcoran said that the firm developed the suite to cater to the new features that characterize its third-generation sequencing platform and the kinds of sequence data it produces.
These features include long read lengths, expected to range between 850 and 1,500 bases; high granularity, which makes it possible to run multiple samples at a time; two new sequencing modes in addition to standard sequencing, namely circular consensus and strobe sequencing; and kinetic information, which the firm says can provide data about modifications in DNA and RNA gathered during the sequencing process.
The company has already released APIs, available via DevNet, its developer network, that allow its software to be integrated with other tools.
PacBio launched DevNet last summer to support third-party development of informatics tools and standards for its platform. In addition to APIs, the site provides access to data sets, source code, conversion tools, and documentation related to SMRT sequencing (BI 07/09/2010).
Currently, DevNet has between 800 and 1,000 members and interest is rising, Edwin Hauw, PacBio's senior product manager for software and informatics, told BioInform.
He also said that several software vendors are using resources on DevNet to develop commercial solutions that will be compatible with PacBio data. He did not name companies involved in these efforts.
Jon Sorenson, PacBio's director of secondary analysis, told BioInform that one of the "guiding principles" in the software-development process was to ensure that the data could be converted into standard industry formats from the company's internal format, which enables researchers to capture kinetic information.
The first component of the suite, SMRT Portal, is a browser-based application that lets users create, submit, and monitor secondary-analysis jobs and view and download the results in standard SAM/BAM and VCF formats.
The tool also includes algorithms to align reads to a reference sequence or assemble reads into a de novo genome. Users can manually set up their secondary-analysis pipeline through the portal or directly through a run design software dubbed RS Remote.
Underlying the secondary analysis capabilities of the suite is a Python-based framework called SMRT Pipe. Other tools in the suite include SMRT View, a genome browser that lets users visualize and interact with the data. It includes graphical representations of variants, quality values and other metrics, along with annotations.
Sorenson said that in addition to visualizing reads in a standard genome browser fashion, the visualization tool lets users layer unique features of PacBio's data such as kinetic information. Furthermore, he said, users can visualize data generated by the strobe and circular consensus sequencing modes on the platform, a capability which he said other browsers do not yet have.
Underlying SMRT Analysis are several algorithms, one of which is the Basic Local Alignment with Successive Refinement algorithm, or BLASR, which maps reads to genomes by finding the highest scoring local alignment or set of local alignments between the read and the genome.
Sorenson said the algorithm takes currently used approaches, such as suffix arrays and dynamic programming, and "puts them together in a way that hasn't been done before."
The initial set of candidate alignments is found by querying a pre-computed index of the reference genome, and the set is then refined until only high-scoring alignments are retained. The base assignment in alignments is optimized and scored using all available quality information, such as insertion and deletion quality values. Because the alignment approximates an exhaustive search, alignment significance may be computed by comparing the optimal alignment score to the distribution of all other significant alignment scores.
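The general seed-and-refine pattern described above can be illustrated with a short Python sketch. This is not BLASR's code: the function names, the naive suffix-array construction, the k-mer seeding, and the scoring values are all assumptions made for illustration, and quality values are ignored. It only shows how a pre-computed index can propose candidate positions that a dynamic-programming step then rescores.

```python
def build_suffix_array(ref):
    # Naive construction (hypothetical helper); adequate only for toy references.
    return sorted(range(len(ref)), key=lambda i: ref[i:])


def find_seeds(ref, sa, kmer):
    # Binary search the suffix array for the first suffix starting with kmer.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + len(kmer)] < kmer:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and ref[sa[lo]:sa[lo] + len(kmer)] == kmer:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


def local_align_score(a, b, match=2, mismatch=-1, gap=-2):
    # Smith-Waterman local alignment score with linear gap penalties.
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def map_read(ref, sa, read, k=4, pad=2):
    # Step 1: exact k-mer seeds from the index propose candidate start positions.
    candidates = set()
    for off in range(len(read) - k + 1):
        for hit in find_seeds(ref, sa, read[off:off + k]):
            candidates.add(max(0, hit - off))
    # Step 2: rescore each candidate window with dynamic programming.
    scored = []
    for start in candidates:
        window = ref[max(0, start - pad):start + len(read) + pad]
        scored.append((local_align_score(read, window), start))
    # Keep the highest score; ties go to the leftmost candidate.
    return max(scored, key=lambda t: (t[0], -t[1])) if scored else None


ref = "ACGTTGCATGCCGATTAGGCTAACG"
sa = build_suffix_array(ref)
exact_hit = map_read(ref, sa, "TTGCATGC")      # read copied from the reference
deletion_hit = map_read(ref, sa, "GCCGATAGG")  # same region with one base dropped
```

Each call returns a (score, candidate start) pair, or None if the read shares no k-mer with the reference. A production aligner would also report the alignment itself and would use the per-base quality values that the article mentions.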
Sorenson said BLASR performed better than other long-read alignment algorithms, such as BWA-SW, which, under specified parameters, can align PacBio data. He also said that BLASR outperforms some well-known sequence alignment algorithms including Blast, MUMmer, Exonerate, and Blat, and noted that the firm plans to publish a paper providing specific benchmark details for BLASR as compared to other methods.
Another tool, Allora, short for "a long read assembler," is PacBio's de novo assembly algorithm. Based on the open source assembly software package AMOS as well as other components tailored to PacBio’s long reads and error profile, Allora uses an overlap-layout-consensus approach to iteratively assemble raw reads into contigs and then outputs them as Fasta sequence and cmp.h5 files.
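For readers unfamiliar with overlap-layout-consensus, the following Python sketch shows the idea in its simplest form. It is a toy written for this explanation, not Allora or AMOS code: the function names are hypothetical, overlaps must match exactly, reads are assumed error-free, and the consensus step collapses into plain string merging.

```python
def overlap_len(a, b, min_len=3):
    # Length of the longest suffix of a that equals a prefix of b (0 if below min_len).
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    # Drop duplicate reads while keeping their order.
    contigs = list(dict.fromkeys(reads))
    while True:
        # Overlap step: find the pair with the longest suffix-prefix overlap.
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to merge
        # Layout and (trivial) consensus step: merge the pair into one contig.
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


reads = ["TGCAATGCC", "ATGGCGTG", "GCGTGCAA"]
contigs = greedy_assemble(reads)
```

Real long-read data would need approximate overlaps and a true consensus step that reconciles disagreeing bases, which is where an assembler tailored to a platform's error profile differs from this sketch.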
PacBio also provides an algorithm named AHA for hybrid de novo assembly. Sorenson explained that the tool allows users to create longer scaffolds from short contigs generated and assembled using data from sequencers such as the Illumina and SOLiD platforms, as well as to fill in gaps in the sequence.
A final component of the suite, EviCons, produces consensus sequences from multiple sequence alignments generated from resequencing reads or contigs. The tool uses probabilities and a likelihood ratio test to separate alignments into regions of certainty and uncertainty and then uses base quality values and the Steiner framework to produce the best estimate of the local consensus sequence for uncertain regions.
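The idea of using base quality values and a likelihood ratio to flag uncertain positions can be sketched in Python as follows. This is an illustrative simplification, not EviCons: it treats each alignment column independently, assumes Phred-scaled qualities with errors spread evenly over the other symbols, uses an arbitrary threshold (`min_llr`), and does not attempt the Steiner-based re-estimation of uncertain regions. All names are hypothetical.

```python
import math

BASES = "ACGT-"  # four nucleotides plus a gap symbol


def column_log_likelihoods(column):
    # column: list of (base, phred_quality) observations for one alignment column.
    scores = {}
    for cand in BASES:
        total = 0.0
        for base, q in column:
            # Phred quality to error probability, capped so the logs stay finite.
            p_err = min(10 ** (-q / 10), 0.75)
            if base == cand:
                total += math.log(1 - p_err)
            else:
                # Assumption: an error is equally likely to yield any other symbol.
                total += math.log(p_err / (len(BASES) - 1))
        scores[cand] = total
    return scores


def call_consensus(columns, min_llr=3.0):
    consensus, uncertain = [], []
    for idx, col in enumerate(columns):
        ranked = sorted(column_log_likelihoods(col).items(),
                        key=lambda kv: kv[1], reverse=True)
        (best, best_ll), (_, second_ll) = ranked[0], ranked[1]
        consensus.append(best)
        # Log-likelihood ratio between the two best candidates.
        if best_ll - second_ll < min_llr:
            uncertain.append(idx)
    # Gap calls are dropped from the final sequence.
    return "".join(consensus).replace("-", ""), uncertain


columns = [
    [("A", 30), ("A", 30), ("A", 20)],  # unanimous, high quality
    [("C", 10), ("G", 10)],             # two reads disagree at equal quality
    [("T", 30), ("T", 25), ("G", 5)],   # one low-quality dissenting read
]
sequence, uncertain_columns = call_consensus(columns)
```

In this example only the middle column is flagged, because the two competing bases are equally well supported. A fuller method would then re-examine such regions jointly rather than column by column.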
Hauw said that participants in DevNet have provided some input in the development process of the analysis suite, although suggestions from early-access customers of the sequencing platform were more substantial than those from clients who had not used the software.
PacBio last year launched a partner program for vendors providing software, hardware, IT services, consumables, automation systems, and complementary instruments. Informatics partners include DNAnexus, Biomatters, DNASTAR, Genomatix, Amazon Web Services, BioTeam, CLC Bio, GenoLogics, GenomeQuest, and Geospiza (BI 02/19/2010).
While there is some overlap in terms of capabilities provided by participants in the company's partner program, Hauw said the company doesn’t see them as competition because these tools don’t offer the kind of support needed for third-generation sequencing.
Some third-party platforms are already compatible with PacBio data. For example, researchers from Pennsylvania State and Emory Universities have integrated the Galaxy platform with the company's sequencer and analysis software. In a video, Anton Nekrutenko, an associate professor of biochemistry and molecular biology and a member of the Galaxy team, provides a demonstration of the integration.
Hauw added that other open source programs are looking to support PacBio data but he declined to mention specific names, stating that these groups have not given permission for their names to be released.