NEW YORK (GenomeWeb) – Pacific Biosciences recently unveiled plans to change the output of its instrument and analysis software packages from the Hierarchical Data Format (HDF5) to the more commonly used Binary Alignment/Map (BAM) file format for both raw and aligned sequencing reads.
The change will affect both the software that runs on the PacBio RS II sequencing instrument and the company's SMRT Analysis software for secondary analysis. PacBio has begun working on an updated version of the instrument software (which captures reads as they are generated and calls bases) that will produce BAM files as output, as well as on an update for the SMRT Analysis software to both accept and produce BAM files, Tzvetana Kerelska, PacBio's senior product manager informatics, told GenomeWeb.
Although the company has not set an official date for when the change will occur, PacBio has been giving advance notice to customers and members of its DevNet community via various forums and announced the plan at its developer conference last month. The company has also posted details about the specifications for the new format so that users and developers can familiarize themselves with it and prep for the change.
PacBio is making the switch because of BAM's widespread use for genomic data and because improvements to the format over the years have made it more amenable for hosting various kinds of data that PacBio provides. The main reason the company initially chose to use HDF5, according to Kerelska, was because it wanted to provide additional kinds of data from its sequencing procedure beyond just base calls.
Besides base calls and associated quality scores, PacBio's primary software, which is coupled with the company's sequencing instrument, captures kinetic information: data related to the speed with which nucleotides are incorporated during sequencing. Kinetics data provides insights into epigenomic changes in the genome because it can help to detect chemical base modifications, Kerelska explained. The software also captures information about sequencing errors and data processing and provides quality scores for each insertion, deletion, and miscall in the base-called data, she said.
When PacBio first brought its sequencers to market a few years ago, formats like BAM were not set up to record information beyond base calls, according to Kerelska. "There was really no standard ... nothing that we could use to record the data," she said. "At that time, we decided that the HDF5 type of file format [was] best suited for what we have and what we want to save and propagate throughout the analysis." But over time, BAM emerged as a standard format for genomic data and has matured to a point where it can hold the information that PacBio includes in its output in terms of both the aligned and unaligned data, Kerelska said.
There is a business benefit for PacBio as well. The company works closely with the open-source community and has several ongoing collaborations and projects focused on developing tools for its data. Adopting the standard that the community works and plays with "will further enable this relationship and will make processing our data much more straightforward for either third party tools or tools that are developed from [the] bioinformatics community," Kerelska said. "It makes it kind of a very natural decision to embrace this file format and use it."
What makes the difference?
Currently, the primary analysis software on PacBio's instrument produces bax.h5 files, which contain the base calls, quality scores, and kinetic data; and bas.h5 files, which contain the information necessary to assign the data to individual zero-mode waveguides (ZMW) by hole number. These files serve as inputs to the SMRT Analysis software, which handles the sequence mapping and variant calling and generates a cmp.h5 file if the user maps the data directly to a reference genome.
In an alternate experimental mode, a so-called circular consensus sequence (CCS) algorithm is applied to the data in bax.h5 files in an initial mapping step. The output from this step, a ccs.h5 file, then serves as the input to the SMRT Analysis software which — after alignment to the reference — generates the final cmp.h5 file.
With the change that PacBio is introducing, reads coming off the instrument — following signal processing and base-calling — will be collected in an unaligned BAM file and will include the kinetic information along with the various quality scores provided for the bases, Kerelska said. That will serve as the input to SMRT Analysis, which will produce the aligned BAM file. In other experiments, the BAM file is produced after the CCS algorithm has been run, and that serves as the input to the SMRT Analysis software.
"It's a much more simplified workflow in terms of the data and how the data looks like," she said, and it is more convenient for researchers looking to run other software applications post sequencing. They can either grab unaligned BAM files fresh off the sequencer or wait until after the SMRT Analysis software has run, she said.
PacBio will phase out HDF5's use gradually, allowing users time to transition to the new format, Kerelska said. The updated version of SMRT Analysis — version 3.0 of the software — will remain compatible with HDF5, in addition to being able to generate BAMs, giving users the option to process data in either file format. It will also include a converter tool that will enable users to convert bax.h5 files to the BAM format prior to processing. The updated software will also include enhancements that enable different modules to exchange data as BAM files. Other planned updates for SMRT 3.0 include a new dataset abstraction and associated reference APIs, scalability improvements for assembly and resequencing, and modelling improvements.
PacBio has built a community around its secondary and tertiary analysis portfolio, which includes the SMRT Analysis software, allowing external developers to contribute algorithms and methods to its platform. Generally speaking, the format change should not affect community-developed tools much, Kerelska said, as many of them already use BAM files, since third-party developers also work with sequence data from other instruments. If anything, the change should simplify analysis workflows and make them more efficient, she said.
Change can be good
At the recent PacBio developer conference, users seemed receptive to the planned change. "This was definitely something that the audience really listened to with great interest," Kerelska said.
Researchers can see both benefits and potential challenges with the switch. The HDF5 file format offers much richer mechanisms and syntax for capturing data and metadata than BAM does, and a main concern is whether the less detailed format offers a comparable level of data representation, Keith Robison, a computational biologist and author of the Omics! Omics! blog, told GenomeWeb. Robison wrote a post on his blog a few weeks back discussing the plethora of bioinformatics formats that litter the field. Besides PacBio, Oxford Nanopore also uses the HDF5 format to store data from its instrument. The company declined to comment on its use of the format for this article.
PacBio's BAM specifications, as they currently stand, provide insight into how the company plans to capture its more specialized datasets. "The mechanism they are using is the read tags field — they have created a whole bunch of specialized tags which will carry values for the read," Robison explained. "There are some clear hacks here. For example, they appear to be planning to fill the quality score with a standard value and then introduce several new quality metric vectors as tag-value pairs — thus retaining richness, but at the expense of just generating dummy information in the standard field."
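The tag mechanism Robison describes can be sketched against the text (SAM) form of a record, where optional fields follow the 11 mandatory columns as TAG:TYPE:VALUE. The snippet below is a minimal illustration only: the tag names xi, xd, and xk, the "!" placeholder quality, and the numeric values are all invented for this example and are not taken from PacBio's specification.

```python
# Sketch only: tag names (xi, xd, xk) and all values are hypothetical
# placeholders, not PacBio's actual tags or data.

def parse_optional_field(field):
    """Parse one SAM optional field of the form TAG:TYPE:VALUE."""
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        return tag, int(value)
    if type_code == "f":
        return tag, float(value)
    if type_code == "B":
        # Array: the first element is the numeric subtype, the rest are values.
        subtype, *items = value.split(",")
        convert = float if subtype == "f" else int
        return tag, [convert(item) for item in items]
    return tag, value  # A, Z and H types are kept as text


def parse_sam_record(line):
    """Split a SAM text line into a few mandatory columns plus a tag dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 11:
        raise ValueError("a SAM record needs 11 mandatory columns")
    tags = dict(parse_optional_field(f) for f in columns[11:])
    return {
        "name": columns[0],
        "flag": int(columns[1]),
        "seq": columns[9],
        "qual": columns[10],
        "tags": tags,
    }


# Unaligned read (flag 4) with a constant placeholder in the QUAL column.
EXAMPLE = "\t".join([
    "read_1", "4", "*", "0", "0", "*", "*", "0", "0",
    "ACGT", "!!!!",
    "xi:B:C,20,18,25,30",   # hypothetical per-base insertion qualities
    "xd:B:C,22,19,24,28",   # hypothetical per-base deletion qualities
    "xk:B:S,12,8,40,5",     # hypothetical per-base kinetic values
])

record = parse_sam_record(EXAMPLE)
# Each per-base vector should be as long as the read itself.
assert all(len(v) == len(record["seq"]) for v in record["tags"].values())
print(record["qual"], record["tags"]["xi"])
```

A tool that only reads the standard columns would see the uninformative placeholder qualities, while a tag-aware tool can recover the richer per-base vectors, which is the trade-off Robison points to.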
Also, "you can see some other cases of cramming in here," he continued. "One issue with BAM is that it has just names for reads, so the relationships between reads or metadata about reads (in this case, which ZMW emitted the read) are encoded by yet another convention." In contrast, "in the Illumina world, there is a cacophony of different conventions for relating that read 1 [and] read 2 are forward/reverse, and different tools balk at different conventions, and then new schemes, such as 10X Genomics don't fit in that at all," he said.
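To make the naming-convention point concrete, here is a small sketch of how a tool might recover the originating ZMW from a read name. The article does not spell out the convention, so the "movie/hole/start_end" layout, the function names, and the sample read names below are assumptions made purely for illustration.

```python
from collections import defaultdict

# Sketch only: assumes read names look like "movie/holeNumber/start_end".
# That layout is an assumption for this example, not quoted from the spec.


def zmw_key(read_name):
    """Return (movie, hole_number) parsed from the assumed name layout."""
    parts = read_name.split("/")
    if len(parts) < 2 or not parts[1].isdigit():
        raise ValueError(f"name does not follow the assumed layout: {read_name!r}")
    return parts[0], int(parts[1])


def group_by_zmw(read_names):
    """Group read names by the ZMW that the name says emitted them."""
    groups = defaultdict(list)
    for name in read_names:
        groups[zmw_key(name)].append(name)
    return dict(groups)


names = ["movieA/17/0_450", "movieA/17/495_930", "movieA/42/0_1200"]
groups = group_by_zmw(names)
print(groups[("movieA", 17)])
```

Because the relationship lives only in the name string, any tool that renames reads or expects a different layout silently loses it, which is the fragility Robison is describing.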
"The document also describes different information that simply won't be in the field BAM files," he added. "Some of these are available if written in 'PacBio internal' mode, others just haven't been worked out."
Because PacBio is still working on the specifications, it's difficult for researchers to comment in great depth on their benefits and weaknesses.
Perhaps the strongest argument in BAM's favor is its ubiquity. "I would consider [BAM] an improvement for usability," Adam English, lead of production and R&D for sequencing informatics at Baylor College of Medicine's Human Genome Sequencing Center, told GenomeWeb in an email. "The HDF5 format is better at preserving complex data, but the BAM has a legacy of being the conventional format in bioinformatics [and] thus [has] much more support and understanding throughout the broader community."
Moreover, "the average analysis may have no use for, or at least no knowledge on how to leverage, IPD or pulse-width information, but much more commonly needs things like alignment information," English added. With BAM, "both types of information are available without the high barrier of entry present in HDF5."
"It's going to make it much easier to share data," Ali Bashir, an assistant professor in the Department of Genetics and Genomic Sciences at the Icahn School of Medicine at Mount Sinai, told GenomeWeb. "I think it's going to bring in many more researchers who previously were afraid to touch the data," he said, and it simplifies matters for regular PacBio users who use the BAM format for other tools. "It will be nice to have one less file that you are storing on the clusters, wasting space," he said.
In terms of the specifications, "the nice thing [PacBio] did was, they overloaded the BAM format to some extent, but they didn't do anything that broke the spec," Bashir said. "They haven't branched the BAM spec to support PacBio data. They've used all the fields available to them, but they are still keeping it within the BAM spec. That's going to be very useful, because now all the tools that are consistent with the BAM spec will work with it."
With the new format, users can expect some speed improvements, with a few exceptions. Because base values are not going to be indexed in the same way, operations in some applications, most likely in epigenetics, might run slower, Bashir said. However, it will be faster to load datasets into a lot of existing infrastructure, which is largely developed with the BAM format in mind, he added.
There are also good toolsets for validating BAMs, which adds to the format's appeal, Robison noted. "One of the things I really like about XML and HDF5 [for instance] is, there is a way to test that the file is grammatically correct," he said. "You can write up a meta file that says what a correct file ... looks like, so I can validate a file." That's also true of BAM, which comes with toolsets that help enforce correct grammar, he said. It's a step up from formats like FASTA, which have more malleable definitions of grammatically correct inputs and outputs, making it challenging to standardize mechanisms for validating files produced in these formats.
But nothing is perfect. "There are multiple unknowns," English said. For one thing, "you'll have to consider factors like 'is this an identical [BAM] format that could be seamlessly plugged into existing tools?'" he explained. There is also a possibility that the PacBio BAM would not work well with existing visualization tools, such as IGV or parsers like Pysam, he added.
There could also be issues with file sizes, according to English. Cmp.h5 files are approximately the same size as the BAM files produced by normal SMRT Analysis protocols; however, the BAM currently produced doesn't carry as much information as the proposed PacBio-BAM format, he said. "This extra information could really expand the BAM file's size and may increasingly become a concern as PacBio continues to improve [its] throughput."
For Bashir, one of the downsides is the transition cost. "There's going to be a period where [some] folks haven't transferred their pipelines over to the new format so you are going to have a little bit of fragmentation for a while," he said. But, "there's already fragmentation in the PacBio community because some people are already using a BAM-like format, so that's not as big of a deal." Also, most community-developed tools can already work with other standard formats such as BAM, FASTA, and FASTQ, he added.
Robison worried that some information may get lost in translation with the new BAM format, but Bashir said information loss might not be an issue. While he has not examined the PacBio specs in great detail, at least on the surface, he believes they capture all of the necessary information. "I'm someone who uses most of the features in the data, and I haven't felt that I'm losing particularly much of anything as of yet," he told GenomeWeb. "It might not be oriented in the most efficient way for every operation ... but I don't think they are really losing all that much."